Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents

Abstract

The majority of Arabic text available on the web is written without short vowels (diacritics). Diacritics are commonly used in religious texts such as the holy Quran (the book of Islam) and Al-Hadith (the teachings of Prophet Mohammad (PBUH)), in children’s literature, and in words where ambiguity of articulation might arise. Arabic-speaking Internet users might fail to retrieve credible sources of Arabic text if their queries do not match the diacritical marks attached to the words in the collection. However, typing diacritical marks is tedious and time-consuming. The alternative is to ignore these marks, which introduces ambiguity. Previous work suggested pre-processing Arabic text to remove the diacritical marks before indexing. Consequently, there are noticeable discrepancies when searching the web for Arabic text using international search engines such as Google and Yahoo. In this article, we propose a framework that uses query expansion techniques to enhance the retrieval effectiveness of search engines for both diacritic and diacritic-less Arabic text. We used a rule-based stemmer and a semantic relational database compiled in an experimental thesaurus to perform the expansion. We tested our approach on the text of the Quran. We found that query expansion for searching Arabic text is promising, and it is likely that the efficiency can be further improved by advanced natural language processing tools.
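The following minimal Python sketch illustrates the two ideas mentioned in the abstract: removing diacritical marks before matching, and expanding a query with related terms. It is not the article's implementation. The function names, the toy thesaurus dictionary, and the choice of Unicode range (U+064B to U+0652, plus the superscript alef U+0670) are assumptions made for illustration, and the stemming step is left out entirely.

```python
import re

# Assumption: the marks to ignore are the Arabic tanwin, short vowels,
# shadda and sukun (U+064B..U+0652) plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the diacritical marks listed above removed."""
    return DIACRITICS.sub("", text)


def expand_query(term: str, thesaurus: dict) -> set:
    """Return the diacritic-less term plus any related terms.

    `thesaurus` is a hypothetical mapping from a diacritic-less term to a
    list of related terms; it stands in for the semantic database that the
    abstract mentions.
    """
    base = strip_diacritics(term)
    variants = {base}
    variants.update(strip_diacritics(t) for t in thesaurus.get(base, []))
    return variants


def search(term: str, documents: list, thesaurus: dict) -> list:
    """Return indices of documents containing any expanded query term."""
    variants = expand_query(term, thesaurus)
    hits = []
    for index, doc in enumerate(documents):
        # Compare on diacritic-less, whitespace-separated tokens.
        tokens = set(strip_diacritics(doc).split())
        if variants & tokens:
            hits.append(index)
    return hits
```

With this sketch, a diacritic-less query can match a fully diacritized document token, and an entry in the toy thesaurus widens the match to related words. A real system would also need to handle clitics and inflection, which is where a stemmer comes in.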

Notes

  1. http://www.google.com.

  2. http://www.yahoo.com.

  3. http://www.msn.com.

  4. http://www.ayna.com.

  5. http://www.internetworldstats.com.

  6. http://www.isoc.org.

Abbreviations

AIR: Arabic information retrieval
AWN: Arabic WordNet
PBUH: Peace be upon him
CLIR: Cross-language information retrieval
IE: Information extraction
IR: Information retrieval
LSI: Latent semantic indexing
MSA: Modern standard Arabic
MT: Machine translation
NLP: Natural language processing
POST: Part-of-speech tagging
QA: Question answering
QARAB: Question answering system for Arabic
RDBMS: Relational database management system
QE: Query expansion
SQL: Structured query language
TS: Text summarization
VR: Verses retrieved
VRQ: Verses relevant to query
VRS: Verses retrieved using stemmer
VRT: Verses retrieved using thesaurus
VRW: Verses retrieved using words
VSM: Vector space model
WSD: Word sense disambiguation

Acknowledgment

We would like to thank Shereen Khoja for providing her stemmer, Prof. Nadim Obeid for his valuable suggestions to improve this work, and Mahmoud El-Hajj for helping with the construction of the thesaurus and the database implementation.

Author information

Corresponding author

Correspondence to Bassam H. Hammo.

About this article

Cite this article

Hammo, B.H. Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents. Inf Retrieval 12, 300–323 (2009). https://doi.org/10.1007/s10791-008-9081-9

  • DOI: https://doi.org/10.1007/s10791-008-9081-9

Keywords