Abstract
The majority of Arabic text available on the web is written without short vowels (diacritics). Diacritics are commonly used in religious scripts such as the holy Quran (the book of Islam) and Al-Hadith (the teachings of Prophet Mohammad (PBUH)), in children’s literature, and in some words where ambiguity of articulation might arise. Arabic Internet users may fail to retrieve credible sources of Arabic text if their queries do not match the diacritical marks attached to the words in the collection. However, typing the diacritical marks is tedious and time consuming, while ignoring them introduces ambiguity. Previous work suggested pre-processing Arabic text to remove these diacritical marks before indexing. Consequently, there are noticeable discrepancies when searching the web for Arabic text using international search engines such as Google and Yahoo. In this article, we propose a framework that uses query expansion techniques to enhance the retrieval effectiveness of search engines for both diacritized and diacritic-less Arabic text. We used a rule-based stemmer and a semantic relational database compiled into an experimental thesaurus to perform the expansion. We tested our approach on the text of the Quran. We found that query expansion for searching Arabic text is promising, and it is likely that the efficiency can be further improved by advanced natural language processing tools.
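To make the idea concrete, the following is a minimal sketch of diacritic-insensitive matching combined with thesaurus-based query expansion. It is not the article's implementation: the function names, the toy thesaurus, and the choice of Unicode range (U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for illustration, and the rule-based stemmer mentioned in the abstract is omitted.

```python
import re

# Assumed set of Arabic diacritical marks: fathatan through sukun
# (U+064B-U+0652) plus the superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the assumed diacritical marks removed."""
    return _DIACRITICS.sub("", text)


def expand_query(term: str, thesaurus: dict) -> list:
    """Expand a query term with related terms from a (hypothetical) thesaurus.

    The thesaurus maps a diacritic-less term to a list of related terms.
    The result starts with the normalized term and contains no duplicates.
    """
    base = strip_diacritics(term)
    expanded = [base]
    for related in thesaurus.get(base, []):
        normalized = strip_diacritics(related)
        if normalized not in expanded:
            expanded.append(normalized)
    return expanded


def matches(query_terms: list, document: str) -> bool:
    """True if any expanded term occurs in the diacritic-less document."""
    plain = strip_diacritics(document)
    return any(term in plain for term in query_terms)


# Toy thesaurus entry (hypothetical data, for illustration only).
TOY_THESAURUS = {"\u0643\u062A\u0628": ["\u0643\u062A\u0627\u0628"]}
```

With this sketch, a fully diacritized query and a diacritic-less query normalize to the same string, so both retrieve the same documents regardless of how the collection itself is vowelized.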
Abbreviations
- AIR: Arabic information retrieval
- AWN: Arabic WordNet
- PBUH: Peace be upon him
- CLIR: Cross language information retrieval
- IE: Information extraction
- IR: Information retrieval
- LSI: Latent semantic indexing
- MSA: Modern standard Arabic
- MT: Machine translation
- NLP: Natural language processing
- POST: Part of speech tagging
- QA: Question answering
- QARAB: Question answering system for Arabic
- RDBMS: Relational database management system
- QE: Query expansion
- SQL: Structured query language
- TS: Text summarization
- VR: Verses retrieved
- VRQ: Verses relevant to query
- VRS: Verses retrieved using stemmer
- VRT: Verses retrieved using thesaurus
- VRW: Verses retrieved using words
- VSM: Vector space model
- WSD: Word sense disambiguation
Acknowledgment
We would like to thank Shereen Khoja for providing her stemmer, Prof. Nadim Obeid for his valuable suggestions to improve this work, and Mahmoud El-Hajj for helping with the construction of the thesaurus and the database implementation.
Cite this article
Hammo, B.H. Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents. Inf Retrieval 12, 300–323 (2009). https://doi.org/10.1007/s10791-008-9081-9
DOI: https://doi.org/10.1007/s10791-008-9081-9