Query Translation and Expansion for Searching Normal and OCR-Degraded Arabic Text

Elghazaly, Tarek; Fahmy, Aly

doi:10.1007/978-3-642-00382-0_39

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper presents a novel model for English/Arabic query translation to search Arabic text, and then expands the Arabic query to handle OCR-degraded Arabic text. The model detects and translates word collocations, translates single words, transliterates names, and disambiguates translations and transliterations through different approaches. It also expands the query with the expected OCR errors generated by the Arabic OCR-error simulation model proposed in the paper. The query translation and expansion model is supported by several resources proposed in the paper, such as a word collocations dictionary, single-word dictionaries, a Modern Arabic corpus, and other tools. The model achieves high accuracy in translating queries from English to Arabic, resolving translation and transliteration ambiguities, and, with orthographic query expansion, a high degree of accuracy in handling OCR errors.
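
The orthographic query expansion step can be illustrated with a minimal sketch. It is not the authors' implementation: the confusion table `CONFUSIONS`, the function names `expand_term` and `expand_query`, and the `max_variants` cap are all assumptions made for illustration, and the three character confusions are hand-picked examples rather than output of the paper's OCR-error simulation model. Characters are written as Unicode escapes so the listing is not affected by right-to-left rendering.

```python
from itertools import product

# Hypothetical confusion table: correct character -> plausible OCR misreadings.
# Illustrative only; the paper derives expected errors from its own simulation model.
CONFUSIONS = {
    "\u064a": ["\u0649"],  # YEH -> ALEF MAKSURA (dots lost)
    "\u0629": ["\u0647"],  # TEH MARBUTA -> HEH (dots lost)
    "\u0623": ["\u0627"],  # ALEF WITH HAMZA ABOVE -> bare ALEF
}


def expand_term(term, confusions=CONFUSIONS, max_variants=20):
    """Return the original term followed by variants in which confusable
    characters are replaced by their assumed OCR misreadings."""
    # For each position, the original character comes first, then its confusions.
    options = [[ch] + confusions.get(ch, []) for ch in term]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        if len(variants) >= max_variants:  # cap combinatorial growth
            break
    return variants


def expand_query(terms, **kwargs):
    """Map each translated query term to its list of orthographic variants."""
    return {term: expand_term(term, **kwargs) for term in terms}


if __name__ == "__main__":
    # MEEM KAF TEH BEH TEH-MARBUTA ("library")
    word = "\u0645\u0643\u062a\u0628\u0629"
    for variant in expand_term(word):
        print(variant)
```

In a retrieval setting the variants of one term would typically be combined as alternatives (a logical OR) so that a document containing a misrecognized form can still match. The cap on variants is a design choice to keep long terms with many confusable characters from producing an unmanageable query.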

Author information

Authors and Affiliations

Faculty of Computers and Information, Cairo University, Giza, Egypt
Tarek Elghazaly & Aly Fahmy

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Elghazaly, T., Fahmy, A. (2009). Query Translation and Expansion for Searching Normal and OCR-Degraded Arabic Text. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_39

DOI: https://doi.org/10.1007/978-3-642-00382-0_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)
