Abstract
This paper presents a novel model for English/Arabic query translation for searching Arabic text, and then expands the Arabic query to handle OCR-degraded Arabic text. The model detects and translates word collocations, translates single words, transliterates names, and disambiguates translation and transliteration candidates through several approaches. It also expands the query with the expected OCR errors generated by an Arabic OCR-error simulation model that is proposed in the paper. The query translation and expansion model is supported by several resources introduced in the paper, including a word collocations dictionary, single-word dictionaries, a Modern Arabic corpus, and other tools. The model achieves high accuracy in translating queries from English to Arabic while resolving translation and transliteration ambiguities, and, with orthographic query expansion, it handles OCR errors with a high degree of accuracy.
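To make the idea of orthographic query expansion concrete, the following is a minimal sketch, not the paper's simulation model. It assumes a small, hand-written confusion table (the name CONFUSIONS and its entries are hypothetical) listing Arabic letters that differ only in their dots and so could plausibly be confused by an OCR engine. Each query term is expanded into the original form plus every variant obtained by one such substitution.

```python
# Hypothetical confusion table (an assumption for illustration only):
# each letter maps to letters that share its base shape and differ in dots.
# A real system would derive these pairs, and their likelihoods, from
# aligned OCR output rather than from a hand-written list.
CONFUSIONS = {
    "ب": ["ت", "ث"],
    "ت": ["ب", "ث"],
    "ث": ["ب", "ت"],
    "ر": ["ز"],
    "ز": ["ر"],
    "د": ["ذ"],
    "ذ": ["د"],
}


def expand_term(term, confusions=CONFUSIONS):
    """Return the term followed by every single-substitution variant."""
    variants = [term]
    seen = {term}
    for i, ch in enumerate(term):
        for alt in confusions.get(ch, []):
            candidate = term[:i] + alt + term[i + 1:]
            if candidate not in seen:
                seen.add(candidate)
                variants.append(candidate)
    return variants


def expand_query(terms, confusions=CONFUSIONS):
    """Map each query term to its list of OCR-style variants."""
    return {term: expand_term(term, confusions) for term in terms}
```

The expanded variants of a term would typically be treated as synonyms of that term at search time, so that a document containing a misrecognized form can still match the query. Limiting the sketch to one substitution per term keeps the number of variants small; a probabilistic error model would instead rank or prune variants by likelihood.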
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Elghazaly, T., Fahmy, A. (2009). Query Translation and Expansion for Searching Normal and OCR-Degraded Arabic Text. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_39
DOI: https://doi.org/10.1007/978-3-642-00382-0_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)