Abstract
We address the problem of cross-script retrieval in the context of a microblog system such as Twitter. Specifically, we explore methods for using native Arabic-script queries to retrieve Arabic tweets written in Arabizi, a Romanized form of Arabic. For example, a query for “كتاب” would not match “kitab” even though an Arabic reader would recognize them as the same word. Moreover, because Arabizi contains no Arabic script, automatic language identification methods fail to recognize it as Arabic and instead label it as English, Polish, or the like. We propose a cross-script retrieval system that uses automatic rule-based mapping and statistical selection of transliteration keywords. We show that our system can achieve effective cross-script retrieval with minimal knowledge of the target language and without relying on external translation or transliteration tools or lexica. With minimal human annotation, our technique can be applied to other languages, such as Hindi and Greek, that are also commonly written in a Roman character set.
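The abstract does not spell out the mapping rules or the selection statistic, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a tiny hypothetical rule table (`RULES`), an optional short vowel between adjacent consonants (since Arabic script usually omits short vowels), and plain corpus frequency as a stand-in for the statistical selection step.

```python
from collections import Counter
from itertools import product

# Hypothetical rule table (an assumption, not taken from the paper):
# each Arabic letter maps to one or more Roman renderings.
RULES = {
    "\u0643": ["k"],       # kaf
    "\u062a": ["t"],       # ta
    "\u0628": ["b"],       # ba
    "\u0627": ["a", "e"],  # alif (long vowel)
}
LONG_VOWELS = {"\u0627"}
# Short vowels are usually unwritten in Arabic script, so a Roman spelling
# may contain one (or none) between two adjacent consonants.
SHORT_VOWELS = ["", "a", "e", "i", "o", "u"]


def candidates(word):
    """Expand an Arabic-script word into a set of candidate Roman spellings.

    Raises KeyError for letters missing from the toy RULES table.
    """
    slots = []
    for i, ch in enumerate(word):
        slots.append(RULES[ch])
        nxt = word[i + 1] if i + 1 < len(word) else None
        # Open an optional short-vowel slot between two consonants.
        if ch not in LONG_VOWELS and nxt is not None and nxt not in LONG_VOWELS:
            slots.append(SHORT_VOWELS)
    return {"".join(parts) for parts in product(*slots)}


def select_keywords(word, corpus_tokens, min_count=1):
    """Keep candidates attested in the corpus, most frequent first."""
    counts = Counter(corpus_tokens)
    kept = [(c, counts[c]) for c in candidates(word) if counts[c] >= min_count]
    return [c for c, _ in sorted(kept, key=lambda pair: (-pair[1], pair[0]))]


query = "\u0643\u062a\u0627\u0628"  # the Arabic-script query from the abstract
tweets = "kitab ktab kitab hello ketab book".split()  # toy token stream
print(select_keywords(query, tweets))  # ['kitab', 'ketab', 'ktab']
```

The selected spellings could then be issued as Roman-script query terms against the tweet index; unattested candidates are discarded so the expanded query stays small.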
N. Naji: This work was done while the author was at the University of Massachusetts Amherst, supported by the Swiss National Science Foundation Early Postdoc.Mobility fellowship project P2NEP2_151940.
Notes
- 1.
- 2. Webpage accessed January 3rd 2016, 19:17 http://langs.eserver.org/qalam.
- 3. Webpage accessed January 3rd 2016, 19:18 http://languagelog.ldc.upenn.edu/myl/ldc/morph/buckwalter.html.
Acknowledgements
This work is supported by the Swiss National Science Foundation Early Postdoc.Mobility fellowship project P2NEP2_151940 and in part by the Center for Intelligent Information Retrieval. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Naji, N., Allan, J. (2016). On Cross-Script Information Retrieval. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_70
DOI: https://doi.org/10.1007/978-3-319-30671-1_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30670-4
Online ISBN: 978-3-319-30671-1
eBook Packages: Computer Science (R0)