On Cross-Script Information Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9626)

Abstract

We address the problem of cross-script retrieval in the context of a microblog system such as Twitter. Specifically, we explore methods for using native Arabic script queries to retrieve Arabic tweets written in a Roman script known as Arabizi. For example, a query for “كتاب” would not match “kitab” even though an Arabic reader would see them as the same word. Moreover, because Arabizi text contains no Arabic script, automatic language identification methods fail to recognize it as Arabic and instead label it as English, Polish, or the like. We propose a cross-script retrieval system using automatic rule-based mapping and statistical selection of transliteration keywords. We show that our system can achieve effective cross-script retrieval with minimal knowledge of the target language and without the need to rely on external translation or transliteration tools or lexica. With minimal human annotation, our technique can be applied to other languages, such as Hindi and Greek, that are likewise commonly written in a Roman character set.
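To make the idea of rule-based mapping concrete, here is a minimal Python sketch of how an Arabic-script query might be expanded into candidate Arabizi spellings. It is only an illustration under stated assumptions: the four-letter mapping table, the optional short-vowel rule, and the function name `arabizi_candidates` are hypothetical and are not taken from the paper, and the statistical selection step the abstract mentions is not covered.

```python
from itertools import product

# Hypothetical, deliberately tiny mapping table (NOT the paper's rule set).
# Each Arabic letter maps to one or more plausible Roman renderings.
LETTER_MAP = {
    "\u0643": ["k"],  # kaf
    "\u062a": ["t"],  # teh
    "\u0627": ["a"],  # alef (long vowel)
    "\u0628": ["b"],  # beh
}

# Letters treated as written (long) vowels in this sketch.
LONG_VOWELS = {"\u0627"}

# Assumption: short vowels are usually unwritten in Arabic script, so a
# Roman-script writer may insert one (or none) between two consonants.
SHORT_VOWELS = ["", "a", "i", "u"]


def arabizi_candidates(word):
    """Return the sorted set of candidate Roman spellings for an Arabic word."""
    slots = []
    for i, ch in enumerate(word):
        # Unmapped characters are passed through unchanged.
        slots.append(LETTER_MAP.get(ch, [ch]))
        nxt = word[i + 1] if i + 1 < len(word) else None
        # Allow an optional short vowel only between two consonants.
        if ch not in LONG_VOWELS and nxt is not None and nxt not in LONG_VOWELS:
            slots.append(SHORT_VOWELS)
    return sorted({"".join(parts) for parts in product(*slots)})


if __name__ == "__main__":
    query = "\u0643\u062a\u0627\u0628"  # the Arabic-script query from the abstract
    print(arabizi_candidates(query))
```

For the query in the abstract, the candidate set produced by this sketch includes "kitab". A real system would need a far larger mapping table and some way of choosing among the candidates, which is the role the abstract assigns to statistical selection.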

N. Naji: This work was done while the author was at the University of Massachusetts Amherst, supported by the Swiss National Science Foundation Early Postdoc.Mobility fellowship project P2NEP2_151940.

Notes

  1. https://ciir.cs.umass.edu/downloads/

  2. Webpage accessed January 3rd 2016, 19:17: http://langs.eserver.org/qalam

  3. Webpage accessed January 3rd 2016, 19:18: http://languagelog.ldc.upenn.edu/myl/ldc/morph/buckwalter.html

Acknowledgements

This work is supported by the Swiss National Science Foundation Early Postdoc.Mobility fellowship project P2NEP2_151940 and is supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

Author information

Corresponding author

Correspondence to Nada Naji.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Naji, N., Allan, J. (2016). On Cross-Script Information Retrieval. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_70

  • DOI: https://doi.org/10.1007/978-3-319-30671-1_70

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30670-4

  • Online ISBN: 978-3-319-30671-1

  • eBook Packages: Computer Science (R0)
