Transferring Informal Text in Arabic as Low Resource Languages: State-of-the-Art and Future Research Directions

  • Conference paper
Complex, Intelligent, and Software Intensive Systems (CISIS 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 993)

Abstract

Rapid growth in internet technology has led to increased use of social media platforms, which make communication between users easier. In these communications, users write in their everyday language, which is considered non-standard. Non-standard text contains a lot of noise, such as abbreviations and slang, which are more common in English, and dialect words, which are widely used in Arabic. Such text is challenging for natural language processing tools. It therefore needs to be processed and transferred into a form close to the standard language. Normalization and translation approaches have been used to transfer informal text. However, these approaches need large labelled or parallel datasets. While high-resource languages such as English have enough parallel datasets, low-resource languages such as Arabic lack them. Therefore, in this paper we focus on Arabic and Arabic dialects as low-resource languages in the area of transferring non-standard text using normalization and translation approaches.

Corresponding author

Correspondence to Ebtesam H. Almansor.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Almansor, E.H., Al-Ani, A., Hussain, F.K. (2020). Transferring Informal Text in Arabic as Low Resource Languages: State-of-the-Art and Future Research Directions. In: Barolli, L., Hussain, F., Ikeda, M. (eds) Complex, Intelligent, and Software Intensive Systems. CISIS 2019. Advances in Intelligent Systems and Computing, vol 993. Springer, Cham. https://doi.org/10.1007/978-3-030-22354-0_17