Transferring Informal Text in Arabic as Low Resource Languages: State-of-the-Art and Future Research Directions

Almansor, Ebtesam H.; Al-Ani, Ahmed; Hussain, Farookh Khadeer

doi:10.1007/978-3-030-22354-0_17

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 993)

Included in the following conference series:

Conference on Complex, Intelligent, and Software Intensive Systems

Abstract

Rapid growth in internet technology has increased the use of social media platforms, which make communication between users easier. In these exchanges, users write in their everyday language, which is considered non-standard. Non-standard text contains a lot of noise, such as abbreviations and slang, which are more common in English, and dialect words, which are widely used in Arabic. Such text is challenging for natural language processing tools. It therefore needs to be processed and transferred into a form close to the standard language. Normalization and translation approaches have been used to transfer informal text in this way. However, these approaches require large labeled or parallel datasets. While high-resource languages such as English have enough parallel data, low-resource languages such as Arabic lack it. In this paper, we therefore focus on Arabic and its dialects as low-resource languages in the area of transferring non-standard text using normalization and translation approaches.

Author information

Authors and Affiliations

School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia
Ebtesam H. Almansor, Ahmed Al-Ani & Farookh Khadeer Hussain
Community College, Najran University, Najran, Saudi Arabia
Ebtesam H. Almansor

Corresponding author

Correspondence to Ebtesam H. Almansor.

Editor information

Editors and Affiliations

Department of Information and Communication Engineering, Fukuoka Institute of Technology, Faculty of Information Engineering, Fukuoka, Japan
Leonard Barolli
School of Software, University of Technology Sydney (UTS), Ultimo, NSW, Australia
Farookh Khadeer Hussain
Department of Information and Communication Engineering, Fukuoka Institute of Technology, Faculty of Information Engineering, Fukuoka, Japan
Makoto Ikeda

About this paper

Cite this paper

Almansor, E.H., Al-Ani, A., Hussain, F.K. (2020). Transferring Informal Text in Arabic as Low Resource Languages: State-of-the-Art and Future Research Directions. In: Barolli, L., Hussain, F., Ikeda, M. (eds) Complex, Intelligent, and Software Intensive Systems. CISIS 2019. Advances in Intelligent Systems and Computing, vol 993. Springer, Cham. https://doi.org/10.1007/978-3-030-22354-0_17

DOI: https://doi.org/10.1007/978-3-030-22354-0_17
Published: 21 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22353-3
Online ISBN: 978-3-030-22354-0
eBook Packages: Intelligent Technologies and Robotics (R0)
