Abstract
Rapid growth in internet technology has increased the use of social media platforms, which make communication between users easier. In these exchanges, users write in their everyday language, which is considered non-standard. Non-standard text contains a lot of noise, such as abbreviations and slang, which are more common in English, and dialect words, which are widely used in Arabic. Such text is challenging for natural language processing tools. It therefore needs to be processed and transferred into a form close to the standard language. Normalization and translation approaches have been used to transfer informal text for this purpose. However, these approaches require large labeled or parallel datasets. While high-resource languages such as English have enough parallel datasets, low-resource languages such as Arabic lack them. Therefore, in this paper we focus on Arabic and Arabic dialects as low-resource languages in the area of transferring non-standard text using normalization and translation approaches.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Almansor, E.H., Al-Ani, A., Hussain, F.K. (2020). Transferring Informal Text in Arabic as Low Resource Languages: State-of-the-Art and Future Research Directions. In: Barolli, L., Hussain, F., Ikeda, M. (eds) Complex, Intelligent, and Software Intensive Systems. CISIS 2019. Advances in Intelligent Systems and Computing, vol 993. Springer, Cham. https://doi.org/10.1007/978-3-030-22354-0_17
DOI: https://doi.org/10.1007/978-3-030-22354-0_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22353-3
Online ISBN: 978-3-030-22354-0
eBook Packages: Intelligent Technologies and Robotics (R0)