Abstract
Arabizi is a form of written Arabic that relies on Latin letters, numerals and punctuation rather than Arabic letters. Most work in the literature concentrates on Arabic and neglects Arabizi. To conduct automatic translation and sentiment analysis, some approaches handle Arabizi like any other language, while others use a transliteration phase that converts Arabizi into Arabic script. In this context, the main purpose of this study is to determine how useful Arabizi transliteration is for improving automatic translation and sentiment analysis results. We introduce a rule-based automatic transliteration system and apply it to a collection of messages before proceeding to the machine translation and sentiment analysis tasks. To evaluate the importance of transliteration for these tasks, we also present the construction of a set of lexical resources: a manually constructed parallel corpus between Arabizi and Modern Standard Arabic (MSA), a sentiment lexicon built automatically and revised manually, and an annotated sentiment corpus constructed automatically from the sentiment lexicon. We also apply a set of algorithms and models dedicated to machine translation and sentiment analysis, including a number of shallow and deep classifiers as well as different embedding-based models for feature extraction. The experimental results show a consistent improvement after applying transliteration, with a BLEU score of up to 13.06 for automatic translation and an F1-score of up to 92% for sentiment classification. These results indicate that transliteration is a key factor in handling Arabizi.
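To make the pipeline described in the abstract concrete, the following is a minimal sketch of a rule-based transliteration step followed by lexicon-based sentiment labelling. It is an illustration only, not the system presented in the chapter: the rule tables, the two-entry lexicon and the function names `transliterate` and `label` are all hypothetical, and vowels are mapped naively to long vowels.

```python
# Hypothetical, deliberately tiny rule tables (NOT the chapter's actual rules).
# Two-character sequences are tried first, then single characters.
DIGRAPHS = {"ch": "ش", "sh": "ش", "kh": "خ", "gh": "غ"}
SINGLES = {
    "2": "ء", "3": "ع", "5": "خ", "7": "ح", "9": "ق",
    "a": "ا", "b": "ب", "t": "ت", "j": "ج", "d": "د", "r": "ر",
    "z": "ز", "s": "س", "f": "ف", "q": "ق", "k": "ك", "l": "ل",
    "m": "م", "n": "ن", "h": "ه", "w": "و", "o": "و", "u": "و",
    "y": "ي", "i": "ي",
}


def transliterate_word(word):
    """Map one Arabizi token to Arabic script, longest rule first."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters without a rule are copied through unchanged.
            out.append(SINGLES.get(word[i], word[i]))
            i += 1
    return "".join(out)


def transliterate(message):
    """Transliterate a whitespace-tokenised message word by word."""
    return " ".join(transliterate_word(w) for w in message.split())


# Hypothetical sentiment lexicon keyed by Arabic-script forms (assumption).
LEXICON = {"مليح": 1, "خايب": -1}


def label(message):
    """Sum lexicon scores over the transliterated tokens and pick a label."""
    score = sum(LEXICON.get(tok, 0) for tok in transliterate(message).split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(transliterate("mli7"))
print(label("mli7"), label("khayb"))
```

Because the lexicon is keyed by Arabic-script forms, the Arabizi input only matches after transliteration, which is the dependency the study sets out to measure. A realistic system would also need to generate and rank several candidates per word, since Arabizi spelling is not standardised.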
Notes
- 1.
Code switching: the presence of different languages and dialects in the same message.
- 2.
https://en.wikipedia.org/wiki/Arabic_chat_alphabet.
- 6.
ooredooqatar.
Acknowledgements
Mr. Mendoza acknowledges funding support from the Millennium Institute for Foundational Research on Data and from the project BASAL FB0821. The funder played no role in the design of this study.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Guellil, I., Azouaou, F., Benali, F., Hachani, A.E., Mendoza, M. (2020). The Role of Transliteration in the Process of Arabizi Translation/Sentiment Analysis. In: Abd Elaziz, M., Al-qaness, M., Ewees, A., Dahou, A. (eds) Recent Advances in NLP: The Case of Arabic Language. Studies in Computational Intelligence, vol 874. Springer, Cham. https://doi.org/10.1007/978-3-030-34614-0_6
DOI: https://doi.org/10.1007/978-3-030-34614-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34613-3
Online ISBN: 978-3-030-34614-0
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)