ArSphere: Arabic word vectors embedded in a polar sphere

Rizkallah, Sandra; Atiya, Amir F.; Shaheen, Samir; Mahgoub, Hossam ElDin

doi:10.1007/s10772-022-09966-9

Published: 03 March 2022

International Journal of Speech Technology, volume 26, pages 95–111 (2023)

Sandra Rizkallah ORCID: orcid.org/0000-0001-7736-8944¹,
Amir F. Atiya¹,
Samir Shaheen¹ &
Hossam ElDin Mahgoub²

Abstract

Word embeddings map words to vectors in an N-dimensional space. ArSphere is an approach for designing word embeddings for the Arabic language. It overcomes one of the shortcomings of word embeddings (one that affects English too), namely their inability to handle opposites and to differentiate them from unrelated word pairs. To achieve this, the vectors are embedded onto the unit sphere rather than the entire space. The sphere is a suitable embedding surface because polarity can be addressed by placing the vectors of opposite words at opposite poles of the sphere. The proposed approach has several advantages. It utilizes the extensive resources developed by linguistic experts, including classic dictionaries, in contrast to the prevailing approach of designing word embeddings from word co-occurrence. It also successfully distinguishes between synonyms, antonyms, and unrelated word pairs. The embedding is computed by a simple relaxation algorithm derived for this purpose. Because the algorithm is fast, the word vector collection can be updated easily when new words or synonyms are added. The vectors are tested against a number of other published models, and the results show that ArSphere outperforms them.
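
The abstract does not spell out the relaxation algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical update rule: each vector is pulled toward its synonyms and toward the negation of its antonyms, then projected back onto the unit sphere. The function names, the toy English placeholder words, the dimension, and the step size are all illustrative assumptions.

```python
import numpy as np

def relax_on_sphere(words, synonyms, antonyms, dim=8, iters=200, step=0.5, seed=0):
    """Illustrative relaxation on the unit sphere (assumed rule, not ArSphere's)."""
    rng = np.random.default_rng(seed)
    index = {w: i for i, w in enumerate(words)}
    # Start from random points on the unit sphere.
    V = rng.normal(size=(len(words), dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    for _ in range(iters):
        pull = np.zeros_like(V)
        for a, b in synonyms:   # synonyms attract each other
            i, j = index[a], index[b]
            pull[i] += V[j]
            pull[j] += V[i]
        for a, b in antonyms:   # antonyms are pushed toward opposite poles
            i, j = index[a], index[b]
            pull[i] -= V[j]
            pull[j] -= V[i]
        V = V + step * pull
        # Project every vector back onto the unit sphere.
        V /= np.linalg.norm(V, axis=1, keepdims=True)
    return index, V

def cosine(index, V, a, b):
    """Cosine similarity of two unit vectors is just their dot product."""
    return float(V[index[a]] @ V[index[b]])

# Toy vocabulary (English placeholders standing in for dictionary entries).
words = ["good", "great", "bad", "awful", "table"]
synonyms = [("good", "great"), ("bad", "awful")]
antonyms = [("good", "bad"), ("great", "awful")]

index, V = relax_on_sphere(words, synonyms, antonyms)
print("good/great:", cosine(index, V, "good", "great"))
print("good/bad:  ", cosine(index, V, "good", "bad"))
```

In this sketch a word with no listed relations (here "table") simply keeps its initial random position. Because every vector has unit length, cosine similarity ranges over [-1, 1], which leaves room to separate synonyms (near +1) from antonyms (near -1) rather than lumping antonyms together with unrelated pairs.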

Author information

Authors and Affiliations

Computer Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
Sandra Rizkallah, Amir F. Atiya & Samir Shaheen
AlKhawarizmy Software, Giza, Egypt
Hossam ElDin Mahgoub

Corresponding author

Correspondence to Sandra Rizkallah.

About this article

Cite this article

Rizkallah, S., Atiya, A.F., Shaheen, S. et al. ArSphere: Arabic word vectors embedded in a polar sphere. Int J Speech Technol 26, 95–111 (2023). https://doi.org/10.1007/s10772-022-09966-9

Received: 10 February 2021
Accepted: 15 January 2022
Published: 03 March 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s10772-022-09966-9
