A Language Framework for Measuring Semantic and Syntactic Similarity for Arabic Texts

Ismail, Shimaa; Alsammak, AbdelWahab; Elshishtawy, Tarek

doi:10.1007/s42979-024-02691-x

Original Research
Published: 27 March 2024

Volume 5, article number 351, (2024)
SN Computer Science

Abstract

A language framework for determining the similarity of two short text snippets is proposed. Edit distance serves as the framing algorithm to capture both syntactic and semantic similarity. Syntax-level distances are calculated between words in lemma form, while partial edit costs allow semantic similarity measurements to be embedded. Several knowledge resources are used: word synonyms, negation rules, and word semantic spaces. A searchable Arabic thesaurus is built in two forms, surface form and lemma form. Semantic word spaces are generated from a word embedding model, which represents words as vectors. To handle differing word order between sentences, the algorithm is extended with a word permutation technique that selects the alignment of the snippet words yielding the best matching score. The effect of negation words on textual similarity is also studied. The proposed approach was implemented to measure the similarity between Arabic texts, and the results are compared with other state-of-the-art algorithms on two benchmark datasets. The experimental results show that the proposed approach achieves a higher Pearson correlation coefficient than the other works.
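The core idea in the abstract, a word-level edit distance whose substitution cost shrinks when two words are semantically close, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy synonym set, the two-dimensional vectors, the cost rule `1 - cosine`, the length normalisation, and all function names are assumptions made for illustration only. Lemmatization and the negation rules mentioned in the abstract are not covered.

```python
from itertools import permutations
from math import sqrt

# Hypothetical toy resources standing in for a thesaurus and an embedding model.
SYNONYMS = {("car", "automobile")}
VECTORS = {
    "cat": (1.0, 0.0),
    "dog": (0.8, 0.6),
    "stone": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def substitution_cost(a, b):
    """Partial edit cost in [0, 1]: 0 for identical words or synonyms,
    1 - cosine similarity when both words have vectors, otherwise 1."""
    if a == b or (a, b) in SYNONYMS or (b, a) in SYNONYMS:
        return 0.0
    if a in VECTORS and b in VECTORS:
        return 1.0 - max(0.0, cosine(VECTORS[a], VECTORS[b]))
    return 1.0


def word_edit_distance(s, t):
    """Levenshtein distance over word lists with partial substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                           # delete a
                curr[j - 1] + 1.0,                       # insert b
                prev[j - 1] + substitution_cost(a, b),   # substitute a -> b
            ))
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Distance normalised by the longer text, mapped to a [0, 1] score."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(s, t) / longest


def best_permutation_similarity(s, t):
    """Best score over all word orders of s (brute force, short texts only)."""
    return max(similarity(list(p), t) for p in permutations(s))
```

Under these assumptions, a synonym substitution costs nothing, a pair of related words costs a fraction of a full edit, and trying permutations of one text removes the penalty for a pure word-order difference. The brute-force permutation search grows factorially with text length, so it is only practical for short snippets.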

Funding

There is no funding from any source for this work.

Author information

Authors and Affiliations

Information Systems Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, 13511, Egypt
Shimaa Ismail & Tarek Elshishtawy
Computer Systems Department, Faculty of Engineering Shoubra, Benha University, Benha, Egypt
AbdelWahab Alsammak

Corresponding author

Correspondence to Shimaa Ismail.

Ethics declarations

Conflict of Interest

The authors declare no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ismail, S., Alsammak, A. & Elshishtawy, T. A Language Framework for Measuring Semantic and Syntactic Similarity for Arabic Texts. SN COMPUT. SCI. 5, 351 (2024). https://doi.org/10.1007/s42979-024-02691-x

Received: 08 April 2022
Accepted: 08 February 2024
Published: 27 March 2024
DOI: https://doi.org/10.1007/s42979-024-02691-x

Keywords
