A Language Framework for Measuring Semantic and Syntactic Similarity for Arabic Texts

  • Original Research
SN Computer Science

Abstract

A language framework for determining the similarity of two text snippets is proposed. Edit distance serves as the framing algorithm for capturing both syntactic and semantic similarity. Syntax-level distances are calculated between words in lemma form, while partial edit costs allow semantic similarity measurements to be embedded. Several knowledge resources are used: word synonyms, negation rules, and word semantic spaces. A searchable Arabic thesaurus is built in two forms, surface form and lemma form. The semantic word spaces are generated from a word embedding model, which represents words as vectors. To handle differing word order between sentences, the algorithm is extended with a word permutation technique that selects the alignment of the snippets' words yielding the best matching score. The effect of negation words on textual similarity is also studied. The approach was implemented to measure similarity between Arabic texts, and the results are compared with other state-of-the-art algorithms on two benchmark datasets. The experimental results show that the proposed approach achieves a higher Pearson correlation coefficient than the other works.
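The core idea in the abstract, a word-level edit distance whose substitution cost can be partial, combined with a search over word orders, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy English tokens, the synonym table, and the 0.2 synonym cost are all assumptions chosen for illustration, and the normalization by the longer snippet length is one plausible choice among several. A real system would work on Arabic lemmas and derive partial costs from a thesaurus, negation rules, and embedding similarity.

```python
from itertools import permutations
from typing import Callable, Sequence

Cost = Callable[[str, str], float]

# Hypothetical synonym pairs standing in for a thesaurus lookup.
SYNONYMS = {frozenset({"car", "automobile"})}


def toy_sub_cost(x: str, y: str) -> float:
    """Substitution cost: 0 for identical words, a partial cost for
    synonyms (0.2 is an arbitrary illustrative value), 1 otherwise."""
    if x == y:
        return 0.0
    if frozenset({x, y}) in SYNONYMS:
        return 0.2
    return 1.0


def weighted_edit_distance(a: Sequence[str], b: Sequence[str], sub_cost: Cost) -> float:
    """Word-level Levenshtein distance with unit insert/delete costs and a
    caller-supplied, possibly fractional, substitution cost."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)  # delete all of a[:i]
    for j in range(1, n + 1):
        dp[0][j] = float(j)  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,  # deletion
                dp[i][j - 1] + 1.0,  # insertion
                dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitution
            )
    return dp[m][n]


def similarity(a: Sequence[str], b: Sequence[str], sub_cost: Cost) -> float:
    """Map the distance to [0, 1] by dividing by the longer snippet length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(a, b, sub_cost) / longest


def best_permutation_similarity(a: Sequence[str], b: Sequence[str], sub_cost: Cost) -> float:
    """Try every ordering of a and keep the best score. This brute-force
    search is factorial in len(a), so it only suits short snippets."""
    return max(similarity(list(p), b, sub_cost) for p in permutations(a))


if __name__ == "__main__":
    s1 = ["the", "car", "is", "fast"]
    s2 = ["the", "automobile", "is", "fast"]
    print(similarity(s1, s2, toy_sub_cost))

    reordered = ["fast", "is", "the", "car"]
    print(similarity(reordered, s1, toy_sub_cost))
    print(best_permutation_similarity(reordered, s1, toy_sub_cost))
```

In this sketch a synonym substitution lowers the score only slightly, while a reordered sentence scores poorly under plain edit distance and recovers a full match once permutations are considered, which is the behavior the abstract attributes to the partial edit costs and the word permutation technique.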

Funding

There is no funding from any source for this work.

Author information

Corresponding author

Correspondence to Shimaa Ismail.

Ethics declarations

Conflict of Interest

The authors declare no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants performed by any of the authors.

About this article

Cite this article

Ismail, S., Alsammak, A. & Elshishtawy, T. A Language Framework for Measuring Semantic and Syntactic Similarity for Arabic Texts. SN COMPUT. SCI. 5, 351 (2024). https://doi.org/10.1007/s42979-024-02691-x

  • DOI: https://doi.org/10.1007/s42979-024-02691-x
