Abstract
A language framework for determining the similarity of two text snippets is proposed. Edit distance serves as the framing algorithm for capturing both syntactic and semantic similarity. Syntax-level distances are calculated between words in lemma form, and partial edit costs are allowed so that semantic similarity measurements can be embedded. Several knowledge resources are used, including word synonyms, negation rules, and word semantic spaces. A searchable Arabic thesaurus dictionary is built in two forms, surface form and lemma form. The semantic word spaces are generated from a word embedding model, which represents words as vectors. To handle differing word orders between sentences, the algorithm is extended with a word permutation technique that selects the alignment of snippet words yielding the best matching score. The effect of negation words on textual similarity is also studied. The proposed approach was implemented to measure the similarity between Arabic texts. Results are compared with other state-of-the-art algorithms on two benchmark datasets. The experimental results show that the proposed approach achieves a higher Pearson correlation coefficient than the other works.
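The core idea of the abstract, a word-level edit distance whose substitution cost can be partial for semantically related words, combined with a search over word orders, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The synonym table, the partial cost of 0.25, the normalization by the longer snippet, and all function names are assumptions made for illustration, and English toy words stand in for the Arabic lemmas, thesaurus, and embedding spaces the paper relies on. Negation handling and embedding-based costs are left out.

```python
from itertools import permutations

# Hypothetical toy synonym table; the paper uses an Arabic thesaurus instead.
SYNONYMS = {frozenset({"big", "large"}), frozenset({"car", "automobile"})}


def sub_cost(a, b):
    """Substitution cost: 0 for identical words, a partial cost for synonyms."""
    if a == b:
        return 0.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.25  # assumed partial cost, not a value taken from the paper
    return 1.0


def word_edit_distance(s, t):
    """Levenshtein distance over word lists with weighted substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete the first i words of s
    for j in range(cols):
        d[0][j] = float(j)  # insert the first j words of t
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
            )
    return d[-1][-1]


def similarity(s, t):
    """Map the distance to [0, 1] by dividing by the longer snippet length."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(s, t) / longest


def best_similarity(s, t):
    """Score every ordering of s against t and keep the best one.

    Exhaustive search grows factorially, so this is only feasible for
    short snippets.
    """
    return max(similarity(list(p), t) for p in permutations(s))
```

With these assumed costs, `similarity(["car", "big"], ["large", "automobile"])` is 0.0 because the synonyms are misaligned, while `best_similarity` on the same inputs returns 0.75 after reordering the first snippet.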
Funding
There is no funding from any source for this work.
Ethics declarations
Conflict of Interest
The authors declare no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants performed by any of the authors.
About this article
Cite this article
Ismail, S., Alsammak, A. & Elshishtawy, T. A Language Framework for Measuring Semantic and Syntactic Similarity for Arabic Texts. SN COMPUT. SCI. 5, 351 (2024). https://doi.org/10.1007/s42979-024-02691-x
DOI: https://doi.org/10.1007/s42979-024-02691-x