Automatic Building of a Large Arabic Spelling Error Corpus

  • Original Research
  • SN Computer Science

Abstract

Spell checking is a classical topic in natural language processing, and corpora have become an important component in the development of spell checkers, especially with the emergence of stochastic and machine learning approaches that exploit corpora to build their models. Our work has two phases. The first is to build a corpus dedicated to the detection and correction of spelling errors in Arabic texts, which we call SPIRAL. The second is to assess the impact of this corpus through an experimental study using a deep learning model, AraBART. The F1 scores obtained were: 80.2% for morphology errors, 81.6% for phonetic errors, 73% for physical errors, 78.3% for permutation errors, 64.3% for keyboard errors, 33.7% for delete errors, 86% for space-issue errors, and 84.5% for tachkil errors.
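For readers unfamiliar with the metric, the per-category scores above are F1 values, that is, the harmonic mean of precision and recall. The following is a minimal sketch of that computation from raw counts. The function name `f1_score` and the count-based interface are illustrative assumptions; the abstract does not say how the authors counted correct and incorrect corrections.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp, fp, fn are hypothetical counts of true positives, false
    positives, and false negatives for one error category.
    """
    if tp == 0:
        # No correct predictions: precision and recall are both 0
        # (or undefined), so F1 is reported as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because F1 is a harmonic mean, a category scores well only when both precision and recall are high; a low value such as the one reported for delete errors means at least one of the two is low.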

Data availability statement

No data was used for the research described in the article.

Notes

  1. A corpus can be defined as a collection of machine-readable authentic texts (including transcripts of spoken data) that is sampled to be representative of a particular natural language or language variety [4].

  2. To download and use the SPIRAL corpus, see the following link: https://mega.nz/file/sa50BaKL#i9qjD52tt-QzLiKWM-rOQ9XrC1anyOwSOJ_kpzutN7M.

  3. https://www.alriyadh.com/.

  4. https://www.okaz.com.sa/.

  5. https://www.bbc.com/arabic.

  6. https://learning.aljazeera.net/en/generallanguage/level/beginner.

  7. https://al-maktaba.org/.

  8. https://github.com/Dahouabdelhalim/SPIRAL.

  9. Examples of Arabic prefixes: لل, بال, فال, وال, كال, فوال, فبال, وبال, وكال.

  10. The experiment documents are available in our GitHub repository: https://github.com/Dahouabdelhalim/SPIRAL.

References

  1. Bounhas I. On the Usage of a Classical Arabic Corpus as a Language Resource. ACM Transactions on Asian and Low-Resource Language Information Processing. 2019;18(3):1–45. https://doi.org/10.1145/3277591.

  2. Indurkhya N., Damerau F. J.: Handbook of Natural Language Processing. Taylor and Francis Group. (2010).

  3. Dipper, S.: Theory-driven and corpus-driven computational linguistics, and the use of corpora. In A. Ludeling and M. Kyto (eds.), Corpus Linguistics: An International Handbook (Vol. 1), pp. 68–96. (2008)

  4. McEnery A, Xiao R, Tono Y. Corpus-Based Language Studies: An Advanced Resource Book. London, U.K.: Routledge; 2006.

  5. Kukich K. Techniques for automatically correcting words in text. ACM Computing Surveys. 1992;24(4):377–439.

  6. Saty A. A., Bouzoubaa K., Si Lhoussain A.: Survey of Arabic Checker Techniques. SUST Journal of Engineering and Computer Sciences (JECS), Vol. 21, No. 1, (2020).

  7. Jabar, Y. and Al-Risi, M.: Part of Speech Tagger for Arabic Text Based Support Vector Machines: A Review. ICTACT Journal on Soft Computing: Special Issue on Artificial Intelligence and Deep Learning, Janu, Volume: 09, Issue: 02. (2019)

  8. Elnagar A, Yagi SM, Nassif AB, Shahin I, Salloum SA. Systematic Literature Review of Dialectal Arabic: Identification and Detection. IEEE Access. 2021;9:31010–42. https://doi.org/10.1109/access.2021.3059504.

  9. Belkebir, R., and Habash, N.: Automatic Error Type Annotation for Arabic. arXiv preprint arXiv:2109.08068 (2021).

  10. Shaalan, K., Allam, A. and Gomah, A.: Towards automatic spell checking for Arabic. In Proceedings of the 4th Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), pages 21–22. (2003).

  11. Shaalan, K., Aref, R. and Fahmy, A.: An approach for analyzing and correcting spelling errors for non-native Arabic learners. In Proceedings of the 7th International Conference on Informatics and Systems (INFOS), pages 1–7. IEEE, (2010).

  12. Mars, M.: Toward a robust spell checker for Arabic text. In Proceedings of the International Conference on Computational Science and Its Applications, pages 312–322. Springer, (2016).

  13. Alamri, M. and Teahan, W. J.: A new error annotation for dyslexic texts in Arabic. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 72–78, (2017).

  15. Lawaye, A. A. and Purkayastha, B.: Design and implementation of spell checker for Kashmiri. International Journal of Scientific Research, 5(7), (2016).

  16. Attia, M., Pecina, P., Samih, Y., Shaalan, K., & Van Genabith, J.: Arabic spelling error detection and correction. Natural Language Engineering, 22(05), 751–773. (2015)

  17. Alkhatib M, Monem AA, Shaalan K. Deep learning for Arabic error detection and correction. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). 2020;19(5):1–13.

  18. Noaman, H. M., Sarhan, S. S. and Rashwan, M. A. A.: Automatic Arabic Spelling Errors Detection and Correction Based on Confusion Matrix-Noisy Channel Hybrid System. Egyptian Computer Science Journal, Vol. 40, No. 2 (2016)

  19. Shaalan, K., Attia, M., Pecina, P., Samih, Y. and Van Genabith, J.: Arabic Word Generation and Modelling for Spell Checking. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 719–725. (2012).

  20. Hassan, Y., Aly, M., Atiya, A.: Arabic Spelling Correction using Supervised Learning. arXiv:1409.8309. (2014).

  21. Alkanhal, M.I., Al-Badrashiny, M.A., Alghamdi, M.M., Al-Qabbany, A.O.: Automatic stochastic Arabic spelling correction with emphasis on space insertions and deletions. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, (2012).

  22. Yousfi A, Lhoussain AS, Hicham G, Mohamed N. Spelling correction for the Arabic language space deletion errors. Procedia Computer Science. 2020;177:568–74. https://doi.org/10.1016/j.procs.2020.10.080.

  23. Habash N., Mohit B., Obeid O., Oflazer K., Tomeh N. and Zaghouani W.: QALB: Qatar Arabic Language Bank. In Proceedings of Qatar Annual Research Conference (2013).

  24. Al-Jefri M. M. and Mahmoud S. A.: Context-sensitive Arabic spell checker using context words and N-gram language models. In Proceedings of the International Conference on Advances in Information Technology for the Holy Quran and Its Sciences, 258–263. (2013)

  25. Abandah GA, Graves A, Al-Shagoor B, Arabiyat A, Jamour F, Al-Taee M. Automatic diacritization of Arabic text using recurrent neural networks. Int J Doc Anal Recognit. 2015;18(2):183–197.

  26. Farwaneh S. and Tamimi M.: Arabic Learners Written Corpus: A resource for research and learning. The Center for Educational Resources in Culture, Language and Literacy. (2012).

  27. Habash, N., Diab, M. and Rambow, O.: Conventional Orthography for Dialectal Arabic. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 711–718. (2012)

  28. Brosh H. Arabic spelling: Errors, perceptions, and strategies. Foreign Lang Ann. 2015;48(4):584–603.

  29. Eddine, M. K., Tomeh, N., Habash, N., Roux, J. L., and Vazirgiannis, M.: AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization. arXiv preprint arXiv:2203.10945. (2022)

  30. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L.: BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. (2019).

  31. Aichaoui, S.B., Hiri, N., Cheragui, M.A.: SPIRAL: SPellIng eRror Parallel Corpus for Arabic Language. In Intelligent Systems and Pattern Recognition. ISPR 2022. Communications in Computer and Information Science, vol 1589. Springer, Cham. (2022) https://doi.org/10.1007/978-3-031-08277-1_21

Funding

This study did not receive any funding.

Author information

Corresponding author

Correspondence to Mohamed Amine Cheragui.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Recent Trends on Machine Learning & Intelligent Systems” guest edited by Akram Bennour, Tolga Ensari and Abdel-Badeeh Salem.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Aichaoui, S.B., Hiri, N., Dahou, A.H. et al. Automatic Building of a Large Arabic Spelling Error Corpus. SN COMPUT. SCI. 4, 108 (2023). https://doi.org/10.1007/s42979-022-01499-x

  • DOI: https://doi.org/10.1007/s42979-022-01499-x
