Automatic Building of a Large Arabic Spelling Error Corpus

Aichaoui, Shaimaa Ben; Hiri, Nawel; Dahou, Abdelhalim Hafedh; Cheragui, Mohamed Amine

doi:10.1007/s42979-022-01499-x

Original Research
Published: 19 December 2022

SN Computer Science, Volume 4, article number 108 (2023)

Shaimaa Ben Aichaoui¹,
Nawel Hiri¹,
Abdelhalim Hafedh Dahou² &
Mohamed Amine Cheragui¹

Abstract

Spell checking is a classical topic in natural language processing, and the corpus has become an important component of its development process, especially with the emergence of stochastic and machine learning approaches that exploit corpora to build resolution models. Our work has two phases. The first is to build a corpus dedicated to the detection and correction of spelling errors in Arabic texts, which we call SPIRAL. The second is to assess the impact of this corpus through an experimental study using a deep learning model, AraBART. The F1 scores obtained were: 80.2% for morphology errors, 81.6% for phonetic errors, 73% for physical errors, 78.3% for permutation errors, 64.3% for keyboard errors, 33.7% for delete errors, 86% for space-issue errors, and 84.5% for tachkil errors.
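
For readers unfamiliar with the metric, the F1 score reported per error category is the harmonic mean of precision and recall. The following is a minimal sketch of that standard definition only; the function name and the counts are hypothetical and are not taken from the paper, whose exact evaluation protocol is not described on this page.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # everything the model flagged
    actual = true_positives + false_negatives     # everything it should have flagged
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single error category (illustrative only).
example = f1_score(true_positives=80, false_positives=20, false_negatives=20)
print(f"F1 = {example:.3f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is often preferred over plain accuracy for error detection and correction tasks.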

Data availability statement

No data was used for the research described in the article.

Notes

A corpus can be defined as a collection of machine-readable authentic texts (including transcripts of spoken data) that is sampled to be representative of a particular natural language or language variety [4].
To download and use the SPIRAL corpus, see the following link: https://mega.nz/file/sa50BaKL#i9qjD52tt-QzLiKWM-rOQ9XrC1anyOwSOJ_kpzutN7M.
https://www.alriyadh.com/.
https://www.okaz.com.sa/.
https://www.bbc.com/arabic.
https://learning.aljazeera.net/en/generallanguage/level/beginner.
https://al-maktaba.org/.
https://github.com/Dahouabdelhalim/SPIRAL.
Examples of Arabic prefixes: لل, بال, فال, وال, كال, فوال, فبال, وبال, وكال.
The experiment documents are available in our GitHub repository: https://github.com/Dahouabdelhalim/SPIRAL.

References

Bounhas I. On the Usage of a Classical Arabic Corpus as a Language Resource. ACM Transactions on Asian and Low-Resource Language Information Processing. 2019;18(3):1–45. https://doi.org/10.1145/3277591.
Indurkhya N., Damerau F. J.: Handbook of Natural Language Processing. Taylor and Francis Group. (2010).
Dipper, S.: Theory-driven and corpus-driven computational linguistics, and the use of corpora. In A. Ludeling and M. Kyto (eds.), Corpus Linguistics: An International Handbook (Vol. 1), pp. 68–96. (2008)
McEnery A, Xiao R, Tono Y. Corpus-Based Language Studies: An Advanced Resource Book. London, U.K.: Routledge; 2006.
Kukich K. Techniques for automatically correcting words in text. ACM Computing Surveys. 1992;24(4):377–439.
Saty A. A., Bouzoubaa K., Si Lhoussain A.: Survey of Arabic Checker Techniques. SUST Journal of Engineering and Computer Sciences (JECS), Vol. 21, No. 1, (2020).
Jabar, Y. and Al-Risi, M.: Part of Speech Tagger for Arabic Text Based Support Vector Machines: A Review. ICTACT Journal on Soft Computing: Special Issue on Artificial Intelligence and Deep Learning, January, Volume: 09, Issue: 02. (2019)
Elnagar A, Yagi SM, Nassif AB, Shahin I, Salloum SA. Systematic Literature Review of Dialectal Arabic: Identification and Detection. IEEE Access. 2021;9:31010–42. https://doi.org/10.1109/access.2021.3059504.
Belkebir, R., and Habash, N.: Automatic Error Type Annotation for Arabic. arXiv preprint arXiv:2109.08068 (2021).
Shaalan, K., Allam, A. and Gomah, A.: Towards automatic spell checking for Arabic. In Proceedings of the 4th Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE). pages 21–22. (2003).
Shaalan, K. Aref, R. and Fahmy, A.: An approach for analyzing and correcting spelling errors for non-native arabic learners. In the proceeding of the 7th International Conference on Informatics and Systems (INFOS), pages 1–7. IEEE, (2010).
Mars, M.: Toward a robust spell checker for arabic text. In the proceeding of the International Conference on Computational Science and Its Applications, pages 312–322. Springer, (2016).
Alamri, M. and Teahan, W. J.: A new error annotation for dyslexic texts in Arabic. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 72–78, (2017).
Lawaye, A. A. and Purkayastha, B.: Design and implementation of spell checker for kashmiri. International Journal of Scientific Research, 5(7), (2016).
Attia, M., Pecina, P., Samih, Y., Shaalan, K., & Van Genabith, J.: Arabic spelling error detection and correction. Natural Language Engineering, 22(05), 751–773. (2015)
Alkhatib M, Monem AA, Shaalan K. Deep learning for arabic error detection and correction. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). 2020;19(5):1–13.
Noaman, H. M., Sarhan, S. S. and Rashwan, M. A. A.: Automatic Arabic Spelling Errors Detection and Correction Based on Confusion Matrix-Noisy Channel Hybrid System. Egyptian Computer Science Journal Vol. 40 No. 2 (2016)
Shaalan, K. Attia, M. Pecina, P. Samih, Y. and Van Genabith, J.: Arabic Word Generation and Modelling for Spell Checking. In Proceedings of the Eighth International Conference on Language Resources and Evaluation. pages 719–725. (2012).
Hassan, Y. Aly, M. Atiya, A.: Arabic Spelling Correction using Supervised Learning. arXiv:1409.8309. (2014).
Alkanhal, M.I., Al-Badrashiny, M.A., Alghamdi, M.M., Al-Qabbany, A.O.: Automatic stochastic arabic spelling correction with emphasis on space insertions and deletions. In: Proceeding of IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no.7, (2012).
Yousfi A, Lhoussain AS, Hicham G, Mohamed N. Spelling correction for the Arabic language space deletion errors. Procedia Computer Science. 2020;177:568–74. https://doi.org/10.1016/j.procs.2020.10.080.
Habash N., Mohit B., Obeid O., Oflazer K., Tomeh N. and Zaghouani W.: QALB: Qatar Arabic Language Bank. In Proceedings of Qatar Annual Research Conference (2013).
Al-Jefri M. M. and Mahmoud S. A.: Context-sensitive Arabic spell checker using context words and N-gram language models. In Proceedings of the International Conference on Advances in Information Technology for the Holy Quran and Its Sciences. 258–263. (2013)
Abandah GA, Graves A, Al-Shagoor B, Arabiyat A, Jamour F, Al-Taee M. Automatic diacritization of Arabic text using recurrent neural networks. Int J Doc Anal Recognit. 2015;18(2):183–197.
Farwaneh S. and Tamimi M.: Arabic Learners Written Corpus: A resource for research and learning. The Center for Educational Resources in Culture, Language and Literacy. (2012).
Habash, N. Diab, M. and Rambow, O.: Conventional Orthography for Dialectal Arabic. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 711–718. (2012)
Brosh H. Arabic spelling: Errors, perceptions, and strategies. Foreign Lang Ann. 2015;48(4):584–603.
Eddine, M. K., Tomeh, N., Habash, N., Roux, J. L., and Vazirgiannis, M.: AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization. arXiv preprint arXiv:2203.10945. (2022)
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L.: Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. (2019).
Aichaoui, S.B., Hiri, N., Cheragui, M.A. (2022). SPIRAL: SPellIng eRror Parallel Corpus for Arabic Language. In the proceeding of the Intelligent Systems and Pattern Recognition. ISPR 2022. Communications in Computer and Information Science, vol 1589. Springer, Cham. (2022) https://doi.org/10.1007/978-3-031-08277-1_21

Funding

This study did not receive any funding.

Author information

Authors and Affiliations

Mathematics and Computer Science Department, Ahmed Draia University, Adrar, Algeria
Shaimaa Ben Aichaoui, Nawel Hiri & Mohamed Amine Cheragui
Lorraine University, 54000, Nancy, France
Abdelhalim Hafedh Dahou

Corresponding author

Correspondence to Mohamed Amine Cheragui.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection "Recent Trends on Machine Learning & Intelligent Systems" guest edited by Akram Bennour, Tolga Ensari and Abdel-Badeeh Salem.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Aichaoui, S.B., Hiri, N., Dahou, A.H. et al. Automatic Building of a Large Arabic Spelling Error Corpus. SN COMPUT. SCI. 4, 108 (2023). https://doi.org/10.1007/s42979-022-01499-x

Received: 12 July 2022
Accepted: 07 November 2022
Published: 19 December 2022
DOI: https://doi.org/10.1007/s42979-022-01499-x
