
A hybrid model for spelling error detection and correction for Urdu language

Abstract

Detecting and correcting misspelled words in written text is of great importance in many natural language processing applications. Errors can be broadly classified into two groups: spelling errors and contextual errors. Spelling errors occur when the misspelled words do not exist in a dictionary and are meaningless, while contextual errors occur when the words do exist in the dictionary but are not used as the writer intended. This paper presents an “Urdu Spell Checker” that detects incorrect spellings of a word using the widely used lexicon lookup approach and provides a list of candidate words containing correct spellings by applying the edit distance technique, which covers all types of spelling errors. To identify the best candidate word, this paper proposes a hybrid model that ranks the words in the candidate word list. Multiple ranking techniques such as Soundex, Shapex, LCS and N-gram are used standalone, as well as in combination, to determine the best technique in terms of F1 score. A dictionary containing 48,551 words is developed from the UMC corpus and an Urdu newspaper corpus. Our hybrid model achieves an F1 score of 94.02% when considering the top five suggested words and an F1 score of 88.29% when considering the top one suggested word.
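The pipeline outlined in the abstract (lexicon lookup for detection, edit distance for candidate generation, then ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `max_distance` and `top_k` parameters, and the toy lexicon are all hypothetical. The edit distance is assumed to be a Damerau-style variant that counts insertion, deletion, substitution and adjacent transposition. Only LCS is used for ranking here, as a stand-in for the hybrid of Soundex, Shapex, LCS and N-gram that the paper evaluates.

```python
from typing import List, Set


def edit_distance(a: str, b: str) -> int:
    """Edit distance counting insertion, deletion, substitution and
    adjacent transposition (optimal string alignment variant).
    Assumption: the abstract does not say which variant the paper uses."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def suggest(word: str, lexicon: Set[str], max_distance: int = 1, top_k: int = 5) -> List[str]:
    """Return up to top_k ranked suggestions, or [] if the word is in the lexicon."""
    if word in lexicon:  # lexicon lookup: a known word is not flagged
        return []
    candidates = [w for w in lexicon if edit_distance(word, w) <= max_distance]
    # Rank by longer LCS first, then smaller edit distance, then the word
    # itself so that ties are broken deterministically.
    candidates.sort(key=lambda w: (-lcs_length(word, w), edit_distance(word, w), w))
    return candidates[:top_k]


# Hypothetical toy lexicon; the paper's dictionary has 48,551 words.
LEXICON = {"کتاب", "کتب", "کام", "نام", "کباب"}
suggestions = suggest("کتاپ", LEXICON, max_distance=2)
```

Scanning the whole lexicon for every misspelled word is the simplest possible design and is linear in the dictionary size. The `top_k` cut-off mirrors the top-one and top-five settings under which the abstract reports F1 scores.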



Notes

  1. http://ufal.mff.cuni.cz/umc/005-en-ur/.


Funding

Not applicable.

Author information


Contributions

Romila Aziz: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Resources, Software, Writing Original Draft. Muhammad Waqas Anwar: Visualization, Supervision, Project Administration, Funding Acquisition, Writing and Review Editing, Investigation, Validation. Muhammad Hasan Jamal: Writing and Review Editing, Investigation, Validation. Usama Ijaz Bajwa: Writing and Review Editing, Validation.

Corresponding author

Correspondence to Muhammad Waqas Anwar.

Ethics declarations

Conflict of interest

We have no financial or personal relationships with other people or organizations.

Availability of Data and Material

Not applicable.

Code Availability

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Aziz, R., Anwar, M.W., Jamal, M.H. et al. A hybrid model for spelling error detection and correction for Urdu language. Neural Comput & Applic 33, 14707–14721 (2021). https://doi.org/10.1007/s00521-021-06110-7


  • DOI: https://doi.org/10.1007/s00521-021-06110-7

Keywords

  • Spelling errors
  • Candidate words
  • Error detection
  • Error correction
  • Spell checker