A novel Arabic OCR post-processing using rule-based and word context techniques

Doush, Iyad Abu; Alkhateeb, Faisal; Gharaibeh, Anwaar Hamdi

doi:10.1007/s10032-018-0297-y

Original Paper
Published: 05 April 2018

Volume 21, pages 77–89, (2018)

International Journal on Document Analysis and Recognition (IJDAR)

Iyad Abu Doush¹,²,
Faisal Alkhateeb² &
Anwaar Hamdi Gharaibeh²

Abstract

Optical character recognition (OCR) is the process of automatically recognizing characters in scanned documents so that the text can be edited, indexed, and searched, and so that storage space is reduced. The text produced by OCR usually does not match the text in the original document exactly. OCR post-processing approaches can be used to minimize the number of incorrect words in the output. Correcting OCR errors is more complicated for the Arabic language because of properties of its script: letters are connected, different letters may share the same shape, and the same letter may take different forms. This paper provides a statistical Arabic language model and post-processing techniques that hybridize the error model approach with the context approach. The proposed model is language independent and is not constrained by string length. To the best of our knowledge, this is the first end-to-end OCR post-processing model applied to the Arabic language. To train the proposed model, we build an Arabic OCR context database that contains 9000 images of Arabic text. The evaluation of the OCR post-processing results is automated using our novel alignment technique, called fast automatic hashing text alignment. Our experimental results show that, with a training data set of 1000 images, the rule-based system reduces the word error rate from 24.02% to 20.26%. After this training, applying the rule-based system to a testing data set of 500 images reduces the word error rate from 14.95% to 14.53%. The proposed hybrid OCR post-processing system, using 1000 training images, reduces the word error rate from 24.02% to 18.96%. After training the hybrid system, we used 500 images for testing, and the word error rate fell from 14.95% to 14.42%. These results show that the proposed hybrid system outperforms the rule-based system.
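For readers unfamiliar with the metric, word error rate is commonly defined as the word-level edit distance between the reference text and the OCR output, divided by the number of reference words. The sketch below assumes that standard definition and a plain Levenshtein alignment over whitespace-separated words; it is not the paper's fast automatic hashing text alignment, whose details are not given in the abstract. The function names `word_error_rate` and `relative_reduction` are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are whitespace-separated tokens and each insertion,
    deletion, or substitution costs 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


if __name__ == "__main__":
    # Word error rates reported in the abstract, in percent.
    reported = {
        "rule-based, training": (24.02, 20.26),
        "rule-based, testing": (14.95, 14.53),
        "hybrid, training": (24.02, 18.96),
        "hybrid, testing": (14.95, 14.42),
    }
    for name, (before, after) in reported.items():
        print(f"{name}: {relative_reduction(before, after):.2%} relative reduction")
```

Expressed this way, the reported figures correspond to a relative reduction of roughly 21% for the hybrid system on the training images, against roughly 16% for the rule-based system, while on the testing images both reductions are only a few percent of the starting error rate.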

Author information

Authors and Affiliations

Computer Science and Information Systems Department, American University of Kuwait, Salmiya, Kuwait
Iyad Abu Doush
Computer Sciences Department, Yarmouk University, Irbid, 21163, Jordan
Iyad Abu Doush, Faisal Alkhateeb & Anwaar Hamdi Gharaibeh

Corresponding author

Correspondence to Iyad Abu Doush.

About this article

Cite this article

Doush, I.A., Alkhateeb, F. & Gharaibeh, A.H. A novel Arabic OCR post-processing using rule-based and word context techniques. IJDAR 21, 77–89 (2018). https://doi.org/10.1007/s10032-018-0297-y

Received: 06 February 2017
Revised: 14 February 2018
Accepted: 22 March 2018
Published: 05 April 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s10032-018-0297-y
