Nastalique segmentation-based approach for Urdu OCR

Hussain, Sarmad; Ali, Salman; Akram, Qurat ul Ain

doi:10.1007/s10032-015-0250-2

619 Accesses
25 Citations
Explore all metrics

Abstract

Much work on Arabic language optical character recognition (OCR) has been on Naskh writing style. Nastalique style, used for most of languages using Arabic script across Southern Asia, is much more challenging to process due to its compactness, cursiveness, higher context sensitivity and diagonality. This makes the Nastalique writing more complex with multiple letters horizontally overlapping each other. Due to these reasons, existing methods used for Naskh would not work for Nastalique and therefore most work on Nastalique has used non-segmentation methods. The current paper presents new approach for segmentation-based analysis for Nastalique style. The paper explains the complexity of Nastalique, why Naskh based techniques cannot work for Nastalique, and proposes a segmentation-based method for developing Nastalique OCR, deriving principles and techniques for the pre-processing and recognition. The OCR is developed for Urdu language. The system is optimized using 79,093 instances of 5249 main bodies derived from a corpus of 18 million words, giving recognition accuracy of 97.11 %. The system is then tested on document images of books with 87.44 % main body recognition accuracy. The work is extensible to other languages using Nastalique.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An Efficient Segmentation Algorithm for Arabic Handwritten Characters Recognition System

A Novel Approach to Printed Arabic Optical Character Recognition

Article 20 September 2021

Neural Networks Pipeline for Offline Machine Printed Arabic OCR

Article 27 October 2017

Notes

Population collated from http://www.ethnologue.com/region/SAS.
Based on the graphemes used for Nafees Nastalique font for Urdu.
Semi circular stroke having end point traversal path in reverse order of normal writing style.

References

Davis, M., Iancu, L.: Unicode Text Segmentation. Unicode Consortium, Mountain View, USA (2015)
Google Scholar
Naseem, T., Hussain, S.: A novel approach for ranking spelling error corrections for Urdu. Language resources and evaluation, Springer 41, (2007)
Akram, M., Hussain, S.: Word Segmentation for Urdu OCR System. In: 8th Workshop on Asian language resources, COLIG 2010. Beijing (2010)
Hussain, S.: In 12th AMIC Annual Conference on E-Worlds: Governments, Business and Civil Society, Asian Media Information Center, Singapore. www.LICT4D.asia/Fonts/Nafees_Nastalique (2003)
Wali, A., Hussain, S.: Context sensitive shape-substitution in nastaliq writing system: analysis and formulation. In: International joint conferences on computer, information, and systems sciences, and engineering (2006)
Hussain, S., Rahman, S., Wali, A., Gulzar, A., Rahman, S.J.: Grammatical analysis of Nastalique writing style of Urdu. Center for Research in Urdu language processing, FAST-nu, Lahore, Pakistan (2002)
Ijaz, M., Hussain, S.: Corpus based Urdu Lexicon development. In Conference on Language Technology, Peshawar (2007)
Shaw, B., Parui, S.K., Shridhar, M.: Offline handwritten Devanagari word recognition: a segmentation based approach. In: 19th international conference on pattern recognition (2008)
Lorigo, L., Govindaraju, V.: Segmentation and pre-recognition of Arabic handwriting. In: Eight international conference on document analysis and recognition (2005)
Cheung, A., Bennamoun, M., Bergmann, N.M.: An Arabic optical character recognition system using recognition-based segmentation. Pattern Recognit. 34(2), 215–233 (2001)
Article MATH Google Scholar
Mehran, R., Pirsiavash, H., Razzazi, F.: A front-end OCR for omni-font Persian/Arabic cursive printed documents. In: Digital image computing on techniques and applications. DC, Washington (2005)
Safabakhsh, R., Abidi, P.: Nastaaligh handwritten word recognition using a continuous-density variable-duration HMM. Arab. J. Sci. Eng. 30, 95–118 (2005)
Google Scholar
Javed, S.T., Hussain, S.: ”Segmentation Based Urdu Nastalique OCR,” in 18th Iberoamerican Congress on Pattern Recognition, Havana CUBA, (2013)
Muaz, A.: Urdu optical character recognition system. MS Thesis Report, National University of Computer and Emerging Sciences, Lahore (2010)
Sankaran, N., Jawahar, C.V.: Recognition of printed Devanagari text using BLSTM Neural Network. In: 21st international conference on pattern recognition (2012)
Al-Muhtaseb, H.A., Mahmoud, S.A., Qahwaji, R.S.: Recognition of off-line printed Arabic text using Hidden Markov models. Signal Process. 88(12), 2902–2912 (2008)
Article MATH Google Scholar
AlKhateeb, J.H., Jiang, J., Ren, J., Khelifi, F., Ipson, S.S.: Multiclass classification of unconstrained handwritten arabic words using machine learning approaches. Open Signal Process. J. 2(1), 21–28 (2009)
Article Google Scholar
AlKhateeb, J.H., Ren, J., Jiang, J., Al-Muhtaseb, H.: Offline handwritten Arabic cursive text recognition using Hidden Markov models and re-ranking. Pattern Recognit. Lett. 32(8), 1081–1088 (2011)
Article Google Scholar
Khorsheed, M.S.: Offline recognition of omnifont Arabic text usingthe HMM ToolKit (HTK). Pattern Recognit. Lett. 28(12), 1563–1571 (2007)
Article Google Scholar
Ul-Hasan, A., Ahmed, S.B., Rashid, S.F., Shafait, F., Breuel, T.M.: Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks. In: International conference on document analysis and recognition (2013)
Sabbour, N., Shafait, F.: A segmentation free approach to Arabic and Urdu OCR. In: SPIE 8658, (2013)
Javed, S.T., Hussain, S., Maqbool, A., Asloob, S., Jamil, S., Mohsin, H.: Segmentation Free Nastalique Urdu OCR, vol. 46, pp. 456–461. World Academy of Science, Engineering and Technology (2010)
Shah, Z., Saleem, F.: Ligature based optical character recognition of Urdu, Nastaleeq font. In: International multi topic conference. Karachi (2002)
Lehal, G.S., Rana, A.: Recognition of Nastalique Urdu ligatures. In: 4th International workshop on multilingual OCR. NY, New York (2013)
Sattar, S.A.: A technique for the design and implementation of an OCR for printed Nastalique text. Degree of Doctor of Philosophy Thesis Report. N.E.D University of Engineering and Technology, Karachi (2009)
Satti, D.A.: Offline Urdu Nastaliq OCR for printed text using analytical approach. MS Thesis report, Quaid-i-Azam University, Islamabad (2013)
Akram, Q., Hussain, S., Niazi, A., Anjum, U., Irfan, F.: Adapting Tesseract for complex scripts: an example for Urdu Nastalique. In: 11th IAPR Workshop on document analysis systems. Tours (2014)
Rashwan, M.A., Fakhr, M.W., Attia, M., El-Mahallawy, M.: Arabic OCR system analogous to HMM-based ASR systems; implementation and evaluation. J. Eng. Appl. Sci. Cairo 54(6), 653–672 (2007)
Google Scholar
Naz, M., Akram, Q., Hussain, S.: Binarization and its evaluation for Urdu Nastalique document images. In: Proceedings of the 16th international multi topic conference, Lahore (2013)
Huang, L., Wan, G., Liu, C.: An improved parallel thinning algorithm. In: Seventh international conference on document analysis and recognition (2003)
Jiang, Y., Hongwei, G., Chao, L.: A filtering algorithm for removing salt and pepper noise and preserving details of images. In: 6th international conference on wireless communications networking and mobile computing (2010)
Li, F., Fan, J.: Salt and pepper noise removal by adaptive median filter and minimal surface inpainting. In: 2nd international congress on image and signal processing (2009)
Al-Khaffaf, H.S., Talib, A.Z., Abdul, R.: Salt and pepper noise removal from document images. In: Proceedings of the 1st international visual informatics conference on visual informatics (2009)
Ahmed, N., Natarajan, T., Rao, K.R.: Discrete cosine transform. IEEE Trans. Comput. C–25, 90–93 (1974)
Article MathSciNet Google Scholar
Rabiner, L.R.: Mathematical foundations of hidden Markov models. In: Proceedings of the NATO advanced study institute on recent advances in speech understanding and dialog systems (1988)
Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., Wolf, P., Woelfel, J.: Sphinx-4: a flexible open source framework for speech recognition. Sun Microsystems Inc, Mountain View, CA (2004)
CLE, Text corpora. In: Center for Language Engineering, 10 December 2014. http://www.cle.org.pk/clestore/index.htm (2014). Accessed 11 May 2015
Urooj, S., Hussain, S., Adeeba, F., Jabeen, F., Parveen, R.: CLE Urdu digest corpus. In: Conference on language and technology 2012 (CLT12), Lahore (2012)
CLE, CLE Urdu image corpus 14 point size. In: Center for Language Engineering, 12 July 2012. http://www.cle.org.pk/clestore/cleurduimagecorpus14pt.htm (2012). Accessed 11 May 2015

Download references

Acknowledgments

This work is available at www.UrduOCR.net and has been supported by Urdu Nastalique OCR research project Grant by ICTR&D Fund, Ministry of IT, Govt. of Pakistan.

Author information

Authors and Affiliations

Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan
Sarmad Hussain, Salman Ali & Qurat ul Ain Akram

Authors

Sarmad Hussain
View author publications
You can also search for this author in PubMed Google Scholar
Salman Ali
View author publications
You can also search for this author in PubMed Google Scholar
Qurat ul Ain Akram
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Qurat ul Ain Akram.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hussain, S., Ali, S. & Akram, Q.u.A. Nastalique segmentation-based approach for Urdu OCR. IJDAR 18, 357–374 (2015). https://doi.org/10.1007/s10032-015-0250-2

Download citation

Received: 05 November 2014
Revised: 14 May 2015
Accepted: 30 May 2015
Published: 13 August 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10032-015-0250-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Nastalique segmentation-based approach for Urdu OCR

Abstract

Access this article

Similar content being viewed by others

An Efficient Segmentation Algorithm for Arabic Handwritten Characters Recognition System

A Novel Approach to Printed Arabic Optical Character Recognition

Neural Networks Pipeline for Offline Machine Printed Arabic OCR

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Nastalique segmentation-based approach for Urdu OCR

Abstract

Access this article

Similar content being viewed by others

An Efficient Segmentation Algorithm for Arabic Handwritten Characters Recognition System

A Novel Approach to Printed Arabic Optical Character Recognition

Neural Networks Pipeline for Offline Machine Printed Arabic OCR

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation