Nastalique segmentation-based approach for Urdu OCR

Original Paper

DOI: 10.1007/s10032-015-0250-2

Cite this article as:
Hussain, S., Ali, S. & Akram, Q.A. IJDAR (2015) 18: 357. doi:10.1007/s10032-015-0250-2

Abstract

Much work on Arabic language optical character recognition (OCR) has focused on the Naskh writing style. The Nastalique style, used for most languages written in Arabic script across Southern Asia, is much more challenging to process because of its compactness, cursiveness, higher context sensitivity and diagonality. These properties make Nastalique writing more complex, with multiple letters horizontally overlapping each other. For these reasons, existing methods used for Naskh would not work for Nastalique, and therefore most work on Nastalique has used non-segmentation methods. The current paper presents a new approach for segmentation-based analysis of the Nastalique style. The paper explains the complexity of Nastalique and why Naskh-based techniques cannot work for it, and proposes a segmentation-based method for developing Nastalique OCR, deriving principles and techniques for pre-processing and recognition. The OCR is developed for the Urdu language. The system is optimized using 79,093 instances of 5,249 main bodies derived from a corpus of 18 million words, giving a recognition accuracy of 97.11%. The system is then tested on document images of books, with a main body recognition accuracy of 87.44%. The work is extensible to other languages using Nastalique.

Keywords

Urdu OCR; Segmentation-based; Classification and recognition; HMMs; Arabic; Naskh; Nastalique; Grapheme inventory; Ligatures; Main bodies; DCTs

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan