Abstract
Although optical character recognition of printed text has been a focus of research for the last few decades, printed Arabic text, being cursive, still poses a challenge. The challenge is twofold: segmenting words into letters and identifying the individual letters. We describe a method that combines the two tasks, using multiple grids of SIFT descriptors as features. Classifiers are usually built from a large training set of images with corresponding ground truth; instead, we create a single image containing all possible symbols and construct the classifier by extracting the features of each symbol. To recognize the text in an image, the image is split into “pieces of Arabic words”, and each piece is scanned with windows of increasing size. Segmentation points are set where the classifier achieves maximal confidence. Because Arabic letters have four forms (isolated, initial, medial and final), we narrow the search space according to the location within the piece.
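The scanning step can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the SIFT-grid classifier is replaced by a toy exact-match score over strings of column codes, and the names `allowed_form`, `confidence`, `classify`, `segment_piece` and the `max_width` parameter are all hypothetical. The rule that maps a window's position to a single letter form is likewise an assumption made for illustration.

```python
from typing import List, Tuple

# A template is (letter, form, column pattern). In the chapter the features
# are grids of SIFT descriptors; here a pattern is just a string of column
# codes so that the sketch stays self-contained (assumption).
Template = Tuple[str, str, str]


def allowed_form(start: int, end: int, n: int) -> str:
    """Pick the letter form implied by the window position in the piece."""
    if start == 0 and end == n:
        return "isolated"   # window covers the whole piece
    if start == 0:
        return "initial"    # window begins the piece
    if end == n:
        return "final"      # window ends the piece
    return "medial"


def confidence(window: str, pattern: str) -> float:
    """Toy stand-in for classifier confidence: fraction of equal columns."""
    if len(window) != len(pattern):
        return 0.0
    return sum(a == b for a, b in zip(window, pattern)) / len(pattern)


def classify(window: str, form: str, templates: List[Template]) -> Tuple[str, float]:
    """Return the best (letter, confidence) among templates of one form."""
    best = ("?", 0.0)
    for letter, template_form, pattern in templates:
        if template_form != form:
            continue  # search space narrowed by position in the piece
        score = confidence(window, pattern)
        if score > best[1]:
            best = (letter, score)
    return best


def segment_piece(piece: str, templates: List[Template],
                  max_width: int) -> List[Tuple[str, int, int]]:
    """Scan with growing windows; cut where confidence is maximal."""
    n = len(piece)
    start = 0
    result = []
    while start < n:
        best_letter, best_score, best_end = "?", -1.0, start + 1
        for end in range(start + 1, min(n, start + max_width) + 1):
            form = allowed_form(start, end, n)
            letter, score = classify(piece[start:end], form, templates)
            if score > best_score:  # ties keep the narrower window
                best_letter, best_score, best_end = letter, score, end
        result.append((best_letter, start, best_end))
        start = best_end  # segmentation point
    return result


templates = [
    ("b", "initial", "ab"),
    ("l", "medial", "ccc"),
    ("m", "final", "dd"),
]
print(segment_piece("abcccdd", templates, max_width=3))
```

The sketch indexes columns abstractly from the start of the piece and ignores the right-to-left reading direction. A window that matches no template is emitted as `"?"` with width one, a fallback chosen here only to guarantee termination.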
The performance of the proposed method, when applied to printed texts and computer fonts of different sizes, was evaluated on two independent benchmarks, PATS and APTI. Our algorithm outperformed that of the creator of PATS on five out of eight fonts, achieving character correctness of 98.87%–100%. On the APTI dataset, our method was competitive with, or better than, the competition.
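The abstract does not define character correctness. One common convention, assumed here purely for illustration, scores a recognized string by its edit distance to the ground truth, normalized by the ground-truth length. The function names below are hypothetical and the formula may differ from the one used in the chapter.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_correctness(truth: str, recognized: str) -> float:
    """Percentage score under the assumed definition:
    100 * (len(truth) - edit_distance) / len(truth)."""
    if not truth:
        raise ValueError("ground truth must be non-empty")
    return 100.0 * (len(truth) - levenshtein(truth, recognized)) / len(truth)


print(character_correctness("abcd", "abxd"))  # one substitution in four characters
```

Under this definition the score can drop below zero when the recognized string contains many spurious characters, so some evaluations clamp it at zero.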
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Dershowitz, N., Rosenberg, A. (2014). Arabic Character Recognition. In: Dershowitz, N., Nissan, E. (eds) Language, Culture, Computation. Computing - Theory and Technology. Lecture Notes in Computer Science, vol 8001. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45321-2_21
DOI: https://doi.org/10.1007/978-3-642-45321-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45320-5
Online ISBN: 978-3-642-45321-2
eBook Packages: Computer Science, Computer Science (R0)