Abstract
This paper presents a new lexicon reduction method for historical Arabic scripts that compares the input subword image with the lexicon entries and selects the most similar ones. When two subword images are compared, more importance is given to the prominent shape regions, defined as those local regions of a subword that distinguish it from the other lexicon subwords. The method first applies a retrieval-based measure to compute a distinction score for each local region, indicating how prominent that region is. These scores are then used in a proposed distance measure to modulate the weights of the corresponding shape features, so that the most distinctive regions receive more weight. A global shape-based lexicon reduction based on the characteristic loci is also used to complement the local subword descriptors. We evaluated the proposed method on the Ibn Sina database, which contains more than 12,000 subwords extracted from a historical Arabic document, and achieved a degree of reduction of 98.15% with an accuracy of 90.15%.
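The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual distance measure or the way distinction scores are computed, so everything below is an assumption: the weighted Euclidean form, the function names `weighted_distance`, `reduce_lexicon` and `degree_of_reduction`, and the toy feature values and scores are all hypothetical. The degree of reduction is taken here in its usual sense, as the fraction of lexicon entries discarded.

```python
from math import sqrt

def weighted_distance(query, entry, weights):
    """Distance between two subwords given as lists of per-region feature vectors.

    weights[i] stands in for the distinction score of region i of the lexicon
    entry (assumed form: weighted Euclidean, not the paper's exact measure).
    """
    total = 0.0
    for q_region, e_region, w in zip(query, entry, weights):
        squared = sum((a - b) ** 2 for a, b in zip(q_region, e_region))
        total += w * squared
    return sqrt(total)

def reduce_lexicon(query, lexicon, weights, keep):
    """Return the labels of the `keep` lexicon entries closest to the query."""
    ranked = sorted(
        lexicon,
        key=lambda label: weighted_distance(query, lexicon[label], weights[label]),
    )
    return ranked[:keep]

def degree_of_reduction(lexicon_size, shortlist_size):
    """Fraction of the lexicon discarded by the reduction step."""
    return 1.0 - shortlist_size / lexicon_size

# Toy lexicon: four subwords, two local regions each, 2-D features per region.
lexicon = {
    "a": [[0.0, 0.0], [1.0, 1.0]],
    "b": [[0.0, 0.0], [5.0, 5.0]],
    "c": [[4.0, 4.0], [1.0, 1.0]],
    "d": [[9.0, 9.0], [9.0, 9.0]],
}
# Hypothetical distinction scores; a higher score marks a more prominent region.
weights = {
    "a": [0.2, 0.8],
    "b": [0.2, 0.8],
    "c": [0.7, 0.3],
    "d": [0.5, 0.5],
}
query = [[0.5, 0.0], [1.0, 1.5]]

shortlist = reduce_lexicon(query, lexicon, weights, keep=2)
print(shortlist, degree_of_reduction(len(lexicon), len(shortlist)))
```

In this sketch, accuracy would be measured separately, as the fraction of queries whose true label survives in the shortlist; a smaller `keep` raises the degree of reduction but risks lowering that accuracy.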
Notes
Histogram of oriented gradients.
Principal component analysis.
Acknowledgments
We thank our colleagues at the Synchromedia Laboratory, Reza Farrahi Moghaddam and Youssouf Chherawala, who provided us with helpful comments on this work.
Cite this article
Davoudi, H., Cheriet, M. & Kabir, E. Lexicon reduction of handwritten Arabic subwords based on the prominent shape regions. IJDAR 19, 139–153 (2016). https://doi.org/10.1007/s10032-016-0262-6
DOI: https://doi.org/10.1007/s10032-016-0262-6