Lexicon reduction of handwritten Arabic subwords based on the prominent shape regions

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

This paper presents a new lexicon reduction method for historical Arabic scripts that compares the input subword image with the lexicon entries and selects the most similar ones. When two subword images are compared, more importance is given to the prominent shape regions, defined as the local regions of a subword that distinguish it from the other lexicon subwords. First, a retrieval-based measure computes a distinction score for each local region, indicating how prominent that region is. These scores are then used in the proposed distance measure to modulate the weights of the corresponding shape features, so that the most distinctive regions receive more weight. A global shape-based lexicon reduction based on characteristic loci is also used to complement the local subword descriptors. We evaluated the proposed method on the Ibn Sina database, which contains more than 12,000 subwords extracted from a historical Arabic document, and achieved a degree of reduction of 98.15% with an accuracy of 90.15%.
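
The abstract does not give the distance measure itself, so the following is only a minimal sketch of the general idea: a weighted distance in which each local region's contribution is scaled by its distinction score, followed by keeping the closest lexicon entries. The weighted Euclidean form, the data layout, and all names (`weighted_distance`, `reduce_lexicon`, `degree_of_reduction`, `accuracy`, the `regions` and `scores` keys) are assumptions made for illustration, not the authors' formulation. The two evaluation helpers use the definitions common in the lexicon reduction literature (fraction of the lexicon discarded, and fraction of queries whose true label survives), which is also an assumption about how the reported figures were computed.

```python
from math import sqrt


def weighted_distance(query, entry, weights):
    """Weighted Euclidean distance over per-region feature vectors.

    query, entry: lists of regions, each region a list of floats.
    weights: one distinction score per region of the lexicon entry
             (hypothetical values; higher means more prominent).
    """
    total = 0.0
    for q_region, e_region, w in zip(query, entry, weights):
        # Squared difference of one region, scaled by its distinction score.
        total += w * sum((q - e) ** 2 for q, e in zip(q_region, e_region))
    return sqrt(total)


def reduce_lexicon(query, lexicon, keep):
    """Return the labels of the `keep` lexicon entries closest to the query."""
    ranked = sorted(
        lexicon,
        key=lambda item: weighted_distance(query, item["regions"], item["scores"]),
    )
    return [item["label"] for item in ranked[:keep]]


def degree_of_reduction(lexicon_size, reduced_size):
    """Fraction of the lexicon that was discarded."""
    return 1.0 - reduced_size / lexicon_size


def accuracy(reduced_lists, true_labels):
    """Fraction of queries whose true label is still in the reduced list."""
    hits = sum(1 for kept, truth in zip(reduced_lists, true_labels) if truth in kept)
    return hits / len(true_labels)


# Toy data: three lexicon entries, two regions each, two features per region.
lexicon = [
    {"label": "A", "regions": [[0.0, 0.0], [1.0, 1.0]], "scores": [0.9, 0.1]},
    {"label": "B", "regions": [[1.0, 0.0], [1.0, 1.0]], "scores": [0.5, 0.5]},
    {"label": "C", "regions": [[5.0, 5.0], [0.0, 0.0]], "scores": [0.2, 0.8]},
]
query = [[0.1, 0.0], [3.0, 3.0]]

shortlist = reduce_lexicon(query, lexicon, keep=2)
print(shortlist, degree_of_reduction(len(lexicon), len(shortlist)))
```

In the toy data the query differs strongly from entry "A" in the second region, but that region carries a low distinction score for "A", so "A" still ranks first. This is the effect the weighting is meant to produce: differences in non-prominent regions are discounted.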

Notes

  1. Histogram of oriented gradients.

  2. Principal component analysis.

Acknowledgments

We thank our colleagues at the Synchromedia Laboratory, Reza Farrahi Moghaddam and Youssouf Chherawala, who provided helpful comments on this work.

Author information

Corresponding author

Correspondence to Homa Davoudi.

About this article

Cite this article

Davoudi, H., Cheriet, M. & Kabir, E. Lexicon reduction of handwritten Arabic subwords based on the prominent shape regions. IJDAR 19, 139–153 (2016). https://doi.org/10.1007/s10032-016-0262-6

  • DOI: https://doi.org/10.1007/s10032-016-0262-6