Zone-based keyword spotting in Bangla and Devanagari documents

Bhunia, Ayan Kumar; Roy, Partha Pratim; Sain, Aneeshan; Pal, Umapada

doi:10.1007/s11042-019-08442-y

Published: 24 July 2020

Multimedia Tools and Applications, volume 79, pages 27365–27389 (2020)

Ayan Kumar Bhunia¹,
Partha Pratim Roy ORCID: orcid.org/0000-0003-4526-2015²,
Aneeshan Sain³ &
Umapada Pal⁴


Abstract

In this paper, we present a system for spotting words in text lines of offline documents written in Indic scripts such as Bangla (Bengali) and Devanagari. It was recently shown that zone-wise recognition improves word recognition performance over conventional full-word recognition in Indic scripts such as Bangla, Devanagari, and Gurumukhi (Roy et al. 2016, Pattern Recogn 60:1057–1075; Bhunia et al. 2018, Pattern Recogn 79:12–31). Inspired by this idea, we adopt the zone segmentation approach and use middle-zone information to improve traditional word spotting performance. To avoid the problems of heuristic zone segmentation, we propose a new HMM-based approach that segments the upper and lower zone components from text line images. Candidate keywords are searched for in a line without segmenting characters or words. We also propose a feature that combines foreground and background information of text line images for keyword spotting with character filler models. Using both foreground and background information yields a significant improvement in performance over using either alone. The Pyramid Histogram of Oriented Gradient (PHOG) feature is used in our word spotting framework. Experiments show that the proposed zone-segmentation-based system outperforms traditional word spotting approaches.
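
The abstract names PHOG as the feature used in the spotting framework but gives no configuration. As a rough illustration only, the sketch below computes a PHOG-style descriptor for a grayscale patch with NumPy: gradient orientations are binned into magnitude-weighted histograms over a spatial pyramid of 1×1, 2×2, 4×4 cells, and the histograms are concatenated. The function name `phog`, the parameters `levels` and `bins`, the unsigned orientation range, and the L1 normalization are all assumptions made for this example, not the authors' settings.

```python
import numpy as np


def phog(image, levels=2, bins=8):
    """Illustrative PHOG descriptor for a 2-D grayscale array.

    All parameter choices here are assumptions for the sketch;
    the abstract does not specify them.
    """
    img = np.asarray(image, dtype=float)

    # Image gradients along rows (gy) and columns (gx).
    gy, gx = np.gradient(img)
    magnitude = np.hypot(gx, gy)

    # Unsigned orientation in [0, pi), quantized into `bins` bins.
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_index = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    height, width = img.shape
    histograms = []
    for level in range(levels + 1):
        cells = 2 ** level  # the grid is cells x cells at this level
        row_edges = np.linspace(0, height, cells + 1).astype(int)
        col_edges = np.linspace(0, width, cells + 1).astype(int)
        for r in range(cells):
            for c in range(cells):
                rows = slice(row_edges[r], row_edges[r + 1])
                cols = slice(col_edges[c], col_edges[c + 1])
                # Magnitude-weighted orientation histogram of one cell.
                hist = np.bincount(
                    bin_index[rows, cols].ravel(),
                    weights=magnitude[rows, cols].ravel(),
                    minlength=bins,
                )
                histograms.append(hist)

    vector = np.concatenate(histograms)
    total = vector.sum()
    # L1-normalize; a flat patch has no gradients and stays all zeros.
    return vector / total if total > 0 else vector
```

With `levels=2` and `bins=8` the descriptor has 8 × (1 + 4 + 16) = 168 dimensions. In an HMM pipeline a descriptor of this kind would typically be computed for each position of a sliding window along the text line; the window geometry is not stated in the abstract and is left open here.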


References

Ahmed R, Al-Khatib WG, Mahmoud S (2016) A survey on handwritten documents word spotting. Int J Multimed Inf Retr:1–17
Almazán J, Gordo A, Fornés A, Valveny E (2014) Word spotting and recognition with embedded attributes. IEEE Trans Pattern Anal Mach Intell 36(12):2552–2566
Antonacopoulos A, Downton A (2007) Special issue on the analysis of historical documents. Int J Doc Anal Recognit 9(2):75–77
Bai Y, Guo L, Jin L, Huang Q (2009) A novel feature extraction method using PHOG for smile recognition. In: Proc International Conference on Image Processing, pp 3305–3308
Bhunia AK, Das A, Roy PP, Pal U (2015) A comparative study of features for handwritten Bangla text recognition. In: International Conference on Document Analysis and Recognition, pp 636–640
Bhunia AK, Roy PP, Mohta A, Pal U (2018) Cross-language framework for word recognition and spotting of Indic scripts. Pattern Recogn 79:12–31
Bhunia AK, Das A, Bhunia AK, Kishore PSR, Roy PP (2019) Handwriting recognition in low-resource scripts using adversarial learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), [Accepted]
Chaudhuri BB, Pal U (1998) A complete printed Bangla OCR system. Pattern Recogn 31(5):531–549
Das A, Bhunia AK, Roy PP, Pal U (2015) Handwritten word spotting in Indic scripts using foreground and background information. In: Proc. Asian Conference on Pattern Recognition (ACPR), pp 426–430
Dutta K, Krishnan P, Mathew M, Jawahar CV (2018) Towards spotting and recognition of handwritten words in Indic Scripts. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, pp 32–37
Fischer A, Keller A, Frinken V, Bunke H (2012) Lexicon-free handwritten word spotting using character HMMs. Pattern Recogn Lett 33:934–942
Frinken V, Fischer A, Manmatha R, Bunke H (2012) A novel word spotting method based on recurrent neural networks. IEEE Trans Pattern Anal Mach Intell 34(2):211–224
Jayadevan R, Kolhe SR, Patil PM, Pal U (2012) Automatic processing of handwritten bank cheque images: a survey. Int J Doc Anal Recognit 15(4):267–296
Kavallieratou E, Fakotakis N, Kokkinakis G (2001) Slant estimation algorithm for OCR system. Pattern Recogn 34:2515–2522
Leydier Y, Ouji A, Le-Bourgeois F, Emptoz H (2009) Towards an omni-lingual word retrieval system for ancient manuscripts. Pattern Recogn 42:2089–2105
Leydier Y, Ouji A, LeBourgeois F, Emptoz H (2009) Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recogn 42(9):2089–2105
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Nagy G, Lopresti D (2006) Interactive document processing and digital libraries. In: Proc. 2nd Internat. Workshop on Document Image Analysis for Libraries, pp 2–11
Niyogi D, Srihari SN, Govindaraju V (1997) Analysis of printed forms. In: Bunke H, Wang PSP (eds) Handbook of character recognition and document image analysis. World Scientific Publishing, pp 485–502
Rath TM, Manmatha R (2007) Word spotting for historical documents. Int J Doc Anal Recognit:139–152
Rothacker L, Sudholt S, Rusakov E, Kasperidus M, Fink GA (2017) Word hypotheses for segmentation-free word spotting in historic document images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, pp 1174–1179
Rothfeder JL, Feng S, Rath TM (2003) Using corner feature correspondences to rank word images by similarity. In: Proc Workshop on Document Image Analysis and Retrieval, pp 30–35
Roy PP, Pal U, Lladós J (2008) Morphology based handwritten line segmentation using foreground and background information. In: Proc International Conference on Frontiers in Handwriting Recognition, pp 241–246
Roy PP, Pal U, Lladós J (2012) Text line extraction in graphical documents using background and foreground information. Int J Doc Anal Recognit 15(3):227–241
Roy PP, Rayar F, Ramel JY (2015) Word spotting in historical documents using primitive based dynamic programming. Image Vis Comput 44:15–28
Roy PP, Bhunia AK, Das A, Dey P, Pal U (2016) HMM-based Indic handwritten word recognition using zone segmentation. Pattern Recogn 60:1057–1075
Roy PP, Bhunia AK, Bhattacharyya A, Pal U (2018) Word searching in scene image and video frame in multi-script scenario using dynamic shape coding. Multimed Tools Appl:1–35
Rusinol M et al (2011) Browsing heterogeneous document collections by a segmentation-free word spotting method. In: Proc. International Conference on Document Analysis and Recognition, pp 63–67
Sakoe H, Chiba S (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust Speech Signal Process 26(1):43–49
Serrano JR, Perronnin F (2009) Handwritten word-spotting using hidden Markov models and universal vocabularies. Pattern Recogn 42(9):2106–2116
Srihari SN, Keubert EJ (1997) Integration of handwritten address interpretation technology into the United States postal service remote computer reader system. In: Proc International Conference on Document Analysis and Recognition, pp 892–896
Srihari SN, Huang C, Srinivasan H (2005) A search engine for handwritten documents. Document Recognition and Retrieval, pp 66–75
Sudholt S, Fink GA (2016) PHOCNet: a deep convolutional neural network for word spotting in handwritten documents. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), October. IEEE, pp 277–282
Sudholt S, Fink GA (2017) Evaluating word string embeddings and loss functions for CNN-based word spotting. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, November. IEEE, pp 493–498
Tarafdar A, Mondal R, Pal S, Pal U, Kimura F (2010) Shape code based word-image matching for retrieval of Indian multi-lingual documents. In: Proc International Conference on Pattern Recognition, pp 1989–1992
Wshah S, Kumar G, Govindaraju V (2014) Statistical script independent word spotting in offline handwritten documents. Pattern Recogn 47(3):1039–1050
Young S (2006) The HTK book, version 3.4
Zhang X, Pal U, Tan CL (2014) Segmentation-free Keyword Spotting for Bangla Handwritten Documents. In: Proc. International Conference on Frontiers in Handwriting Recognition, pp 381–386

Author information

Authors and Affiliations

Department of ECE, Institute of Engineering & Management, Kolkata, India
Ayan Kumar Bhunia
Department of CSE, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
Partha Pratim Roy
Department of EE, Institute of Engineering & Management, Kolkata, India
Aneeshan Sain
CVPR Unit, Indian Statistical Institute, Kolkata, India
Umapada Pal

Corresponding author

Correspondence to Ayan Kumar Bhunia.

Ethics declarations

Conflict of interest

None.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bhunia, A.K., Roy, P.P., Sain, A. et al. Zone-based keyword spotting in Bangla and Devanagari documents. Multimed Tools Appl 79, 27365–27389 (2020). https://doi.org/10.1007/s11042-019-08442-y

Received: 30 May 2018
Revised: 10 October 2019
Accepted: 07 November 2019
Published: 24 July 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s11042-019-08442-y
