Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images

Singh, Pawan Kumar; Sarkar, Ram; Das, Nibaran; Basu, Subhadip; Kundu, Mahantapas; Nasipuri, Mita

doi:10.1007/s11042-017-4745-3

Published: 18 May 2017

Volume 77, pages 8441–8473, (2018)

Multimedia Tools and Applications

Abstract

A handwritten document image dataset is one of the basic necessities for research on Optical Character Recognition (OCR) systems. In a multilingual country like India, handwritten documents often contain more than one script, leading to complex pattern analysis problems. In this paper, we highlight two such situations, in which Devanagari and Bangla, two of the most widely used scripts in the Indian subcontinent, are each used alongside Roman script in documents. We address three key challenges: 1) collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages; 2) script-level annotation of 18931 Bangla words, 15528 Devanagari words and 10331 Roman words in those 300 document pages; and 3) development of a bi-script and tri-script word-level script identification module using a Modified log-Gabor filter as the feature extractor. The technique is statistically validated using multiple classifiers, and the Multi-Layer Perceptron (MLP) classifier is found to perform best. Average word-level script identification accuracies of 92.32%, 95.30% and 93.78% are achieved using 3-fold cross-validation for the Bangla-Roman, Devanagari-Roman and Bangla-Devanagari-Roman databases, respectively. Both mixed-script document databases, along with the script-level annotations and the 44790 extracted word images of the three scripts, are freely available at https://code.google.com/p/cmaterdb/.
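The word counts and the 3-fold protocol quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the tri-script experiment uses all annotated words and that the folds are stratified by script, neither of which the abstract states. The names `WORD_COUNTS`, `stratified_fold_sizes` and `mean_accuracy` are hypothetical helpers, not part of the authors' code.

```python
# Word counts per script, as reported in the abstract.
WORD_COUNTS = {"Bangla": 18931, "Devanagari": 15528, "Roman": 10331}


def stratified_fold_sizes(counts, k=3):
    """Return, for each of k folds, how many words of each script it holds.

    Assumption (not stated in the abstract): words are dealt out evenly per
    script, so every fold keeps roughly the script proportions of the full set.
    """
    folds = []
    for i in range(k):
        fold = {}
        for script, n in counts.items():
            base, extra = divmod(n, k)
            # The first `extra` folds each take one leftover word.
            fold[script] = base + (1 if i < extra else 0)
        folds.append(fold)
    return folds


def mean_accuracy(per_fold_accuracies):
    """Average the accuracies measured on the k held-out folds."""
    return sum(per_fold_accuracies) / len(per_fold_accuracies)


total_words = sum(WORD_COUNTS.values())
folds = stratified_fold_sizes(WORD_COUNTS, k=3)
fold_totals = [sum(fold.values()) for fold in folds]

print(total_words)   # 44790, matching the abstract's extracted-word total
print(fold_totals)   # [14931, 14930, 14929]
```

Under these assumptions each fold holds about 14930 word images, so every reported average would be the mean of three held-out accuracies, which is what `mean_accuracy` computes. The per-fold accuracies themselves are not given in the abstract, so none are filled in here.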

Acknowledgements

The authors are thankful to CMATER and the Project on Storage Retrieval and Understanding of Video for Multimedia (SRUVM) of the Computer Science and Engineering Department, Jadavpur University, for providing infrastructure facilities during the course of this work. The work reported here has been partially funded by University with Potential for Excellence (UPE), Phase-II, UGC, Government of India. Many people helped to make the database fit for use, and the authors are grateful to everyone who contributed data to this project.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
Pawan Kumar Singh, Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu & Mita Nasipuri

Corresponding author

Correspondence to Ram Sarkar.

About this article

Cite this article

Singh, P.K., Sarkar, R., Das, N. et al. Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images. Multimed Tools Appl 77, 8441–8473 (2018). https://doi.org/10.1007/s11042-017-4745-3

Received: 20 September 2016
Revised: 20 February 2017
Accepted: 21 April 2017
Published: 18 May 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s11042-017-4745-3
