Abstract
A handwritten document image dataset is a basic necessity for research on Optical Character Recognition (OCR) systems. In a multilingual country like India, handwritten documents often contain more than one script, which leads to complex pattern analysis problems. In this paper, we highlight two such situations, in which Devanagari and Bangla, the two most widely used scripts in the Indian subcontinent, are each used alongside Roman script in the same document. We address three key challenges: 1) collection, compilation and organization of benchmark databases of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document page images; 2) script-level annotation of 18931 Bangla words, 15528 Devanagari words and 10331 Roman words in those 300 document pages; and 3) development of bi-script and tri-script word-level script identification modules that use a Modified log-Gabor filter as the feature extractor. The technique is statistically validated using multiple classifiers, and the Multi-Layer Perceptron (MLP) classifier is found to perform best. Using 3-fold cross-validation, average word-level script identification accuracies of 92.32%, 95.30% and 93.78% are achieved for the Bangla-Roman, Devanagari-Roman and Bangla-Devanagari-Roman databases respectively. Both mixed-script document databases, along with the script-level annotations and the 44790 extracted word images of the three scripts, are freely available at https://code.google.com/p/cmaterdb/.
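To make the feature-extraction step more concrete, the following is a minimal sketch of a log-Gabor filter bank applied to a word image. It uses the standard log-Gabor transfer function, not the Modified variant developed in the paper, whose details are not given in the abstract. The function names (`log_gabor_bank`, `log_gabor_features`), the parameter values (3 scales, 4 orientations, bandwidth settings), the 64 x 128 image size and the choice of mean and standard deviation of the response magnitude as features are all assumptions made for illustration.

```python
import numpy as np


def log_gabor_bank(size, n_scales=3, n_orients=4, min_wavelength=4.0,
                   mult=2.0, sigma_on_f=0.55, d_theta_on_sigma=1.5):
    """Build standard log-Gabor transfer functions in the frequency domain.

    All parameter defaults are illustrative assumptions, not values from the paper.
    Returns an array of shape (n_scales * n_orients, rows, cols).
    """
    rows, cols = size
    # Frequency coordinates (cycles/pixel) in the same layout as np.fft.fft2.
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0                 # avoid log(0); the DC term is zeroed below
    theta = np.arctan2(-fy, fx)        # orientation of each frequency sample
    theta_sigma = np.pi / n_orients / d_theta_on_sigma

    filters = []
    for s in range(n_scales):
        f0 = 1.0 / (min_wavelength * mult ** s)   # centre frequency of this scale
        radial = np.exp(-(np.log(radius / f0) ** 2)
                        / (2 * np.log(sigma_on_f) ** 2))
        radial[0, 0] = 0.0             # a log-Gabor filter has no DC component
        for o in range(n_orients):
            angle = o * np.pi / n_orients
            # Angular distance to the filter orientation, wrapped to [-pi, pi].
            d = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-(d ** 2) / (2 * theta_sigma ** 2))
            filters.append(radial * angular)
    return np.stack(filters)


def log_gabor_features(image, bank):
    """Mean and standard deviation of each filter's response magnitude."""
    spectrum = np.fft.fft2(np.asarray(image, dtype=float))
    # ifft2 acts on the last two axes, so all filters are applied at once.
    magnitude = np.abs(np.fft.ifft2(spectrum[None, :, :] * bank))
    return np.concatenate([magnitude.mean(axis=(1, 2)),
                           magnitude.std(axis=(1, 2))])


# Synthetic stand-in for a size-normalised word image (not real data).
word = np.zeros((64, 128))
word[:, ::8] = 1.0
features = log_gabor_features(word, log_gabor_bank(word.shape))
print(features.shape)
```

In a pipeline of the kind the abstract describes, one such feature vector per word image would be passed to a classifier such as an MLP and evaluated with 3-fold cross-validation; that stage is not shown here.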
Acknowledgements
The authors are thankful to CMATER and the Project on Storage Retrieval and Understanding of Video for Multimedia (SRUVM) of the Computer Science and Engineering Department, Jadavpur University, for providing infrastructure facilities during the course of the work. The work reported here has been partially funded by University with Potential for Excellence (UPE), Phase-II, UGC, Government of India. Many people also helped make the database fit for use, and the authors are grateful to everyone who contributed data to this project.
Cite this article
Singh, P.K., Sarkar, R., Das, N. et al. Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images. Multimed Tools Appl 77, 8441–8473 (2018). https://doi.org/10.1007/s11042-017-4745-3
DOI: https://doi.org/10.1007/s11042-017-4745-3