A benchmark image database of isolated Bangla handwritten compound characters

  • Nibaran Das
  • Kallol Acharya
  • Ram Sarkar
  • Subhadip Basu
  • Mahantapas Kundu
  • Mita Nasipuri
Original Paper

Abstract

This work presents a benchmark image database of isolated handwritten Bangla compound characters that occur in standard Bangla literature. A survey of more than 2 million Bangla words revealed that around 334 compound characters exist in Bangla script. Of these, only around 171 character classes form unique pattern shapes, and some of these classes are often written in multiple styles. Altogether, 55,278 isolated character images, belonging to 199 different pattern shapes, were collected using three different data collection modalities. The database is divided into training and test sets in a 4:1 ratio for each pattern class, with a balanced distribution of shapes from the different modalities. A convex hull and quadtree-based feature set has been designed, and test set recognition performance is reported with a support vector machine classifier. We achieved a recognition accuracy of 79.35% on the test database consisting of 171 character classes. The complete compound character image database is freely available as CMATERdb 3.1.3.3 from the website http://code.google.com/p/cmaterdb/, which may facilitate research on handwritten character recognition, especially for Bangla form document processing systems.
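
The 4:1 split described above, taken per pattern class and balanced across collection modalities, can be illustrated with a minimal Python sketch. The abstract does not give the authors' actual procedure. The record layout `(path, class_id, modality)`, the modality labels `"A"`, `"B"`, `"C"`, the function name `split_4_to_1`, and the toy data are all hypothetical, chosen only to show one plausible way to stratify.

```python
import random
from collections import defaultdict

def split_4_to_1(samples, seed=0):
    """Split (path, class_id, modality) records 4:1 within each (class, modality) group.

    Illustrative stratification only; not the procedure used to build the database.
    """
    groups = defaultdict(list)
    for record in samples:
        _, class_id, modality = record
        groups[(class_id, modality)].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        n_test = len(members) // 5  # one part in five goes to the test set
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical toy data: 2 classes x 3 modalities x 10 images each.
toy = [(f"img_{c}_{m}_{i}.png", c, m)
       for c in range(2) for m in ("A", "B", "C") for i in range(10)]
train, test = split_4_to_1(toy)
print(len(train), len(test))  # 48 12
```

Grouping by both class and modality before splitting keeps every modality represented in the training and test sets for each class. Splitting the pooled images at random would not guarantee this.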

Keywords

OCR, Handwritten character recognition, Bangla, Compound character, Benchmark database, SVM

Supplementary material

Supplementary material 1: 10032_2014_222_MOESM1_ESM.pdf (PDF, 4340 KB)

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Nibaran Das (1)
  • Kallol Acharya (1)
  • Ram Sarkar (1)
  • Subhadip Basu (1)
  • Mahantapas Kundu (1)
  • Mita Nasipuri (1)

  1. Computer Science and Engineering Department, Jadavpur University, Kolkata, India
