Abstract
We present a balanced, fully machine-annotated Bangla OCR corpus of nearly eight and a half million characters, a first-of-its-kind effort for the Bangla language. Although Bangla is among the five most widely used languages in the world, it is still considered resource-poor because even the most basic language processing tools are unavailable. This corpus goes a long way toward mitigating that shortcoming. The work continues our previous effort to build a synthetic corpus for training Bangla OCRs, and it is an intermediate step toward our ultimate goal: a true gold-standard corpus for Bangla OCRs. The paper discusses not only the corpus but also the entire processing pipeline, from scanning pages to identifying lines to segmenting words and finally characters, which are then annotated using a homegrown OCR. We also present a comprehensive review of the corpus characteristics and specifications, demonstrating that the corpus is very nearly balanced, a highly desirable characteristic in corpus design.
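To make the line and word stages of such a pipeline concrete, the following is a minimal sketch using projection profiles on a binarized page. It is an illustrative assumption, not the method used in the paper: the abstract does not say how segmentation is performed, and the names `runs`, `segment_lines`, and `segment_words` are hypothetical. Character segmentation is left out because Bangla characters within a word are typically joined by a headline (matra), so a plain column profile separates words rather than characters.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed input format)


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where the profile is non-zero."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ends at the first empty index
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Row ranges that contain ink, from the horizontal projection profile."""
    return runs([sum(row) for row in image])


def segment_words(image: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Column ranges that contain ink within one line (vertical projection profile)."""
    top, bottom = line
    width = len(image[0]) if image else 0
    profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(profile)


# Toy page: two text lines, the first with two ink blobs separated by a gap.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
]
lines = segment_lines(page)
words = [segment_words(page, line) for line in lines]
```

On real scans, a sketch like this would also need binarization, skew correction, and a minimum gap width so that small breaks inside a word are not treated as word boundaries.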
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rifat, M.J.R., Banik, M., Hasan, N., Nahar, J., Rahman, F. (2021). A Novel Machine Annotated Balanced Bangla OCR Corpus. In: Singh, S.K., Roy, P., Raman, B., Nagabhushan, P. (eds) Computer Vision and Image Processing. CVIP 2020. Communications in Computer and Information Science, vol 1377. Springer, Singapore. https://doi.org/10.1007/978-981-16-1092-9_13
DOI: https://doi.org/10.1007/978-981-16-1092-9_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1091-2
Online ISBN: 978-981-16-1092-9
eBook Packages: Computer Science (R0)