A Novel Machine Annotated Balanced Bangla OCR Corpus

Rifat, Md Jamiur Rahman; Banik, Mridul; Hasan, Nazmul; Nahar, Jebun; Rahman, Fuad

doi:10.1007/978-981-16-1092-9_13

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1377)

Included in the following conference series:

International Conference on Computer Vision and Image Processing

Abstract

We present a balanced, fully machine-annotated Bangla OCR corpus of nearly eight and a half million characters, a first-of-its-kind effort for the Bangla language. Although Bangla is one of the five most frequently used languages in the world, it is still considered a resource-poor language, since even the most basic language processing tools are not available. This corpus goes a long way toward mitigating that shortcoming. It continues our previous work on building a synthetic corpus for training Bangla OCRs and is an intermediate step toward our ultimate goal: a true gold-standard corpus for Bangla OCRs. The paper discusses not only the corpus but also the entire processing pipeline, from scanning pages to identifying lines to segmenting words and finally characters, which are then annotated using a homegrown OCR. We also present a comprehensive review of the corpus characteristics and specifications, demonstrating that the corpus is very nearly balanced, a highly desirable characteristic in any corpus design.
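The abstract does not say how lines and words are located on a scanned page. As a rough illustration only, the sketch below uses projection profiles, a common baseline for printed text, on a binarized page stored as nested lists (1 = ink, 0 = background). The function names, the `min_gap` parameter, and the choice of projection profiles are all assumptions made for this example and are not taken from the paper. Character segmentation is left out on purpose: Bangla characters within a word are usually joined by the headline stroke (matra), so a plain column profile does not separate them.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized page: 1 = ink pixel, 0 = background


def ink_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of consecutive non-zero entries.

    Spans separated by fewer than `min_gap` empty entries are merged.
    (`min_gap` is a hypothetical tuning parameter, not from the paper.)
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ends at the first empty entry
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])  # gap too small: join spans
        else:
            merged.append(run)
    return merged


def segment_lines(page: Image) -> List[Tuple[int, int]]:
    """Find text lines as row spans whose horizontal projection is non-zero."""
    row_profile = [sum(row) for row in page]
    return ink_runs(row_profile)


def segment_words(page: Image, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Find words in one line as column spans separated by >= min_gap empty columns."""
    top, bottom = line
    width = len(page[0])
    col_profile = [sum(page[r][c] for r in range(top, bottom))
                   for c in range(width)]
    return ink_runs(col_profile, min_gap=min_gap)
```

Calling `segment_lines(page)` and then `segment_words(page, line)` for each returned line gives row and column spans that could be cropped and passed on to a recognizer. Real scans would first need binarization and skew correction, which this sketch does not attempt.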

Author information

Authors and Affiliations

Apurba Technologies, Dhaka, Bangladesh
Md Jamiur Rahman Rifat, Mridul Banik, Nazmul Hasan, Jebun Nahar & Fuad Rahman

Corresponding author

Correspondence to Md Jamiur Rahman Rifat.

Editor information

Editors and Affiliations

Indian Institute of Information Technology Allahabad, Prayagraj, India
Satish Kumar Singh
Indian Institute of Technology Roorkee, Roorkee, India
Partha Roy
Indian Institute of Technology Roorkee, Roorkee, India
Balasubramanian Raman
Indian Institute of Information Technology Allahabad, Prayagraj, India
P. Nagabhushan

About this paper

Cite this paper

Rifat, M.J.R., Banik, M., Hasan, N., Nahar, J., Rahman, F. (2021). A Novel Machine Annotated Balanced Bangla OCR Corpus. In: Singh, S.K., Roy, P., Raman, B., Nagabhushan, P. (eds) Computer Vision and Image Processing. CVIP 2020. Communications in Computer and Information Science, vol 1377. Springer, Singapore. https://doi.org/10.1007/978-981-16-1092-9_13

DOI: https://doi.org/10.1007/978-981-16-1092-9_13
Published: 28 March 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1091-2
Online ISBN: 978-981-16-1092-9
eBook Packages: Computer Science, Computer Science (R0)