Abstract
Digitization offers a solution to the challenges of managing and retrieving paper-based documents. However, paper documents must first be converted into a machine-readable format, since computers process encoded alphanumeric text rather than handwriting on a page. This conversion is achieved through Optical Character Recognition (OCR), a technology that turns scanned document images into text that machines can process. This work proposes a novel multi-stage, top-down character segmentation approach. The approach first isolates lines from handwritten documents and then segments those lines into words and characters. To further improve character segmentation, a raster scanning object detection technique is employed to isolate individual characters within words. The final character segmentation integrates the results of vertical projection and raster scanning. Recognizing the significance of advancing the digitization of handwritten documents, we focus on the regional languages of Assam and Andhra Pradesh because of their historical and cultural importance in India’s linguistic diversity. Because datasets in the required form were unavailable, we collected datasets of handwritten text in Assamese and Telugu. Across both languages, our approach achieved average segmentation accuracies of 93.61%, 85.96%, and 88.74% for lines, words, and characters, respectively. The motivation for a top-down approach is two-fold: first, it enhances the accuracy of character recognition; second, the segmented lines and words can be reused in future work on language and script identification.
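To make the top-down idea concrete, the following is a minimal sketch of projection-based segmentation, not the authors' implementation. It assumes an already binarized page stored as a NumPy array with 1 for ink and 0 for background. The function names (`find_runs`, `segment_lines`, `segment_columns`, `segment_page`) and the `word_gap` parameter are hypothetical and chosen for illustration only. The raster scanning refinement mentioned in the abstract is not reproduced here, because the abstract does not specify its details.

```python
import numpy as np


def find_runs(profile, min_ink=1):
    """Return (start, end) pairs, end exclusive, where profile >= min_ink."""
    mask = np.asarray(profile) >= min_ink
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))


def segment_lines(binary):
    """Split a page into row ranges using the horizontal projection profile."""
    return find_runs(binary.sum(axis=1))


def segment_columns(binary_strip, min_gap=1):
    """Split a strip into column ranges using the vertical projection profile.

    Runs separated by fewer than min_gap blank columns are merged, so a
    larger min_gap yields words and min_gap=1 yields character candidates.
    """
    merged = []
    for start, end in find_runs(binary_strip.sum(axis=0)):
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def segment_page(binary, word_gap=3):
    """Top-down pass: lines first, then words, then character candidates."""
    result = []
    for r0, r1 in segment_lines(binary):
        line = binary[r0:r1]
        words = []
        for w0, w1 in segment_columns(line, min_gap=word_gap):
            # Character spans are reported in page column coordinates.
            chars = [(w0 + c0, w0 + c1)
                     for c0, c1 in segment_columns(line[:, w0:w1])]
            words.append({"span": (w0, w1), "chars": chars})
        result.append({"rows": (r0, r1), "words": words})
    return result
```

In this sketch the same column-splitting routine serves both the word and the character level, and only the gap threshold changes. Real handwriting with touching or overlapping characters will defeat a pure projection split, which is presumably why the approach combines it with a second, object-level pass.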
Data availability
The Assamese and Telugu handwritten text dataset is publicly available on IEEE DataPort (DOI: 10.21227/3ycm-px23). The link to the dataset is provided in Sect. 3.1.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest in any part of the work presented in the manuscript.
Ethical approval
In accordance with ethical standards, our research obtained informed consent from participants and followed all necessary ethical review procedures.
About this article
Cite this article
Dutta, P., Muppalaneni, N.B. A top-down character segmentation approach for Assamese and Telugu handwritten documents. J Ambient Intell Human Comput (2024). https://doi.org/10.1007/s12652-024-04805-y
DOI: https://doi.org/10.1007/s12652-024-04805-y