
A top-down character segmentation approach for Assamese and Telugu handwritten documents

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Digitization offers a solution to the challenges of managing and retrieving paper-based documents. However, these documents must first be converted into a machine-readable format, since computers primarily process alphanumeric text. This conversion is achieved through Optical Character Recognition (OCR), a technology that converts scanned document images into text that machines can process. This work proposes a novel multi-stage, top-down character segmentation approach. The approach first isolates lines from handwritten documents and then uses these lines to segment words and characters. To further enhance character segmentation, a raster scanning object detection technique is employed to isolate individual characters within words. The final character segmentation thus combines the results of vertical projection and raster scanning. Recognizing the significance of advancing the digitization of handwritten documents, we focus on the regional languages of Assam and Andhra Pradesh because of their historical and cultural importance to India’s linguistic diversity. Because handwritten text datasets in Assamese and Telugu were not available in the desired form, we collected our own. Our approach achieved average segmentation accuracies of 93.61%, 85.96%, and 88.74% for lines, words, and characters, respectively, across both languages. The motivation for a top-down approach is two-fold: first, it enhances the accuracy of character recognition; second, the segmented lines and words can potentially be used for language/script identification in future work.
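The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, which is not visible in this preview. It assumes a binarized image with ink pixels set to 1, uses a horizontal projection for line isolation (an assumption; the abstract names only vertical projection), and stands in for the raster scanning step with a generic row-by-row scan plus flood fill. All function names and the `min_gap` parameter are hypothetical.

```python
from collections import deque

import numpy as np


def runs_of_ink(profile, min_gap=1):
    """Return end-exclusive (start, end) runs of non-zero values in a 1-D profile.

    Runs separated by fewer than min_gap empty positions are merged.
    """
    runs, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


def segment_lines(binary):
    """Assumed line step: split where the per-row ink count drops to zero."""
    return runs_of_ink(binary.sum(axis=1))


def segment_columns(binary, min_gap=1):
    """Vertical projection: split where the per-column ink count drops to zero.

    A larger min_gap yields word-level pieces, a smaller one character-level pieces.
    """
    return runs_of_ink(binary.sum(axis=0), min_gap)


def raster_scan_boxes(binary):
    """Scan row by row; each unvisited ink pixel seeds an 8-connected flood fill.

    Returns end-exclusive boxes (top, left, bottom, right), sorted left to right.
    """
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r in range(h):
        for c in range(w):
            if binary[r, c] and not seen[r, c]:
                top, left, bottom, right = r, c, r, c
                queue = deque([(r, c)])
                seen[r, c] = True
                while queue:
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny, nx] and not seen[ny, nx]):
                                seen[ny, nx] = True
                                queue.append((ny, nx))
                boxes.append((top, left, bottom + 1, right + 1))
    return sorted(boxes, key=lambda box: box[1])


# Toy "page": two small blobs on the first line, one wide blob on the second.
page = np.zeros((7, 9), dtype=np.uint8)
page[1:3, 1:3] = 1
page[1:3, 4:6] = 1
page[4:6, 1:8] = 1

lines = segment_lines(page)
first_line = page[lines[0][0]:lines[0][1]]
words = segment_columns(first_line, min_gap=2)
chars = segment_columns(first_line, min_gap=1)
boxes = raster_scan_boxes(first_line)
```

On this toy input the one-pixel gap between the two blobs is ignored at the word level (`min_gap=2`) but splits them at the character level (`min_gap=1`). Unlike a column profile, the raster scan can also separate components whose column ranges overlap, which is one plausible reason for combining the two results.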


Data availability

The Assamese and Telugu handwritten text dataset is publicly available on IEEE DataPort. The link to the dataset is provided in Sect. 3.1.

Notes

  1. https://ieee-dataport.org/documents/assamese-and-telugu-handwritten-text-dataset.


Author information

Corresponding author

Correspondence to Prarthana Dutta.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest in any part of the work presented in the manuscript.

Ethical approval

In accordance with ethical standards, our research obtained informed consent from participants and followed all necessary ethical review procedures.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Dutta, P., Muppalaneni, N.B. A top-down character segmentation approach for Assamese and Telugu handwritten documents. J Ambient Intell Human Comput (2024). https://doi.org/10.1007/s12652-024-04805-y

  • DOI: https://doi.org/10.1007/s12652-024-04805-y
