Abstract
Optical character recognition (OCR) offers a technological means of preserving books and manuscripts that may otherwise be lost to deterioration. Once in digital form, text is editable, searchable, and shareable. To protect such documents from destruction, they must be scanned into images and passed to an OCR engine, which generates a digital text file. For large collections, manual typing and conversion is nearly impossible. In this paper, the authors analyze how the Tesseract OCR engine works on images that contain Gujarati text.
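As a rough illustration of the scan-then-recognize workflow described in the abstract, the sketch below calls the Tesseract command-line tool on a single scanned page. It is not the authors' experimental setup. It assumes that the `tesseract` executable is on the PATH and that the Gujarati language data (`guj`) is installed; the function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="guj"):
    """Return the argument list for the Tesseract CLI.

    Tesseract writes its result to "<output_base>.txt"; "guj" is the
    language code of the Gujarati traineddata file.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="guj"):
    """Run Tesseract on one scanned page and return the recognized text."""
    # Assumption: the tesseract binary is on PATH and the language data is installed.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Read the text file that Tesseract produced next to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `ocr_image("page.png", "page")` on a scanned page (a hypothetical file name) would write `page.txt` and return its contents. Recognition quality will depend on the font, scan resolution, and installed language data.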
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Joshi, K., Arolkar, H. (2024). Working of the Tesseract OCR on Different Fonts of Gujarati Language. In: Joshi, A., Mahmud, M., Ragel, R.G., Kartik, S. (eds) ICT: Cyber Security and Applications. ICTCS 2022. Lecture Notes in Networks and Systems, vol 916. Springer, Singapore. https://doi.org/10.1007/978-981-97-0744-7_15
DOI: https://doi.org/10.1007/978-981-97-0744-7_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0743-0
Online ISBN: 978-981-97-0744-7
eBook Packages: Engineering, Engineering (R0)