Multi-font Devanagari Text Recognition Using LSTM Neural Networks

  • Teja Kundaikar
  • Jyoti D. Pawar
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1045)

Abstract

Current research in OCR focuses on the effect of multi-font and multi-size text on OCR accuracy. To the best of our knowledge, no study has examined the effect of multi-font and multi-size text on the accuracy of Devanagari OCRs. The most popular Devanagari OCRs on the market today are Tesseract OCR, Indsenz OCR and eAksharayan OCR. In this work, we study the effect of five font styles, namely Nakula, Baloo, Dekko, Biryani and Aparajita, on these three OCRs. We observe that the accuracy of the Devanagari OCRs depends on the font style used in the text document images. Hence, we propose a multi-font Devanagari OCR (MFD_OCR), a text line recognition model using long short-term memory (LSTM) neural networks. We created a training dataset, Multi_Font_Train, which consists of text document images and their corresponding text files; it contains each text line in the same five font styles. The test datasets were created using the text from a benchmark dataset [1] for each of these font styles, and are named BMT_Nakula, BMT_Baloo, BMT_Dekko, BMT_Biryani and BMT_Aparajita. On evaluating all the OCRs, MFD_OCR showed consistent accuracy across all these test datasets. It obtained comparatively good accuracy on the BMT_Dekko and BMT_Biryani test datasets. A detailed error analysis showed that, compared to the other Devanagari OCRs, MFD_OCR produces consistent insertion and deletion errors across the test datasets for all font styles. The deletion errors are negligible, ranging from 0.8 to 1.4%.
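
The insertion and deletion error types named in the abstract are commonly obtained from a minimum edit-distance alignment between the ground-truth text and the OCR output. The following is a minimal sketch of that idea, not the authors' evaluation tool: the function names `edit_ops` and `char_accuracy` are hypothetical, and the code assumes errors are counted per Unicode code point (so a Devanagari vowel sign or virama counts as one character), which may differ from how the paper counts them.

```python
def edit_ops(reference: str, hypothesis: str) -> dict:
    """Count substitutions, insertions and deletions in one minimal
    edit-distance alignment of the OCR output against the ground truth."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # character missing from the output
                dist[i][j - 1] + 1,         # extra character in the output
            )

    # Walk back from the bottom-right cell to classify each edit.
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["substitution"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character accuracy = (reference length - total errors) / reference length."""
    errors = sum(edit_ops(reference, hypothesis).values())
    return (len(reference) - errors) / len(reference)


if __name__ == "__main__":
    truth = "नमस्ते"    # six code points, including the virama
    output = "नमसते"    # OCR output with the virama dropped
    print(edit_ops(truth, output))
    print(char_accuracy(truth, output))
```

Under this definition, a deletion rate is the number of deletions divided by the number of reference characters. Counting by grapheme cluster instead of code point would change the figures for conjunct-heavy text.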

Keywords

Multi-font OCR Devanagari 

References

  1. Mathew, M., Singh, A.K., Jawahar, C.: Multilingual OCR for Indic scripts. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 186–191 (2016)
  2. Source OCR: Tesseract OCR [Computer software manual]. Retrieved from https://github.com/tesseract-ocr/. Accessed 14 Aug 2018
  3. Hellwig, O.: Indsenz OCR [Computer software manual]. Retrieved from http://www.indsenz.com/int/index.php. Accessed 14 Aug 2018
  4. tdil: eaksharaya [Computer software manual] (2018). Retrieved from http://tdil-dc.in/eocr/index.html. Accessed 14 Aug 2018
  5. Technology development for Indian languages [Computer software manual]. Retrieved from http://tdil.meity.gov.in/. Accessed 26 Jan 2019
  6. Sinha, R.: A syntactic pattern analysis system and its application to Devanagari script recognition. Electrical Engineering Department (1973)
  7. Palit, S., Chaudhuri, B.: A feature-based scheme for the machine recognition of printed Devanagari script. In: Das, P.P., Chatterjee, B.N. (eds.) Pattern Recognition, Image Processing and Computer Vision, pp. 163–168. Narosa Publishing House, New Delhi, India
  8. Pal, U., Chaudhuri, B.: Printed Devanagari script OCR system. VIVEK-BOMBAY 10, 12–24 (1997)
  9. Pal, U., Chaudhuri, B.: Indian script character recognition: a survey. Pattern Recogn. 37(9), 1887–1899 (2004)
  10. Karayil, T., Ul-Hasan, A., Breuel, T.M.: A segmentation-free approach for printed Devanagari script recognition. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 946–950 (2015)
  11. Sankaran, N., Jawahar, C.: Recognition of printed Devanagari text using BLSTM neural network. In: ICPR, vol. 12, pp. 322–325 (2012)
  12. Krishnan, P., Sankaran, N., Singh, A.K., Jawahar, C.: Towards a robust OCR system for Indic scripts. In: 2014 11th IAPR International Workshop on Document Analysis Systems (DAS), pp. 141–145 (2014)
  13. Ul-Hasan, A., Ahmed, S.B., Rashid, F., Shafait, F., Breuel, T.M.: Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1061–1065 (2013)
  14. Acharya, S., Pant, A.K., Gyawali, P.K.: Deep learning based large scale handwritten Devanagari character recognition. In: 2015 9th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), pp. 1–6 (2015)
  15. Ul-Hasan, A., Bukhari, S.S., Dengel, A.: Ocroract: A sequence learning OCR system trained on isolated characters. In: DAS, pp. 174–179 (2016)
  16. Dalvi, G.: Terminology of Devanagari typefaces [Computer software manual]. Retrieved from http://dsource.in/tool/devft/en/terminology.php. Accessed 20 Jan 2019
  17. Type, D.E.: Baloo fonts [Computer software manual]. Retrieved from https://fonts.google.com/specimen/Baloo. Accessed 24 Jan 2019
  18. Type, D.S.: Dekko fonts [Computer software manual]. Retrieved from https://fonts.google.com/specimen/Dekko. Accessed 24 Jan 2019
  19. Font Biryani, G.: Biryani fonts [Computer software manual]. Retrieved from https://fonts.google.com/specimen/Biryani. Accessed 24 Jan 2019
  20. Microsoft Corporation: Aparajita fonts [Computer software manual]. Retrieved from https://catalog.monotype.com/font/microsoft-corporation/aparajita/regular. Accessed 24 Jan 2019
  21. Rice, S.V., Nartker, T.A.: The ISRI analytic tools for OCR evaluation. UNLV/Information Science Research Institute, TR-96-02 (1996)
  22. cfilt.iitb: Word frequency for Hindi [Computer software manual]. Retrieved from http://www.cfilt.iitb.ac.in/Downloads.html. Accessed 14 Aug 2018
  23. OCRopy: Open source document analysis and OCR system [Computer software manual]. Retrieved from https://github.com/tmbdev/ocropy. Accessed 14 Jan 2019

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Department of Computer Science and Technology, Goa University, Taleigao, India
