Abstract
This paper describes the creation and characteristics of UHTelPCC, a dataset for Telugu printed character recognition. The dataset is built from characters extracted from images of Telugu texts printed between 1950 and 1990, and is therefore expected to provide a basis for developing practical Telugu OCR systems. UHTelPCC is intended to provide a standard benchmark for comparing Telugu OCR algorithms and to support research and development of Telugu OCR systems. It contains 70K samples from 325 classes, divided into training, validation, and test sets of 50K, 10K, and 10K samples respectively. We hope that UHTelPCC will serve Telugu printed character recognition as MNIST serves handwritten digit recognition. Baseline performance on the test set using KNN, MLP, and CNN is 98.85%, 99.52%, and 99.68% respectively. UHTelPCC is available at http://scis.uohyd.ac.in/~chakcs/UHTelPCC.html.
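As a rough illustration of the 50K/10K/10K partition and of a KNN-style baseline, the following minimal sketch uses only the Python standard library. It is not the authors' code: the helper names (`split_indices`, `knn_predict`, `accuracy`), the random shuffle, the Euclidean distance, and the toy 2-D points are all assumptions made for illustration. The abstract does not state how the split was drawn or how the KNN baseline was configured, and the released dataset may well ship with a fixed split.

```python
import random


def split_indices(n_samples, n_train, n_val, n_test, seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    if n_train + n_val + n_test != n_samples:
        raise ValueError("split sizes must sum to n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: a seeded random split
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def knn_predict(train_x, train_y, query, k=1):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Squared Euclidean distance is sufficient for ranking neighbours.
    ranked = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = {}
    for i in ranked[:k]:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    return max(votes, key=votes.get)


def accuracy(train_x, train_y, test_x, test_y, k=1):
    """Fraction of test vectors whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y)
    )
    return correct / len(test_x)


# Split sizes taken from the abstract: 70K samples into 50K / 10K / 10K.
train_idx, val_idx, test_idx = split_indices(70_000, 50_000, 10_000, 10_000)

# Hypothetical toy data standing in for character feature vectors.
toy_train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
toy_train_y = ["a", "a", "b", "b"]
toy_test_x = [(0.0, 0.4), (5.0, 5.2)]
toy_test_y = ["a", "b"]
toy_accuracy = accuracy(toy_train_x, toy_train_y, toy_test_x, toy_test_y, k=1)
```

In practice the toy points would be replaced by feature vectors derived from the character images, and the validation set would be used to choose `k` before reporting a figure on the test set.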
Acknowledgment
We thank Amit Patel for his efforts in labeling connected components. The first author acknowledges the financial support received from the Council of Scientific and Industrial Research (CSIR), Government of India in the form of a Junior Research Fellowship.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kummari, R., Bhagvati, C. (2019). UHTelPCC: A Dataset for Telugu Printed Character Recognition. In: Santosh, K., Hegadi, R. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2018. Communications in Computer and Information Science, vol 1037. Springer, Singapore. https://doi.org/10.1007/978-981-13-9187-3_3
DOI: https://doi.org/10.1007/978-981-13-9187-3_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9186-6
Online ISBN: 978-981-13-9187-3
eBook Packages: Computer Science (R0)