Performance evaluation of different features and classifiers for Gurumukhi newspaper text recognition

  • Original Research
  • Journal of Ambient Intelligence and Humanized Computing

Abstract

Document analysis has long attracted researchers seeking new techniques for archiving the important information printed or handwritten on documents. Newspapers are one such source, comprising both chronicled and analytical data. Archiving this information through optical character recognition (OCR) can be of benefit in the future. In OCR, the printed or handwritten text under consideration is processed through several phases to extract recognizable units. Finally, the characters are recognized to generate a computer-processable form of the text. Feature extraction and classification are significant phases, in which the features of a segmented character image are extracted and fed to a classifier for identification. In the present work, various feature extraction and classification techniques have been implemented to recognize newspaper text printed in Gurumukhi script. Six feature extraction techniques, namely zoning, diagonal, centroid, power curve fitting, parabola curve fitting, and peak extent, have been used to extract features from a character image. Four classifiers, namely k-nearest neighbor, multilayer perceptron, decision tree, and random forest, have been explored for classification. The feature extraction techniques and classifiers are evaluated on the basis of the recognition results obtained. A maximum recognition accuracy of 96.9% has been obtained using diagonal features with the random forest classifier.
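The feature-extraction-then-classification pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple density form of zoning (fraction of foreground pixels per zone) and a plain k-nearest-neighbor vote with squared Euclidean distance. The function names `zoning_features` and `knn_predict`, the 4×4 toy images, and the labels `"vertical"` and `"horizontal"` are all hypothetical and stand in for real segmented Gurumukhi character images.

```python
from collections import Counter


def zoning_features(image, zones=4):
    """Split a binary image (list of rows of 0/1) into zones x zones blocks
    and return the foreground-pixel density of each block, row by row."""
    height, width = len(image), len(image[0])
    features = []
    for zr in range(zones):
        r0, r1 = zr * height // zones, (zr + 1) * height // zones
        for zc in range(zones):
            c0, c1 = zc * width // zones, (zc + 1) * width // zones
            area = (r1 - r0) * (c1 - c0)
            ink = sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
            features.append(ink / area if area else 0.0)
    return features


def knn_predict(train_features, train_labels, query, k=1):
    """Label a query vector by majority vote among its k nearest
    training vectors (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(f, query)), label)
        for f, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy "characters": a vertical and a horizontal stroke.
vertical = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
horizontal = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
# A slightly noisy vertical stroke used as the unseen sample.
noisy = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0]]

train_x = [zoning_features(img, zones=2) for img in (vertical, horizontal)]
train_y = ["vertical", "horizontal"]
print(knn_predict(train_x, train_y, zoning_features(noisy, zones=2), k=1))
```

In this sketch each character image is reduced to a short vector of zone densities before any classifier sees it, which is the role the feature extraction phase plays in the abstract. Swapping in a different feature set or classifier would only change one of the two functions.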

Author information

Corresponding author

Correspondence to Rupinder Pal Kaur.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kaur, R.P., Kumar, M. & Jindal, M.K. Performance evaluation of different features and classifiers for Gurumukhi newspaper text recognition. J Ambient Intell Human Comput 14, 10245–10261 (2023). https://doi.org/10.1007/s12652-021-03687-8

  • DOI: https://doi.org/10.1007/s12652-021-03687-8
