Cluster Computing, Volume 21, Issue 1, pp 503–514

Urdu ligature recognition using multi-level agglomerative hierarchical clustering

  • Naila Habib Khan
  • Awais Adnan
  • Sadia Basar

Abstract

Optical character recognition (OCR) systems hold great significance in human-machine interaction. OCR has been the subject of intensive research, especially for Latin, Chinese and Japanese scripts. Comparatively little work has been done on Urdu OCR, owing to the complexities and segmentation errors associated with its cursive script. This paper proposes an Urdu OCR system that aims at ligature-level recognition of Urdu text. This ligature-based recognition approach overcomes the character-level segmentation problems associated with cursive scripts. A newly developed OCR algorithm is introduced that uses semi-supervised multi-level clustering to categorize the ligatures. Classification is performed using four machine learning techniques: decision trees, linear discriminant analysis, naive Bayes and k-nearest neighbor (k-NN). The system was implemented, and the results show accuracies of 62%, 61%, 73% and 90% for decision trees, linear discriminant analysis, naive Bayes and k-NN, respectively.
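
The abstract does not spell out the clustering or classification details, so the following is only a minimal sketch of the two generic techniques it names: agglomerative hierarchical clustering to group samples, followed by k-nearest-neighbor classification of a new sample. The toy 2-D feature vectors, the average-linkage criterion, the Euclidean distance and the function names `agglomerative` and `knn_predict` are all illustrative assumptions, not the authors' features or algorithm.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until n_clusters remain. Returns a list of
    clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (average distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(euclidean(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: euclidean(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


# Hypothetical feature vectors standing in for ligature descriptors.
features = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
            (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]

# Step 1: group the samples; cluster ids then serve as class labels.
clusters = agglomerative(features, n_clusters=2)
labels = [None] * len(features)
for cluster_id, members in enumerate(clusters):
    for idx in members:
        labels[idx] = cluster_id

# Step 2: classify an unseen sample against the labelled clusters.
predicted = knn_predict(features, labels, (4.7, 5.1), k=3)
```

In this sketch the cluster ids produced in the first step are reused as training labels for the second step, which is one simple way to combine unsupervised grouping with a supervised classifier.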

Keywords

Agglomerative Clustering Classification OCR Urdu 

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Department of Computer Science, Institute of Management Sciences, Peshawar, Pakistan
  2. Department of Information Technology, Hazara University, Mansehra, Pakistan
