Bangla/English Script Identification Based on Analysis of Connected Component Profiles

  • Lijun Zhou
  • Yue Lu
  • Chew Lim Tan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)

Abstract

Script identification is required for a multilingual OCR system. In this paper, we present a novel and efficient technique for Bangla/English script identification, with applications to the destination address block of Bangladesh envelope images. The proposed approach is based on the analysis of connected component profiles extracted from the destination address block images. It places no emphasis on the information provided by individual characters and requires no character or line segmentation. Experimental results demonstrate that the proposed technique is capable of identifying Bangla and English scripts in real Bangladesh postal images.
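The abstract does not define what a connected component profile is, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a binarized image given as a list of rows of 0/1 values, uses 8-connectivity, and takes "profile" to mean the top profile of a component: for each column of its bounding box, the distance from the top of the box to the first foreground pixel. The function names `connected_components` and `top_profile` are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Group 8-connected foreground (1) pixels; return one pixel list per component."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def top_profile(pixels):
    """Assumed profile definition: per column, depth of the first foreground
    pixel measured from the top of the component's bounding box."""
    top = min(y for y, _ in pixels)
    left = min(x for _, x in pixels)
    right = max(x for _, x in pixels)
    profile = [None] * (right - left + 1)
    for y, x in pixels:
        depth, col = y - top, x - left
        if profile[col] is None or depth < profile[col]:
            profile[col] = depth
    return profile


# Toy image: one component with a flat top stroke, one small diagonal component.
image = [
    [1, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1],
]
profiles = [top_profile(p) for p in connected_components(image)]
print(profiles)
```

Working on whole components in this way needs no character or line segmentation, which is consistent with the abstract. Which profile features the paper actually measures, and how it classifies them, is not stated here.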

Keywords

Document Image; Text Line; Text Block; Handwritten Text; Postal Stamp
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Lijun Zhou (1)
  • Yue Lu (1, 2)
  • Chew Lim Tan (3)

  1. Department of Computer Science and Technology, East China Normal University, Shanghai, China
  2. Shanghai Research Institute of Postal Science, China State Post Bureau, Shanghai, China
  3. Department of Computer Science, School of Computing, National University of Singapore, Singapore