Script Identification in Printed Bilingual Documents

  • D. Dhanya
  • A. G. Ramakrishnan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2423)

Abstract

Identification of script in multi-lingual documents is essential for many language-dependent applications such as machine translation and optical character recognition. Techniques for script identification generally require large areas of text to operate on, so that sufficient information is available. This assumption does not hold in the Indian context, where words of two different scripts are interspersed in most documents. In this paper, techniques to identify the script of a single word are discussed. Two different approaches have been proposed and tested. The first method structures words into three distinct spatial zones and uses the spatial spread of a word in the upper and lower zones, together with the character density, to identify the script. The second technique analyzes the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Words with various font styles and sizes have been used to test the proposed algorithms, and the results obtained are quite encouraging.
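
As an illustration of the second approach, the following is a minimal sketch of directional energy features computed with a small Gabor filter bank. It is not the authors' implementation: the function names (`gabor_kernel`, `directional_energy`, `energy_features`), the kernel size, the Gaussian width, the frequency of 0.25 cycles per pixel, and the four orientations are all assumptions chosen for illustration, since the abstract does not state these values.

```python
import math

def gabor_kernel(freq, theta, sigma=2.0, half=4):
    """Even-symmetric (cosine) Gabor kernel with its mean removed.

    freq is in cycles per pixel; theta is the direction along which the
    sinusoid varies. sigma and half are illustrative assumptions.
    """
    rows = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the filter's direction of variation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * freq * xr))
        rows.append(row)
    # Subtract the mean so that flat regions give no response.
    mean = sum(map(sum, rows)) / (2 * half + 1) ** 2
    return [[v - mean for v in row] for row in rows]

def directional_energy(image, kernel):
    """Mean squared filter response over positions where the kernel fits.

    image is a list of rows of floats and must be at least kernel-sized.
    """
    k = len(kernel)
    h, w = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            resp = sum(kernel[i][j] * image[r + i][c + j]
                       for i in range(k) for j in range(k))
            total += resp * resp
            count += 1
    return total / count

def energy_features(image, freqs=(0.25,),
                    thetas=(0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4)):
    """One energy value per (frequency, orientation) pair."""
    return [directional_energy(image, gabor_kernel(f, t))
            for f in freqs for t in thetas]

# Synthetic "word image": vertical strokes, two pixels on, two pixels off.
stripes = [[1.0 if (c % 4) < 2 else 0.0 for c in range(16)] for _ in range(16)]
features = energy_features(stripes)
```

For the synthetic stripe image, the energy at `theta = 0` (variation along the horizontal axis) dominates the energy at `theta = pi/2`, which is the kind of directional contrast such a feature vector is meant to capture. In practice the feature vector of a segmented word image would be passed to a classifier; the choice of classifier is outside the scope of this sketch.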

Keywords

Machine Translation; Human Visual System; Text Line; Optical Character Recognition; Character Density
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • D. Dhanya (1)
  • A. G. Ramakrishnan (1)

  1. Department of Electrical Engineering, Indian Institute of Science, Bangalore, India
