Word–Wise Script Identification from Indian Documents

  • Suranjit Sinha
  • Umapada Pal
  • B. B. Chaudhuri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3163)

Abstract

In a country like India, a single text line in most official documents contains words in two different scripts. Under the two-language formula, Indian documents are written in English and the official language of the state. For Optical Character Recognition (OCR) of such a document page, words of different scripts must be separated before they are fed to the OCR systems of the individual scripts. This paper proposes a robust technique for word-wise script identification from Indian doublet-form documents. The document is first segmented into lines, and the lines are then segmented into words. The script of each word is then identified using topological and structural features such as the number of loops, the headline feature, water-reservoir-concept-based features, and profile features. The proposed scheme was tested on 24210 words from different doublets and achieved more than 97% accuracy on average.
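The line-then-word segmentation step and the headline feature can be illustrated with a minimal sketch. The abstract does not say how segmentation is performed, so projection profiles are assumed here as one common choice. The function names, the `min_gap` parameter, and the `ratio` threshold are hypothetical and do not come from the paper. The image is taken to be a binary grid in which 1 marks ink.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def runs_of_ink(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans where the profile is non-zero.

    Spans separated by fewer than `min_gap` empty positions are merged.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for span in spans:
        if merged and span[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], span[1])
        else:
            merged.append(span)
    return merged


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Split the page into text lines using the horizontal projection profile."""
    row_profile = [sum(row) for row in img]
    return runs_of_ink(row_profile)


def segment_words(img: Image, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Split one line into words using the vertical projection profile.

    `min_gap` (an assumed tuning parameter) is the smallest run of empty
    columns treated as a word boundary rather than a gap inside a word.
    """
    top, bottom = line
    width = len(img[0]) if img else 0
    col_profile = [sum(img[r][c] for r in range(top, bottom))
                   for c in range(width)]
    return runs_of_ink(col_profile, min_gap)


def has_headline(img: Image, line: Tuple[int, int], word: Tuple[int, int],
                 ratio: float = 0.7) -> bool:
    """Check whether any row holds an ink run spanning most of the word width.

    The 0.7 ratio is an illustrative threshold, not a value from the paper.
    """
    top, bottom = line
    left, right = word
    longest = 0
    for r in range(top, bottom):
        run = 0
        for c in range(left, right):
            run = run + 1 if img[r][c] else 0
            longest = max(longest, run)
    return longest >= ratio * (right - left)
```

A long horizontal run near the top of a word is typical of scripts such as Bangla and Devnagari and absent from English, so a check like `has_headline` could serve as one early cue before the other features (loops, water reservoirs, profiles) are consulted.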

Keywords

Script Identification; Indian script; Bangla script; Malayalam script; Gujarati script; Devnagari script; Telugu script; Multi-script OCR

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Suranjit Sinha (1)
  • Umapada Pal (1)
  • B. B. Chaudhuri (1)
  1. Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
