Language Identification in Degraded and Distorted Document Images

  • Shijian Lu
  • Chew Lim Tan
  • Weihua Huang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)


This paper presents a language identification technique that differentiates Latin-based languages in degraded and distorted document images. Unlike reported methods that transform word images through a character shape coding process, our method captures word shapes directly using local extremum points and horizontal intersection numbers, both of which tolerate noise, character segmentation errors, and slight skew distortions. For each language studied, a word shape template and a word frequency template are first constructed using the proposed word shape coding scheme. Identification is then performed using the Bray-Curtis or Hamming distance between the word shape codes of query images and the constructed word shape and frequency templates. Experiments show that the average identification rate on eight Latin-based languages exceeds 99%. ...
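
The distance-based matching step described in the abstract can be illustrated with a minimal sketch. It assumes that word shape codes are already available as short strings and that each language template is a table of relative code frequencies; the code values, the template contents, and the function names (`frequency_profile`, `bray_curtis`, `hamming`, `identify`) are hypothetical and are not taken from the paper, which does not specify them here.

```python
from collections import Counter


def frequency_profile(codes):
    """Relative frequency of each word shape code in a document."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis distance: sum|p_i - q_i| / sum(p_i + q_i)."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    den = sum(p.get(k, 0.0) + q.get(k, 0.0) for k in keys)
    return num / den if den else 0.0


def hamming(a, b):
    """Hamming distance between two equal-length code strings."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def identify(query_codes, templates):
    """Return the language whose frequency template is closest to the query."""
    query = frequency_profile(query_codes)
    return min(templates, key=lambda lang: bray_curtis(query, templates[lang]))


# Hypothetical toy templates (assumption); real templates would be
# built from training documents in each language.
TEMPLATES = {
    "english": {"A1": 0.5, "B2": 0.3, "C3": 0.2},
    "german": {"A1": 0.1, "B2": 0.2, "D4": 0.7},
}
```

With these toy templates, `identify(["A1", "A1", "B2", "C3"], TEMPLATES)` selects `"english"`, since the query profile shares all of its codes with that template and only a small share of its mass with the other. How the paper combines the word shape template with the word frequency template is not stated in the abstract, so this sketch uses the frequency comparison alone.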


Keywords: Text Image, Query Image, Document Image, Text Line, Word Image



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shijian Lu
    • 1
  • Chew Lim Tan
    • 1
  • Weihua Huang
    • 1
  1. School of Computing, National University of Singapore, Singapore
