Learning on the fly: a font-free approach toward multilingual OCR

Original Paper

Abstract

Despite ubiquitous claims that optical character recognition (OCR) is a “solved problem,” many categories of documents, such as those with moderate degradation or unusual fonts, continue to break modern OCR software. Many approaches rely on pre-computed or stored character models, but these are vulnerable when the font of a particular document was not part of the training set or when a document contains so much noise that the font model becomes weak. To address these difficult cases, we present a form of iterative contextual modeling that learns character models directly from the document it is trying to recognize. We use these learned models both to segment the characters and to recognize them in an incremental, iterative process. We present results comparable with those of a commercial OCR system on a subset of characters from a difficult test document in both English and Greek.

Keywords

Character recognition, OCR, Cryptogram, Font-free models, Multilingual OCR

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Andrew Kae (1)
  • David A. Smith (1)
  • Erik Learned-Miller (1)
  1. Department of Computer Science, University of Massachusetts Amherst, Amherst, USA
