Using topic models for OCR correction

  • Faisal Farooq
  • Anurag Bhardwaj
  • Venu Govindaraju
Original Paper


Despite several decades of research in document analysis, recognition of unconstrained handwritten documents remains a challenging task. Previous research has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing, or OCR correction, technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the result to generate smaller topic-specific lexicons that improve recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy-based topic categorization model to refine the recognition output. We present the relative merits of each method and report results on the publicly available IAM database.
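The abstract does not spell out the algorithms, so the following is only a toy sketch of the general idea behind the first method: guess a document's topic from the noisy recognizer output, then correct words against that topic's smaller lexicon. Everything here is an assumption made for illustration: the two tiny lexicons, the function names, the overlap-count categorizer (standing in for a trained topic classifier), and string similarity from Python's `difflib` (standing in for a lexicon-driven word recognizer).

```python
from difflib import get_close_matches

# Hypothetical toy lexicons; a real system would learn these from topic-labeled text.
TOPIC_LEXICONS = {
    "sports": {"team", "match", "goal", "player", "score", "league"},
    "finance": {"bank", "market", "stock", "price", "share", "loan"},
}


def categorize(tokens, lexicons):
    """Pick the topic whose lexicon contains the most recognized tokens.

    This overlap count is a stand-in for a trained topic categorizer.
    """
    overlap = {
        topic: sum(tok in words for tok in tokens)
        for topic, words in lexicons.items()
    }
    return max(overlap, key=overlap.get)


def correct(tokens, lexicon, cutoff=0.6):
    """Replace each out-of-lexicon token with its closest lexicon entry, if any."""
    candidates = sorted(lexicon)  # sorted for deterministic tie-breaking
    corrected = []
    for tok in tokens:
        if tok in lexicon:
            corrected.append(tok)
            continue
        match = get_close_matches(tok, candidates, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else tok)
    return corrected


def topic_lexicon_correction(tokens, lexicons=TOPIC_LEXICONS):
    """Categorize the noisy output, then correct it with the reduced lexicon."""
    topic = categorize(tokens, lexicons)
    return topic, correct(tokens, lexicons[topic])


if __name__ == "__main__":
    noisy = ["team", "scare", "goal", "playor"]
    print(topic_lexicon_correction(noisy))
```

In this toy input, the misrecognized token `scare` is equally similar to `score` and to `share` under this similarity measure; restricting the candidates to the sports lexicon is what resolves it, which is the intuition behind reducing the lexicon by topic.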


OCR correction · Topic models · Lexicon reduction · Language models · Document categorization · Handwritten documents · Unconstrained handwriting





Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Faisal Farooq (1)
  • Anurag Bhardwaj (2)
  • Venu Govindaraju (2)
  1. Image and Knowledge Management, Siemens Medical Solutions, Malvern, USA
  2. Department of Computer Science and Engineering, University at Buffalo, Buffalo, USA
