Correction of Medical Handwriting OCR Based on Semantic Similarity

  • Bartosz Broda
  • Maciej Piasecki
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4881)


This paper presents a method for correcting the output of handwriting Optical Character Recognition (OCR) based on semantic similarity. Several ways of extracting semantic similarity measures from a corpus are analysed; the best results were achieved by combining a text-window context with the Rank Weight Function. An algorithm is proposed for selecting a word sequence with high internal similarity. The method was trained on and applied to a corpus of real medical documents written in Polish.


semantic similarity, handwriting, OCR correction, Polish
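The selection idea outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract gives no details, so everything below is an assumption made for illustration. Word vectors are built from raw co-occurrence counts in a text window, plain cosine similarity stands in for the Rank Weight Function (which is not reproduced here), and the candidate sequence is chosen by exhaustive search over a tiny English toy corpus rather than Polish medical text. All function names (`window_vectors`, `cosine`, `internal_similarity`, `select_sequence`) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product
from math import sqrt


def window_vectors(tokens, window=2):
    """Count, for each word, the words seen within +/- `window` positions."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def internal_similarity(sequence, vectors):
    """Sum of pairwise similarities between all words of a candidate sequence."""
    empty = Counter()
    total = 0.0
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            total += cosine(vectors.get(sequence[i], empty),
                            vectors.get(sequence[j], empty))
    return total


def select_sequence(candidates, vectors):
    """Pick one word per position so that internal similarity is maximal.

    `candidates` is a list of per-position candidate lists, as an OCR
    classifier might return. Exhaustive search: fine for a toy example only.
    """
    return max(product(*candidates),
               key=lambda seq: internal_similarity(seq, vectors))


if __name__ == "__main__":
    # Toy corpus (assumption): stands in for the training corpus.
    corpus = ("patient has fever and cough patient has cough and fever "
              "doctor gives syrup for cough doctor gives tablets for fever "
              "car needs fuel and oil car needs oil and fuel").split()
    vectors = window_vectors(corpus, window=2)
    # Two positions, each with two visually similar OCR candidates.
    print(select_sequence([["fever", "fuel"], ["cough", "couch"]], vectors))
```

The exhaustive search grows exponentially with the number of positions, so a realistic system would need a bounded search; the sketch only shows how distributional similarity can prefer a mutually coherent reading over an isolated one.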





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Bartosz Broda (1)
  • Maciej Piasecki (1)
  1. Institute of Applied Informatics, Wrocław University of Technology, Poland
