Language Modelling for the Needs of OCR of Medical Texts

  • Maciej Piasecki
  • Grzegorz Godlewski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4345)


This paper discusses different methods of constructing language models for a corpus of medical texts written in an inflective language, namely Polish. The main result is a language modelling method that sequentially combines tri-grams of morphological base forms with tri-grams of words. Introducing base form tri-grams improved the overall performance of the combined model, measured as the improvement in the accuracy of OCR of handwriting, as well as its ability to generalise. The latter was shown by using corpora of two different types for training and for testing. Detailed results of tests run on a large corpus of real-life medical language are discussed. An experimental system for OCR of handwritten epicrises that uses the proposed model is presented. The proposed language model decreases the overall error of the system by 64.2% (51% when the training and test corpora are of different types).
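The abstract does not spell out how the two tri-gram models are chained, so the following is only a minimal sketch of one possible reading of "sequentially combines": OCR word candidates are first filtered with a tri-gram model over base forms, and the survivors are then ranked with a tri-gram model over word forms. The toy corpus, the `LEMMA` dictionary, the add-one smoothing, and the names `TrigramModel`, `rescore`, and `keep` are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


class TrigramModel:
    """Add-one smoothed tri-gram model over arbitrary tokens (assumed smoothing)."""

    def __init__(self, sentences):
        self.tri = Counter()
        self.bi = Counter()
        self.vocab = set()
        for sent in sentences:
            padded = ["<s>", "<s>"] + list(sent)
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                self.tri[tuple(padded[i - 2:i + 1])] += 1
                self.bi[tuple(padded[i - 2:i])] += 1

    def logprob(self, t1, t2, t3):
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        return math.log((self.tri[(t1, t2, t3)] + 1) / (self.bi[(t1, t2)] + v))


# Hypothetical base-form dictionary; a real system would use a morphological analyser.
LEMMA = {
    "pacjent": "pacjent", "pacjenta": "pacjent", "pacjentowi": "pacjent",
    "w": "w", "stanie": "stan", "stali": "stal",
    "dobrym": "dobry", "ogólnym": "ogólny",
}

# Toy training corpus (illustrative only).
CORPUS = [
    ["pacjent", "w", "stanie", "dobrym"],
    ["pacjenta", "w", "stanie", "ogólnym", "dobrym"],
]

word_lm = TrigramModel(CORPUS)
lemma_lm = TrigramModel([[LEMMA[w] for w in sent] for sent in CORPUS])


def rescore(history, candidates, keep=2):
    """Pick one OCR candidate for the word following `history`."""
    h1, h2 = history[-2], history[-1]
    l1, l2 = LEMMA[h1], LEMMA[h2]
    # Stage 1: keep the `keep` best candidates under the base-form tri-gram model.
    survivors = sorted(
        candidates,
        key=lambda c: lemma_lm.logprob(l1, l2, LEMMA[c]),
        reverse=True,
    )[:keep]
    # Stage 2: choose among the survivors with the word tri-gram model.
    return max(survivors, key=lambda c: word_lm.logprob(h1, h2, c))


print(rescore(["pacjenta", "w"], ["stali", "stanie"]))
print(rescore(["pacjentowi", "w"], ["stali", "stanie"], keep=1))
```

In the second call the word form "pacjentowi" never occurs in the toy corpus, so the word tri-gram model has no evidence, yet the base-form stage can still separate the candidates because the base form "pacjent" was seen in the same context. This is the kind of generalisation across inflected forms that the abstract attributes to base form tri-grams.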





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Maciej Piasecki (1)
  • Grzegorz Godlewski (1)
  1. Institute of Applied Informatics, Wrocław University of Technology, Wrocław, Poland
