Automatic recognition of handwritten medical forms for search engines

  • Robert Jay Milewski
  • Venu Govindaraju
  • Anurag Bhardwaj
Original Paper

Abstract

We present a new paradigm that models the relationship between handwriting and topic categories in the context of medical forms. The ultimate goals are (1) a robust method that assigns medical forms to specified categories, and (2) the use of that information in practical applications, such as improved recognition of medical handwriting or search-engine-style retrieval of medical forms. Medical forms draw on large, diverse and complex lexicons that combine English, medical and pharmacology corpora. Our technique shows that a few recognized characters, returned by handwriting recognition, can be used to construct a linguistic model capable of representing a medical topic category. This allows (1) a reduced lexicon to be constructed, thereby improving handwriting recognition performance, and (2) Pre-Hospital Care Report (PCR) forms to be tagged with a topic category and subsequently searched by information retrieval systems. We report an improvement of over 7% in raw recognition rate and a mean average precision of 0.28 over a set of 1,175 queries on a data set of unconstrained handwritten medical forms filled out in emergency environments.
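
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of the standard definition of mean average precision. It is not the authors' evaluation code: the function names, the toy document identifiers, and the convention that a query with no relevant documents scores 0 are assumptions made for illustration.

```python
from typing import Hashable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average precision for one query.

    `ranked` is the retrieved list, best match first (assumed free of
    duplicates); `relevant` is the set of documents judged relevant.
    """
    if not relevant:
        # Assumed convention: a query with no relevant documents scores 0.
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            # Precision measured at the rank of each relevant hit.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute 0.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Sequence[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of the per-query average precision over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example with two hypothetical queries over hypothetical form IDs.
runs = [
    (["form_a", "form_b", "form_c", "form_d"], {"form_a", "form_c"}),
    (["form_x", "form_a"], {"form_a"}),
]
print(round(mean_average_precision(runs), 4))
```

Under this definition, a score of 0.28 averaged over 1,175 queries summarizes how early relevant forms tend to appear in the ranked results, with 1.0 meaning every relevant form is ranked ahead of every irrelevant one.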

Keywords

Handwriting analysis, Language models, Pattern matching, Retrieval models, Search process

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Robert Jay Milewski (1)
  • Venu Govindaraju (1)
  • Anurag Bhardwaj (1)
  1. Center of Excellence for Document Analysis and Recognition, UB Commons, Amherst, USA
