Integrating natural language processing with image document analysis: what we learned from two real-world applications

  • Jinying Chen
  • Huaigu Cao
  • Premkumar Natarajan
Original Paper


Automatically accessing information from unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output contains errors and presents unique challenges to late-stage processing. This paper addresses two of these challenges: (1) translating the output of Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field-based sentence boundary detection and out-of-vocabulary (OOV) name detection. Sentence boundary detection significantly improves our state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation. The output of OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating a variety of features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition.
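The reranking idea can be illustrated with a minimal sketch. It assumes each candidate hit for a queried name carries a search-stage match score and an OOV name detection confidence, and that a linear model combines them. The `Hit` structure, feature names, weights, bias, and threshold below are all hypothetical and chosen for illustration; they are not taken from the paper, where the reranker is trained discriminatively and uses more features.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One candidate match for a queried name (hypothetical structure)."""
    location: str
    ocr_score: float      # match score from the search stage, assumed in [0, 1]
    oov_name_prob: float  # OOV name detector confidence, assumed in [0, 1]


# Illustrative values only; a real reranker would learn these from labeled data.
WEIGHTS = {"ocr_score": 1.0, "oov_name_prob": 2.0}
BIAS = -1.5


def rerank_score(hit: Hit) -> float:
    """Linear combination of the two features plus a bias term."""
    return (WEIGHTS["ocr_score"] * hit.ocr_score
            + WEIGHTS["oov_name_prob"] * hit.oov_name_prob
            + BIAS)


def rerank(hits, threshold=0.0):
    """Drop hits scoring below the threshold, then sort the rest best-first."""
    scored = [(rerank_score(h), h) for h in hits]
    kept = [(s, h) for s, h in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in kept]


hits = [
    Hit("page1:line3", ocr_score=0.80, oov_name_prob=0.10),
    Hit("page2:line7", ocr_score=0.60, oov_name_prob=0.90),
    Hit("page4:line1", ocr_score=0.70, oov_name_prob=0.50),
]
reranked = rerank(hits)
print([h.location for h in reranked])
```

In this toy setting, the first hit has the highest search score but is discarded because the detector sees little evidence of an OOV name at that location, which is the intuition behind using detection output to cut false alarms.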


Image document analysis · Machine translation · Keyword search · Natural language processing · Optical handwriting recognition



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, USA
  2. Department of Speech, Language and Multimedia, Raytheon BBN Technologies, Cambridge, USA
  3. Information Sciences Institute, Marina del Rey, USA
