
Asking questions on handwritten document collections

  • Special Issue Paper
  • Published:
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations, where the answer is a short text, we aim to locate a document snippet in which the answer lies. The proposed approach works without recognizing the text in the documents. We argue that a recognition-free approach is suitable for handwritten documents and historical collections, where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers are a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network that can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate the proposed approach on two new datasets: (i) HW-SQuAD, a synthetic, handwritten document image counterpart of the SQuAD 1.0 dataset, and (ii) BenthamQA, a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach that uses text recognized from the images by an OCR. The datasets presented in this work are available for download at docvqa.org.
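
The retrieval idea described above, projecting textual query words and word images into a common sub-space and ranking snippets by similarity, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embed_text and embed_word_image functions are hypothetical placeholders for the off-the-shelf embedding network, and snippets are scored with plain cosine similarity over unit-normalized vectors.

```python
# Minimal sketch (not the authors' code): recognition-free snippet retrieval
# via a shared text/word-image embedding space. The embedding functions below
# are placeholders standing in for a pretrained embedding network.
import numpy as np

EMB_DIM = 128
rng = np.random.default_rng(0)

def embed_text(word: str) -> np.ndarray:
    """Placeholder: project a textual query word into the common sub-space."""
    rng_w = np.random.default_rng(abs(hash(word)) % (2**32))
    v = rng_w.standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)

def embed_word_image(word_image) -> np.ndarray:
    """Placeholder: project a cropped word image into the same sub-space."""
    v = rng.standard_normal(EMB_DIM)  # stands in for a CNN forward pass
    return v / np.linalg.norm(v)

def rank_snippets(question_words, snippets):
    """Rank document snippets by the best cosine similarity between any
    query-word embedding and any word-image embedding in the snippet."""
    q_embs = np.stack([embed_text(w) for w in question_words])
    scores = []
    for snippet_id, word_images in snippets.items():
        s_embs = np.stack([embed_word_image(im) for im in word_images])
        # cosine similarity reduces to a dot product for unit vectors
        scores.append((snippet_id, float((q_embs @ s_embs.T).max())))
    return sorted(scores, key=lambda x: x[1], reverse=True)

if __name__ == "__main__":
    # dummy "snippets": lists of (placeholder) word images
    snippets = {"doc1_snip3": [None] * 12, "doc7_snip1": [None] * 9}
    print(rank_snippets(["bentham", "panopticon"], snippets))
```

In such a scheme, only the embedding network needs to handle handwriting; no transcription is produced, and the top-ranked snippet image is returned directly to the user.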



Notes

  1. https://github.com/facebookresearch/DrQA.

  2. https://pypi.org/project/Unidecode/.

  3. https://fonts.google.com/.

  4. https://annotate.deepset.ai.

  5. https://github.com/kris314/e2eEmbed.

  6. https://huggingface.co/transformers/pretrained_models.html.

  7. https://github.com/deepset-ai/haystack.


Author information

Corresponding author

Correspondence to Minesh Mathew.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Mathew, M., Gomez, L., Karatzas, D. et al. Asking questions on handwritten document collections. IJDAR 24, 235–249 (2021). https://doi.org/10.1007/s10032-021-00383-3

