
Asking questions on handwritten document collections

  • Special Issue Paper
  • Published:
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations, where the answer is a short text, we aim to locate a document snippet in which the answer lies. The proposed approach works without recognizing the text in the documents. We argue that a recognition-free approach is suitable for handwritten documents and historical collections, where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers are a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network that can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate the proposed approach on two new datasets: (i) HW-SQuAD, a synthetic, handwritten document image counterpart of the SQuAD 1.0 dataset, and (ii) BenthamQA, a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach that uses text recognized from the images by an OCR. The datasets presented in this work are available for download at docvqa.org.
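
The retrieval idea described above, projecting textual query words and word images into a common sub-space and ranking snippets by similarity, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embed_text and embed_word_image functions are hypothetical placeholders for the off-the-shelf embedding network, and snippets are scored with plain cosine similarity over unit-normalized vectors.

```python
# Minimal sketch (not the authors' code): recognition-free snippet retrieval
# via a shared text/word-image embedding space. The embedding functions below
# are placeholders standing in for a pretrained embedding network.
import numpy as np

EMB_DIM = 128
rng = np.random.default_rng(0)

def embed_text(word: str) -> np.ndarray:
    """Placeholder: project a textual query word into the common sub-space."""
    rng_w = np.random.default_rng(abs(hash(word)) % (2**32))
    v = rng_w.standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)

def embed_word_image(word_image) -> np.ndarray:
    """Placeholder: project a cropped word image into the same sub-space."""
    v = rng.standard_normal(EMB_DIM)  # stands in for a CNN forward pass
    return v / np.linalg.norm(v)

def rank_snippets(question_words, snippets):
    """Rank document snippets by the best cosine similarity between any
    query-word embedding and any word-image embedding in the snippet."""
    q_embs = np.stack([embed_text(w) for w in question_words])
    scores = []
    for snippet_id, word_images in snippets.items():
        s_embs = np.stack([embed_word_image(im) for im in word_images])
        # cosine similarity reduces to a dot product for unit vectors
        scores.append((snippet_id, float((q_embs @ s_embs.T).max())))
    return sorted(scores, key=lambda x: x[1], reverse=True)

if __name__ == "__main__":
    # dummy "snippets": lists of (placeholder) word images
    snippets = {"doc1_snip3": [None] * 12, "doc7_snip1": [None] * 9}
    print(rank_snippets(["bentham", "panopticon"], snippets))
```

In such a scheme, only the embedding network needs to handle handwriting; no transcription is produced, and the top-ranked snippet image is returned directly to the user.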



Notes

  1. https://github.com/facebookresearch/DrQA.

  2. https://pypi.org/project/Unidecode/.

  3. https://fonts.google.com/.

  4. https://annotate.deepset.ai.

  5. https://github.com/kris314/e2eEmbed.

  6. https://huggingface.co/transformers/pretrained_models.html.

  7. https://github.com/deepset-ai/haystack.


Author information

Corresponding author

Correspondence to Minesh Mathew.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Mathew, M., Gomez, L., Karatzas, D. et al. Asking questions on handwritten document collections. IJDAR 24, 235–249 (2021). https://doi.org/10.1007/s10032-021-00383-3

