
Document Collection Visual Question Answering

Part of the Lecture Notes in Computer Science book series (LNIP, volume 12822)

Abstract

Current tasks and methods in Document Understanding aim to process documents as isolated elements. However, documents are usually organized in collections (historical records, purchase invoices) that provide context useful for their interpretation. To address this problem, we introduce Document Collection Visual Question Answering (DocCVQA), a new dataset and associated task in which questions are posed over a whole collection of document images. The goal is not only to provide the answer to a given question, but also to retrieve the set of documents that contain the information needed to infer it. Along with the dataset, we propose a new evaluation metric and baselines that provide further insight into the new dataset and task.
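The task described above combines two evaluation aspects: scoring the predicted answer against the ground truth, and scoring the ranked list of retrieved evidence documents. As a minimal illustrative sketch (not the paper's actual metric), the snippet below combines two measures commonly used in these settings: edit-distance-based answer similarity (as in ANLS, used in DocVQA-style benchmarks) and average precision for evidence retrieval. The function names and the threshold value are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Answer score via normalized Levenshtein similarity, zeroed below
    a threshold tau (the 0.5 default follows common VQA practice)."""
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold))
    return 1.0 - nl if nl < tau else 0.0

def average_precision(ranked_ids, relevant_ids) -> float:
    """Evidence-retrieval score: average precision of a ranked list of
    document ids against the set of relevant (evidence) documents."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / max(len(relevant_ids), 1)
```

For example, a prediction "195" against gold answer "1954" has normalized edit distance 0.25 and thus scores 0.75, while ranking the two evidence documents at positions 1 and 3 of the retrieved list yields an average precision of (1/1 + 2/3)/2 ≈ 0.83.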

Keywords

  • Document collection
  • Visual Question Answering



Acknowledgements

This work has been supported by the UAB PIF scholarship B18P0070 and the Consolidated Research Group 2017-SGR-1783 from the Research and University Department of the Catalan Government.


Corresponding author

Correspondence to Rubèn Tito .


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Tito, R., Karatzas, D., Valveny, E. (2021). Document Collection Visual Question Answering. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_50


  • DOI: https://doi.org/10.1007/978-3-030-86331-9_50

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86330-2

  • Online ISBN: 978-3-030-86331-9
