
Reciprocal question representation learning network for visual dialog


Abstract

The visual dialog task requires an agent to answer a series of questions based on an image and the dialog history. Bias often arises when the agent over-relies on the dialog history, so balanced use of that history is crucial. Existing models typically drop several rounds of dialog history or learn a sparse dialog structure to address this overreliance; however, bias may still remain in the selected history. We therefore propose a new model, the reciprocal question representation learning network (RQRLN), which reduces dialog-history bias by learning more accurate history-aware question representations. RQRLN first adaptively selects favorable information at the token level from two representations of a question, one encoded with the dialog history and one without it. The adaptive question representation is then combined with the corresponding image for the final decoder. We also introduce a new entropy loss function that further reduces history-based bias by enabling the two representations of the same token to learn from each other. Experiments on the VisDial v1.0 dataset show that the proposed model achieves state-of-the-art results in terms of normalized discounted cumulative gain (NDCG). We also demonstrate that our model exhibits less bias and infers more generic answers than models that use the entire history.
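To make the token-level selection concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a per-token gate chooses between a history-aware and a history-free encoding of the question, and an entropy-style regularizer keeps both views in play. All module names, dimensions, and the exact form of the loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenLevelGate(nn.Module):
    """Per-token gate between a history-aware and a history-free question encoding.
    Layer choices and dimensions are illustrative assumptions, not the paper's design."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # one gate score per token

    def forward(self, q_hist: torch.Tensor, q_plain: torch.Tensor):
        # q_hist, q_plain: (batch, num_tokens, dim)
        g = torch.sigmoid(self.gate(torch.cat([q_hist, q_plain], dim=-1)))  # (batch, num_tokens, 1)
        q_adaptive = g * q_hist + (1.0 - g) * q_plain  # token-level adaptive question representation
        return q_adaptive, g

def gate_entropy_loss(g: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy-style regularizer (an assumption about the paper's entropy loss):
    maximizing the entropy of the per-token gate keeps both encodings involved,
    discouraging the model from leaning entirely on the history-aware view."""
    p = torch.cat([g, 1.0 - g], dim=-1)           # two-way distribution per token
    entropy = -(p * (p + eps).log()).sum(dim=-1)  # per-token entropy
    return -entropy.mean()                        # negate so minimizing the loss maximizes entropy

# Usage sketch with random tensors standing in for the two question encodings.
if __name__ == "__main__":
    batch, tokens, dim = 2, 20, 512
    q_hist = torch.randn(batch, tokens, dim)
    q_plain = torch.randn(batch, tokens, dim)
    gate = TokenLevelGate(dim)
    q_adaptive, g = gate(q_hist, q_plain)
    loss = gate_entropy_loss(g)
    print(q_adaptive.shape, loss.item())
```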



Acknowledgements

We thank the reviewers for their comments and suggestions. This work was supported by the National Natural Science Foundation of China (No. 62076032).

Author information


Corresponding author

Correspondence to Hongwei Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, H., Wang, X. & Jiang, S. Reciprocal question representation learning network for visual dialog. Appl Intell 53, 4924–4939 (2023). https://doi.org/10.1007/s10489-022-03795-8

