
Reciprocal question representation learning network for visual dialog


Abstract

The visual dialog task requires an agent to answer a series of questions based on an image and the dialog history. Bias often arises when the agent over-relies on the dialog history, so balanced use of that history is crucial. Existing models typically drop several rounds of dialog history or learn a sparse dialog structure to address this overreliance; however, bias may still remain in the selected history. We therefore propose a new model, the reciprocal question representation learning network (RQRLN), which reduces dialog-history bias by learning more accurate history-aware question representations. RQRLN first adaptively selects favorable information at the token level from two representations of a question, one encoded with the dialog history and one without it. The adaptive question representation is then combined with the corresponding image for the final decoder. We also introduce a new entropy loss function that further reduces history-based bias by enabling the two representations of the same token to learn from each other. Experiments on the VisDial v1.0 dataset show that the proposed model achieves state-of-the-art results in terms of normalized discounted cumulative gain (NDCG). We also demonstrate that our model exhibits less bias and infers more generic answers than models that use the entire history.
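To make the token-level selection concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a per-token gate chooses between a history-aware and a history-free encoding of the question, and an entropy-style regularizer keeps both views in play. All module names, dimensions, and the exact form of the loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenLevelGate(nn.Module):
    """Per-token gate between a history-aware and a history-free question encoding.
    Layer choices and dimensions are illustrative assumptions, not the paper's design."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)  # one gate score per token

    def forward(self, q_hist: torch.Tensor, q_plain: torch.Tensor):
        # q_hist, q_plain: (batch, num_tokens, dim)
        g = torch.sigmoid(self.gate(torch.cat([q_hist, q_plain], dim=-1)))  # (batch, num_tokens, 1)
        q_adaptive = g * q_hist + (1.0 - g) * q_plain  # token-level adaptive question representation
        return q_adaptive, g

def gate_entropy_loss(g: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy-style regularizer (an assumption about the paper's entropy loss):
    maximizing the entropy of the per-token gate keeps both encodings involved,
    discouraging the model from leaning entirely on the history-aware view."""
    p = torch.cat([g, 1.0 - g], dim=-1)           # two-way distribution per token
    entropy = -(p * (p + eps).log()).sum(dim=-1)  # per-token entropy
    return -entropy.mean()                        # negate so minimizing the loss maximizes entropy

# Usage sketch with random tensors standing in for the two question encodings.
if __name__ == "__main__":
    batch, tokens, dim = 2, 20, 512
    q_hist = torch.randn(batch, tokens, dim)
    q_plain = torch.randn(batch, tokens, dim)
    gate = TokenLevelGate(dim)
    q_adaptive, g = gate(q_hist, q_plain)
    loss = gate_entropy_loss(g)
    print(q_adaptive.shape, loss.item())
```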



Acknowledgements

We thank the reviewers for their comments and suggestions. This work was supported by the National Natural Science Foundation of China (No. 62076032).

Author information


Corresponding author

Correspondence to Hongwei Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, H., Wang, X. & Jiang, S. Reciprocal question representation learning network for visual dialog. Appl Intell 53, 4924–4939 (2023). https://doi.org/10.1007/s10489-022-03795-8

