
Closed-loop reasoning with graph-aware dense interaction for visual dialog

  • Regular Paper
  • Multimedia Systems

Abstract

Visual dialog is an attractive vision-language task that requires predicting the correct answer to a given question from the dialog history and the image. Although researchers have offered diverse solutions for connecting text with vision, multi-modal information still receives inadequate interaction for semantic alignment. To address this problem, we propose closed-loop reasoning with graph-aware dense interaction, which discovers cues through the dynamic structure of a graph and leverages them to enrich both dialog and image features. Moreover, we analyze the statistics of the linguistic entities hidden in the dialog to verify the reliability of the graph construction. Experiments on two VisDial datasets indicate that our model achieves competitive results against previous methods. An ablation study and parameter analysis further demonstrate the effectiveness of our model.
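Since the paper body is not included on this page, the following is only a minimal, illustrative sketch of the idea summarized in the abstract: dialog entities form a dynamically weighted graph, messages propagated over that graph yield a cue, and the cue is fed back to refine both the dialog and image features in a closed loop. All module names, dimensions, and the PyTorch framing are assumptions made here for illustration; they are not the authors' implementation (which, per the acknowledgements, was built on the PaddlePaddle platform).

```python
import torch
import torch.nn as nn


class GraphAwareDenseInteraction(nn.Module):
    """Illustrative sketch (not the authors' code): entity nodes from the
    dialog attend over a dynamically built graph, and the aggregated cue
    refines both the dialog and image features in a closed loop."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.edge_proj = nn.Linear(2 * dim, 1)      # scores edges between entity nodes
        self.node_proj = nn.Linear(dim, dim)        # transforms propagated messages
        self.dialog_gate = nn.Linear(2 * dim, dim)  # gates the cue for the dialog feature
        self.image_gate = nn.Linear(2 * dim, dim)   # gates the cue for each image region

    def forward(self, entity_feats, dialog_feat, image_feats):
        # entity_feats: (N, dim) nodes extracted from the dialog history
        # dialog_feat:  (dim,)   pooled question/dialog representation
        # image_feats:  (R, dim) region features, e.g. from an object detector
        n, dim = entity_feats.shape

        # 1) Build a dynamic, fully connected graph over the dialog entities.
        src = entity_feats.unsqueeze(1).expand(n, n, dim)
        dst = entity_feats.unsqueeze(0).expand(n, n, dim)
        adj = torch.softmax(
            self.edge_proj(torch.cat([src, dst], dim=-1)).squeeze(-1), dim=-1)

        # 2) Propagate messages along the graph and pool them into a cue vector.
        cue = torch.tanh(self.node_proj(adj @ entity_feats)).mean(dim=0)  # (dim,)

        # 3) Closed loop: the cue refines the dialog feature and every image region.
        dialog_out = dialog_feat + torch.sigmoid(
            self.dialog_gate(torch.cat([dialog_feat, cue], dim=-1))) * cue
        cue_r = cue.expand(image_feats.size(0), dim)
        image_out = image_feats + torch.sigmoid(
            self.image_gate(torch.cat([image_feats, cue_r], dim=-1))) * cue
        return dialog_out, image_out
```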



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (U21B2024, 62002257), the State Key Laboratory of Communication Content Cognition (Grant No. A02106), the Open Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. 20K04), the China Postdoctoral Science Foundation (2021M692395), and the Baidu Program. In addition, we sincerely thank the Baidu Program for the PaddlePaddle platform.

Author information

Corresponding author

Correspondence to Ning Xu.

Additional information

Communicated by B-K. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Liu, AA., Zhang, G., Xu, N. et al. Closed-loop reasoning with graph-aware dense interaction for visual dialog. Multimedia Systems 28, 1823–1832 (2022). https://doi.org/10.1007/s00530-022-00947-1

