Guessing State Tracking for Visual Dialogue

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)


The Guesser performs visual grounding in GuessWhat?!-like visual dialogue: it locates the target object in an image, which the Oracle has chosen privately, through a question-answer dialogue between a Questioner and the Oracle. Most existing guessers make exactly one guess, after receiving all question-answer pairs of a dialogue with a predefined number of rounds. This paper introduces a guessing state for the Guesser and treats guessing as a process in which the guessing state changes over the course of the dialogue. A guess model based on guessing state tracking is therefore proposed. The guessing state is defined as a distribution over the objects in the image. Given this definition, two loss functions are defined as supervision signals that guide the guessing state during model training. Early supervision provides supervision to the Guesser at early rounds, and incremental supervision encourages the guessing state to change monotonically. Experimental results on the GuessWhat?! dataset show that our model significantly outperforms previous models and achieves a new state of the art; in particular, its guessing success rate of 83.3% approaches the human-level accuracy of 84.4%.
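The idea of a guessing state and its two supervision signals can be illustrated with a small sketch. The abstract does not give the exact loss formulas, so everything below is an assumption made for illustration: the function names are hypothetical, early supervision is rendered as a cross-entropy on the target object averaged over the supplied rounds, and incremental supervision is rendered as a hinge penalty on any round-to-round drop in the target's probability. It should not be read as the authors' implementation.

```python
import math

def softmax(scores):
    """Turn per-object scores into a guessing state (a distribution over objects)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def early_supervision_loss(states, target):
    """Assumed form: cross-entropy on the target object, averaged over the given rounds."""
    return -sum(math.log(state[target]) for state in states) / len(states)

def incremental_supervision_loss(states, target):
    """Assumed form: average hinge penalty on drops in the target's probability."""
    drops = [max(0.0, prev[target] - cur[target])
             for prev, cur in zip(states, states[1:])]
    return sum(drops) / len(drops) if drops else 0.0

# Hypothetical per-round scores for three candidate objects; object 0 is the target.
round_scores = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
states = [softmax(scores) for scores in round_scores]

print(early_supervision_loss(states, target=0))
print(incremental_supervision_loss(states, target=0))
```

In this toy dialogue the probability of the target grows at every round, so the assumed incremental penalty is zero; reversing the order of the rounds would make it positive.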


Keywords: Visual dialogue, Visual grounding, Guessing state tracking, GuessWhat?!



We thank the reviewers for their comments and suggestions. This paper is supported by NSFC (No. 61906018), Huawei Noah’s Ark Lab and MoE-CMCC “Artificial Intelligence” Project (No. MCM20190701).



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Center for Intelligence Science and Technology, School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
