VisualCOMET: Reasoning About the Dynamic Context of a Still Image

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12350)


Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame. For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, that his intent at the moment is to stay alive, and that he will need help in the near future or else he will be washed away. We propose VisualCOMET (Visual Commonsense Reasoning in Time), a novel framework of visual commonsense reasoning tasks that predict events that might have happened before, events that might happen next, and the intents of the people at present. To support research toward visual commonsense reasoning, we introduce the first large-scale repository of Visual Commonsense Graphs, which consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 59,000 images, each paired with short video summaries of before and after. In addition, we provide person grounding (i.e., co-reference links) between people appearing in the image and people mentioned in the textual commonsense descriptions, allowing for tighter integration between images and text. We establish strong baseline performances on this task and demonstrate that integrating visual and textual commonsense reasoning is key and outperforms non-integrative alternatives.
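The annotation structure described above (an event at present, inferences about before, after, and intent, plus person grounding) can be sketched as a simple record. Note this is a minimal illustrative sketch: the class and field names below are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommonsenseInference:
    """One textual inference about an image (illustrative schema)."""
    relation: str                                    # "before", "after", or "intent"
    text: str                                        # e.g. "fell into the water"
    person_ids: List[int] = field(default_factory=list)  # grounded person tags


@dataclass
class VisualCommonsenseGraph:
    """Inferences annotated for one image, anchored on the event at present."""
    image_id: str
    event_at_present: str
    inferences: List[CommonsenseInference] = field(default_factory=list)

    def by_relation(self, relation: str) -> List[CommonsenseInference]:
        """Select all inferences of one type (before / after / intent)."""
        return [inf for inf in self.inferences if inf.relation == relation]


# Example built from the drowning-man scenario in the abstract;
# "Person1" stands in for a person-grounding link to a region in the image.
graph = VisualCommonsenseGraph(
    image_id="example_001",
    event_at_present="Person1 is struggling to stay afloat in the water",
    inferences=[
        CommonsenseInference("before", "Person1 fell into the water", [1]),
        CommonsenseInference("intent", "stay alive", [1]),
        CommonsenseInference("after", "Person1 will need help or be washed away", [1]),
    ],
)
print(len(graph.by_relation("before")))  # → 1
```

Keying every inference to an event at present and to grounded person tags is what allows the before/after/intent predictions to be evaluated per person rather than per image.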



This research was supported in part by NSF (IIS1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and gifts from Allen Institute for Artificial Intelligence.

Supplementary material

Supplementary material 1 (pdf 3877 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Paul G. Allen School of Computer Science and Engineering, Seattle, USA
  2. Allen Institute for Artificial Intelligence, Seattle, USA
