Grounded Situation Recognition

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)


We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g., agent, tool), and bounding-box groundings of entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike captioning, GSR is straightforward to evaluate. To study this new task, we create the Situations With Groundings (SWiG) dataset, which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imSitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite, with relative gains between 8% and 32%. Finally, we show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic-aware image retrieval. Code and data available at
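To make the structured output described in the abstract concrete, the following is a minimal Python sketch of a grounded situation: a verb, a set of role-to-entity assignments, and an optional bounding box per entity. It is an illustration only, not the authors' data format or metric code. The class names, the `(x1, y1, x2, y2)` box convention, the example values, and the 0.5 overlap threshold in `role_correct` are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RoleFiller:
    """An entity filling one semantic role, with an optional grounding."""
    noun: str
    box: Optional[Box] = None  # None when the entity has no box


@dataclass
class GroundedSituation:
    """Primary activity plus its role -> grounded entity assignments."""
    verb: str
    roles: Dict[str, RoleFiller] = field(default_factory=dict)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def role_correct(pred: RoleFiller, gold: RoleFiller,
                 threshold: float = 0.5) -> bool:
    """Hypothetical per-role check: the noun must match and the boxes must
    overlap by at least `threshold` (an assumed value). If either side has
    no box, both must have none."""
    if pred.noun != gold.noun:
        return False
    if pred.box is None or gold.box is None:
        return pred.box is None and gold.box is None
    return iou(pred.box, gold.box) >= threshold


# Hypothetical example: a person cutting with a knife.
gold = GroundedSituation(
    verb="cutting",
    roles={
        "agent": RoleFiller("person", (10.0, 20.0, 110.0, 220.0)),
        "tool": RoleFiller("knife", (90.0, 100.0, 140.0, 130.0)),
    },
)
```

Because each prediction is a fixed frame of verb, roles, and boxes, comparing it with an annotation reduces to label matching plus a box-overlap test, which is one way to read the abstract's remark that GSR is easier to evaluate than free-form captions.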


Keywords: Situation recognition, Scene understanding, Grounding


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Allen Institute for AI, Seattle, USA
  2. University of Washington, Seattle, USA
