Abstract
We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g., agent, tool), and bounding-box groundings of entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike in captioning, GSR is straightforward to evaluate. To study this new task we create the Situations With Groundings (SWiG) dataset, which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imSitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite, with relative gains between 8% and 32%. Finally, we show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic-aware image retrieval. Code and data are available at https://prior.allenai.org/projects/gsr.
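To make the structured output concrete, the sketch below shows what a grounded situation frame might look like, together with a grounding-correctness check. This is an illustrative assumption rather than the authors' code: the field names, the example roles shown for "carrying", and the IoU >= 0.5 matching threshold follow common detection conventions and the description in the abstract, not necessarily the exact SWiG schema.

```python
# Illustrative sketch of a GSR output and a grounding check.
# Field names and example values are hypothetical; they mirror the structure
# described in the abstract (verb + role->entity frame + boxes), not the
# exact SWiG annotation format.
from typing import Dict, List, Optional

Box = List[float]  # [x1, y1, x2, y2]

example_situation: Dict = {
    "verb": "carrying",
    "roles": {
        "agent": {"noun": "man",    "box": [12.0, 30.0, 210.0, 400.0]},
        "item":  {"noun": "box",    "box": [90.0, 60.0, 180.0, 150.0]},
        "place": {"noun": "street", "box": None},  # entity present, not visibly grounded
    },
}

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def grounding_correct(pred_noun: str, pred_box: Optional[Box],
                      gold_noun: str, gold_box: Optional[Box],
                      thresh: float = 0.5) -> bool:
    """A role prediction counts as grounded-correct when the predicted noun
    matches the gold noun and, for grounded roles, the predicted box overlaps
    the gold box with IoU >= thresh (the conventional detection threshold,
    assumed here)."""
    if pred_noun != gold_noun:
        return False
    if gold_box is None:  # ungrounded role: only the noun is scored
        return pred_box is None
    return pred_box is not None and iou(pred_box, gold_box) >= thresh
```

For example, `grounding_correct("box", [92, 61, 178, 149], "box", [90, 60, 180, 150])` returns `True`, since the nouns match and the boxes overlap with IoU of roughly 0.93. Checks of this form would be applied per role, which is one reason GSR is more straightforward to evaluate than free-form captioning.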
Cite this paper
Pratt, S., Yatskar, M., Weihs, L., Farhadi, A., Kembhavi, A.: Grounded Situation Recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. LNCS, vol. 12349. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_19