Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection

Tsiaousis, Michail; Burghouts, Gertjan; Hillerström, Fieke; van der Putten, Peter

doi:10.1007/978-3-030-68799-1_9

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12664)

Included in the following conference series:

International Conference on Pattern Recognition

Abstract

The dominant paradigm in spatiotemporal action detection is to classify actions using spatiotemporal features learned by 2D or 3D Convolutional Networks. We argue that several actions are characterized by their context, such as relevant objects and actors present in the video. To this end, we introduce an architecture based on self-attention and Graph Convolutional Networks that models contextual cues, such as actor-actor and actor-object interactions, to improve human action detection in video. We aim to achieve this in a weakly-supervised setting, i.e. using as few action bounding box annotations as possible. Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training. We evaluate how well our model highlights the relevant context by introducing a quantitative metric based on the recall of objects retrieved by attention maps. Our model relies on a 3D convolutional RGB stream and does not require expensive optical flow computation. We evaluate our models on the DALY dataset, which consists of human-object interaction actions. Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP. Code is available at https://github.com/micts/acgcn.
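As a rough illustration of the idea in the abstract, the sketch below builds an attention-based adjacency between actor nodes and context nodes and applies one graph-convolution step. It is not the authors' implementation (see the linked repository for that): the function names, the toy feature vectors, and the identity weight matrix are all hypothetical, and the learned query/key projections that a real self-attention layer would use are omitted for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_adjacency(actor_feats, context_feats):
    """Row i holds actor i's attention weights over all context nodes.

    Assumption: scaled dot-product scores on raw features, with no
    learned projections.
    """
    scale = math.sqrt(len(actor_feats[0]))
    return [
        softmax([dot(a, c) / scale for c in context_feats])
        for a in actor_feats
    ]

def graph_conv(adjacency, context_feats, weight):
    """One graph-convolution step: ReLU(A @ X @ W)."""
    dim = len(context_feats[0])
    out = []
    for row in adjacency:
        # Aggregate context features, weighted by attention.
        agg = [sum(w * c[k] for w, c in zip(row, context_feats))
               for k in range(dim)]
        # Linear projection followed by ReLU.
        out.append([
            max(0.0, sum(agg[k] * weight[k][j] for k in range(dim)))
            for j in range(len(weight[0]))
        ])
    return out

# Hypothetical toy inputs: 2 actors, 3 context nodes, 2-d features.
actors = [[1.0, 0.0], [0.0, 1.0]]
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only

A = attention_adjacency(actors, context)
H = graph_conv(A, context, W)
```

Each row of `A` can be read as an attention map over the context nodes for one actor, which is the kind of quantity the abstract proposes to visualize; `H` holds the context-enriched actor features.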

M. Tsiaousis—This work was carried out during an internship at TNO.

Acknowledgements

We would like to thank Philippe Weinzaepfel for providing us with the predicted action tubes of their tracking-by-detection model.

Author information

Authors and Affiliations

Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
Michail Tsiaousis & Peter van der Putten
TNO, Oude Waalsdorperweg 63, 2597 AK The Hague, The Netherlands
Gertjan Burghouts & Fieke Hillerström

Corresponding author

Correspondence to Michail Tsiaousis.

Editor information

Editors and Affiliations

Dipartimento di Ingegneria dell'Informazione, University of Firenze, Florence, Firenze, Italy
Alberto Del Bimbo
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Rita Cucchiara
Department of Computer Science, Boston University, Boston, MA, USA
Stan Sclaroff
Dipartimento di Matematica e Informatica, University of Catania, Catania, Catania, Italy
Giovanni Maria Farinella
Cloud & AI, JD.COM, Beijing, China
Tao Mei
Dipartimento di Ingegneria dell’Informazione, University of Firenze, Firenze, Italy
Marco Bertini
Computational Sciences Department, National Institute of Astrophysics, Optics and Electronics (INAOE), Tonantzintla, Puebla, Mexico
Hugo Jair Escalante
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Roberto Vezzani

About this paper

Cite this paper

Tsiaousis, M., Burghouts, G., Hillerström, F., van der Putten, P. (2021). Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12664. Springer, Cham. https://doi.org/10.1007/978-3-030-68799-1_9

DOI: https://doi.org/10.1007/978-3-030-68799-1_9
Published: 05 March 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68798-4
Online ISBN: 978-3-030-68799-1
eBook Packages: Computer Science, Computer Science (R0)

Societies and partnerships

The International Association for Pattern Recognition
