Abstract
Still image-based action recognition is a challenging task in which recognition must be performed from only a single input image. Exploiting auxiliary information such as pose, objects, or background is a common technique in this field; however, the simultaneous use of several auxiliary components, and their optimal combination, has received less attention. In this study, two cues, body joints and objects, are employed simultaneously, and an attention module is proposed to combine the features of these two components. The attention module consists of two self-attentions and a cross-attention, which model the interactions between objects, between joints, and between joints and objects, respectively. In addition, a Multi-scale Atrous Spatial Pyramid Pooling (MASPP) module is proposed to reduce the number of parameters of the method while combining the features obtained from different levels of the backbone, and a Joint Object Pooling (JOPool) module is proposed to extract local features from the joint and object regions. ResNets are used as the backbone, with the stride of the last two layers modified. Experimental results on several datasets show that combining multiple auxiliary components effectively increases the mean Average Precision (mAP) of recognition. The proposed method achieves 94.84%, 93.20%, and 91.25% mAP on three important datasets, Stanford-40, PASCAL VOC 2012, and BU101PLUS, respectively, surpassing the best previously reported methods.
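The paper does not reproduce its implementation here, but the cross-attention idea in the abstract, joint features attending to object features, can be illustrated with a minimal NumPy sketch. The shapes, dimension names, random projection matrices, and softmax normalization below are assumptions for illustration, not the authors' exact design:

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(joints, objects, d_k, seed=0):
    """Joints attend to objects: queries from joints, keys/values from objects.

    joints:  (num_joints, d)   hypothetical per-joint feature vectors
    objects: (num_objects, d)  hypothetical per-object feature vectors
    Returns joint features enriched with object context, shape (num_joints, d_k).
    """
    rng = np.random.default_rng(seed)
    d = joints.shape[1]
    # Hypothetical learned projections (random here, trained in practice).
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q, K, V = joints @ Wq, objects @ Wk, objects @ Wv
    # Scaled dot-product attention: each joint weights every object.
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (num_joints, num_objects)
    return attn @ V


rng = np.random.default_rng(1)
joint_feats = rng.standard_normal((17, 64))   # e.g. 17 body joints
object_feats = rng.standard_normal((5, 64))   # e.g. 5 detected objects
out = cross_attention(joint_feats, object_feats, d_k=32)
print(out.shape)  # (17, 32)
```

The two self-attention branches mentioned in the abstract would follow the same pattern with queries, keys, and values all drawn from the same cue (joints only, or objects only).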
Data availability
The data that support the findings of this study are available at the locations cited in references [2, 13, 49]. These data were derived from the following resources available in the public domain:
• Stanford-40 Actions: http://vision.stanford.edu/Datasets/40actions.html
• PASCAL VOC 2012: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
• BU101PLUS: https://github.com/seyedsajadashrafi/bu101plus-action-recognition-dataset
References
Akti S, Ofli F, Imran M, Ekenel HK (2021) “Fight Detection from Still Images in the Wild,” Proc. - 2022 IEEE/CVF Winter Conf. Appl. Comput. Vis. Work. WACVW 2022, pp. 550–559, https://doi.org/10.48550/arxiv.2111.08370
Ashrafi SS, Shokouhi SB, Ayatollahi A (Jul. 2021) Action recognition in still images using a multi-attention guided network with weakly supervised saliency detection. Multimed Tools Appl 2021:1–27. https://doi.org/10.1007/S11042-021-11215-1
Beddiar DR, Nini B, Sabokrou M, Hadid A (2020) Vision-based human activity recognition: a survey. Multimed Tools Appl 79:1–47. https://doi.org/10.1007/s11042-020-09004-3
Cao Y, Liu C, Huang Z, Sheng Y, Ju Y (Jun. 2021) Skeleton-based action recognition with temporal action graph and temporal adaptive graph convolution structure. Multimed Tools Appl 2021:1–24. https://doi.org/10.1007/S11042-021-11136-Z
Chakraborty S, Mondal R, Singh PK, Sarkar R, Bhattacharjee D (2021) Transfer learning with fine tuning for human action recognition from still images. Multimed Tools Appl 80(13):20547–20578. https://doi.org/10.1007/S11042-021-10753-Y
Chapariniya M, Ashrafi SS, Shokouhi SB (2020) “Knowledge Distillation Framework for Action Recognition in Still Images”, 2020 10th Int. Conf Comput Knowl Eng ICCKE 2020, pp. 274–277, https://doi.org/10.1109/ICCKE50421.2020.9303716
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2016) “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Accessed: Aug. 12, 2021. [Online]. Available: https://arxiv.org/abs/1606.00915v2
Chollet F (2016) “Xception: Deep Learning with Depthwise Separable Convolutions,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 1800–1807, Accessed: Aug. 11, 2021. [Online]. Available: https://arxiv.org/abs/1610.02357v3
Chu J, Guo Z, Leng L (Mar. 2018) Object detection based on multi-layer convolution feature fusion and online hard example mining. IEEE Access 6:19959–19967. https://doi.org/10.1109/ACCESS.2018.2815149
Dehkordi HA, Nezhad AS, Ashrafi SS, Shokouhi SB (2021) “Still Image Action Recognition Using Ensemble Learning,” 2021 7th Int. Conf Web Res ICWR 2021, pp. 125–129, https://doi.org/10.1109/ICWR51868.2021.9443021
Dehkordi HA, Nezhad AS, Kashiani H, Shokouhi SB, Ayatollahi A (2022) “Multi-expert human action recognition with hierarchical super-class learning”, Knowledge-Based Syst., p. 109091, https://doi.org/10.1016/J.KNOSYS.2022.109091
Dosovitskiy A et al. (2020) “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, Accessed: Aug. 12, 2021. [Online]. Available: https://arxiv.org/abs/2010.11929v2
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (Jun. 2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338. https://doi.org/10.1007/s11263-009-0275-4
Gkioxari G, Girshick R, Malik J (2015) “Contextual action recognition with R∗CNN,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp 1080–1088 https://doi.org/10.1109/ICCV.2015.129
Guo G, Lai A (2014) A survey on still image based human action recognition. Pattern Recogn 47(10):3343–3361. https://doi.org/10.1016/j.patcog.2014.04.018
He K, Zhang X, Ren S, Sun J (2016) “Deep residual learning for image recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-December, pp. 770–778, https://doi.org/10.1109/CVPR.2016.90
He K, Gkioxari G, Dollár P, Girshick R (Feb. 2020) Mask R-CNN. IEEE Trans Pattern Anal Mach Intell 42(2):386–397. https://doi.org/10.1109/TPAMI.2018.2844175
Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 60:4–21. https://doi.org/10.1016/j.imavis.2017.01.010
Hinton G, Vinyals O, Dean J (2015) “Distilling the Knowledge in a Neural Network”, Accessed: Aug. 11, 2021. [Online]. Available: https://arxiv.org/abs/1503.02531v1.
Hu T, Zhu X, Guo W, Wang S, Zhu J (Feb. 2018) Human action recognition based on scene semantics. Multimed Tools Appl 78(20):28515–28536. https://doi.org/10.1007/S11042-017-5496-X
Kim S, Yun K, Park J, Choi JY (2019) “Skeleton-based Action Recognition of People Handling Objects”, Proc. - 2019 IEEE Winter Conf. Appl. Comput. Vision, WACV 2019, pp. 61–70, Accessed: Aug. 13, 2021. [Online]. Available: https://arxiv.org/abs/1901.06882v1
Kipf TN, Welling M(2016) “Semi-Supervised Classification with Graph Convolutional Networks,” 5th Int. Conf. Learn. Represent. ICLR 2017 - Conf. Track Proc., Accessed: Aug. 13, 2021. [Online]. Available: https://arxiv.org/abs/1609.02907v4
Li LJ, Fei-Fei L (2007) “What, where and who? Classifying events by scene and object recognition”, https://doi.org/10.1109/ICCV.2007.4408872
Li Y, Li K, Wang X (Aug. 2020) Recognizing actions in images by fusing multiple body structure cues. Pattern Recogn 104:107341. https://doi.org/10.1016/j.patcog.2020.107341
Liao X, Li K, Zhu X, Liu KJR (Aug. 2020) Robust detection of image operator chain with two-stream convolutional neural network. IEEE J Sel Top Signal Proc 14(5):955–968. https://doi.org/10.1109/JSTSP.2020.3002391
Liu L, Tan RT, You S (2019) “Loss Guided Activation for Action Recognition in Still Images”, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11365 LNCS, pp. 152–167, https://doi.org/10.1007/978-3-030-20873-8_10
Ludl D, Gulde T, Curio C (2019) “Simple yet efficient real-time pose-based action recognition”, in 2019 IEEE Intelligent Transportation Systems Conference, ITSC 2019, pp. 581–588, https://doi.org/10.1109/ITSC.2019.8917128
Ma W, Liang S (2020) “Human-object relation network for action recognition in still images”, Proc. - IEEE Int. Conf. Multimed. Expo, vol. 2020-July, https://doi.org/10.1109/ICME46284.2020.9102933.
Ma S, Bargal SA, Zhang J, Sigal L, Sclaroff S (Aug. 2017) Do less and achieve more: training CNNs for action recognition utilizing action images from the web. Pattern Recogn 68:334–345. https://doi.org/10.1016/j.patcog.2017.01.027
Maji S, Bourdev L, Malik J (2011) “Action Recognition from a Distributed Representation of Pose and Appearance,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
McAuley J, Leskovec J (2012) “Image labeling on a network: Using social-network metadata for image classification,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7575 LNCS, no. PART 4, pp. 828–841, https://doi.org/10.1007/978-3-642-33765-9_59.
Mi S, Zhang Y (2021) Pose-guided action recognition in static images using lie-group. Appl Intell 2021:1–9. https://doi.org/10.1007/S10489-021-02760-1
Mohammadi S, Majelan SG, Shokouhi SB (2019) “Ensembles of deep neural networks for action recognition in still images”, 2019 9th Int. Conf. Comput. Knowl. Eng. ICCKE 2019, pp. 315–318, https://doi.org/10.1109/ICCKE48569.2019.8965014
Procesi C (2007) “Lie groups: an approach through invariants and representations,” p. 596
Qi T, Xu Y, Quan Y, Wang Y, Ling H (Dec. 2017) Image-based action recognition using hint-enhanced deep neural networks. Neurocomputing 267:475–488. https://doi.org/10.1016/j.neucom.2017.06.041
Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Ren Z, Zhang Q, Gao X, Hao P, Cheng J (Mar. 2020) Multi-modality learning for human action recognition. Multimed Tools Appl 80(11):16185–16203. https://doi.org/10.1007/S11042-019-08576-Z
Simonyan K, Zisserman A (2014) “Two-Stream Convolutional Networks for Action Recognition in Videos,” Adv Neural Inf Process Syst 27
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2015) “Rethinking the Inception Architecture for Computer Vision”, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2016-December, pp. 2818–2826, Accessed: Aug. 11, 2021. [Online]. Available: https://arxiv.org/abs/1512.00567v3.
Szegedy C et al. (2015) “Going deeper with convolutions”, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07–12-June-2015, pp. 1–9, https://doi.org/10.1109/CVPR.2015.7298594
Szegedy C, Ioffe S, Vanhoucke V, Alemi A (2016) “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” 31st AAAI Conf. Artif. Intell. AAAI 2017, pp. 4278–4284, Accessed: Aug. 11, 2021. [Online]. Available: https://arxiv.org/abs/1602.07261v2
Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2017) “A Closer Look at Spatiotemporal Convolutions for Action Recognition,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 6450–6459, Accessed: Aug. 13, 2021. [Online]. Available: https://arxiv.org/abs/1711.11248v3
Wang J, Liang S (2022) “Pose-Enhanced Relation Feature for Action Recognition in Still Images,” pp. 154–165, https://doi.org/10.1007/978-3-030-98358-1_13
Wang X, Qi C (Dec. 2019) Detecting action-relevant regions for action recognition using a three-stage saliency detection technique. Multimed Tools Appl 79(11):7413–7433. https://doi.org/10.1007/S11042-019-08535-8
Wang C, Yang H, Meinel C (2016) “Exploring multimodal video representation for action recognition,” Proc. Int. Jt. Conf. Neural Networks, vol. 2016-October, pp. 1924–1931, https://doi.org/10.1109/IJCNN.2016.7727435
Xin M, Wang S, Cheng J (2019) “Entanglement loss for context-based still image action recognition,” in Proceedings - IEEE International Conference on Multimedia and Expo, vol. 2019-July, pp. 1042–1047, https://doi.org/10.1109/ICME.2019.00183
Xu Y, Hou Z, Liang J, Chen C, Jia L, Song Y (May 2019) Action recognition using weighted fusion of depth images and skeleton’s key frames. Multimed Tools Appl 78(17):25063–25078. https://doi.org/10.1007/S11042-019-7593-5
Yan S, Smith JS, Lu W, Zhang B (Dec. 2018) Multibranch attention networks for action recognition in still images. IEEE Trans Cogn Dev Syst 10(4):1116–1125. https://doi.org/10.1109/TCDS.2017.2783944
Yao B, Jiang X, Khosla A, Lin AL, Guibas L, Fei-Fei L (2011) “Human action recognition by learning bases of action attributes and parts,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1331–1338, https://doi.org/10.1109/ICCV.2011.6126386
Zhang Y, Chu J, Leng L, Miao J (2020) Mask-Refined R-CNN: A Network for Refining Object Details in Instance Segmentation. Sensors (Basel) 20(4). https://doi.org/10.3390/S20041010
Zhao Z, Ma H, You S (2017) “Single Image Action Recognition Using Semantic Body Part Actions,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-October, pp. 3411–3419, https://doi.org/10.1109/ICCV.2017.367
Zheng Y, Zheng X, Lu X, Wu S (Nov. 2020) Spatial attention based visual semantic learning for action recognition in still images. Neurocomputing 413:383–396. https://doi.org/10.1016/J.NEUCOM.2020.07.016
Zhu Y et al. (2020) “A Comprehensive Study of Deep Video Action Recognition”, Accessed: Aug. 12, 2021. [Online]. Available: https://arxiv.org/abs/2012.06567v1.
Zoph B, Vasudevan V, Shlens J, Le QV (2017) “Learning Transferable Architectures for Scalable Image Recognition”, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 8697–8710, Accessed: Aug. 11, 2021. [Online]. Available: https://arxiv.org/abs/1707.07012v4.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ashrafi, S.S., Shokouhi, S.B. & Ayatollahi, A. Still image action recognition based on interactions between joints and objects. Multimed Tools Appl 82, 25945–25971 (2023). https://doi.org/10.1007/s11042-023-14350-z