
Spatiotemporal Self-Attention Mechanism Driven by 3D Pose to Guide RGB Cues for Daily Living Human Activity Recognition

  • Regular paper
  • Published:
Journal of Intelligent & Robotic Systems

Abstract

The field of human activity recognition is evolving at a rapid pace. Over the last two decades, numerous approaches have been proposed to recognize human activities in generic videos, yet they remain limited for daily living videos, whose characteristics make them considerably more complex to handle. Such videos present several challenges, including camera view variations, temporal information representation, low inter-class variation between similar actions, fine-grained action representation and high intra-class variation. In general, recognizing an action requires extracting spatial and temporal information from the videos. To extract temporal information, many works based on LSTM networks have been published; although they have proven their potential in this field, they fail to model long-range temporal dependencies in very long video sequences. We therefore turn to Transformer networks and propose a new pose-guided self-attention mechanism combined with a 3D convolutional neural network (3D CNN) through a Bilinear Pooling Attention (BPA) module, which allows the spatio-temporal skeleton features to recalibrate the RGB features for Daily Living Activity (DLA) recognition. In addition, most existing datasets are static and show little variation in motion over time, so we evaluate our approach on the large-scale NTU RGB+D dataset, which contains RGB-D human actions that evolve much more over time. Experimental results demonstrate that our Spatial-Temporal Self-Attention mechanism combined with a 3D CNN through the BPA module (ST-SA-BPA) outperforms state-of-the-art methods.
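
A minimal sketch of the fusion idea described above, assuming pooled feature vectors from the two streams; the layer sizes, tensor shapes and sigmoid gating below are illustrative assumptions (PyTorch), not the authors' exact BPA module:

    # Illustrative sketch only: bilinear-pooling-attention style fusion in which
    # spatio-temporal skeleton features recalibrate 3D-CNN RGB features channel-wise.
    import torch
    import torch.nn as nn

    class BilinearPoolingAttention(nn.Module):
        def __init__(self, rgb_dim: int, pose_dim: int, hidden_dim: int = 256):
            super().__init__()
            self.proj_rgb = nn.Linear(rgb_dim, hidden_dim)    # project pooled 3D-CNN features
            self.proj_pose = nn.Linear(pose_dim, hidden_dim)  # project pooled pose-attention features
            self.excite = nn.Linear(hidden_dim, rgb_dim)      # one gate per RGB feature channel

        def forward(self, rgb_feat, pose_feat):
            # rgb_feat: (B, rgb_dim), pose_feat: (B, pose_dim)
            joint = self.proj_rgb(rgb_feat) * self.proj_pose(pose_feat)  # low-rank bilinear interaction
            gates = torch.sigmoid(self.excite(joint))                    # attention weights in (0, 1)
            return rgb_feat * gates                                      # pose-guided recalibration of RGB cues

    # Example: recalibrate pooled RGB features with pooled skeleton features.
    bpa = BilinearPoolingAttention(rgb_dim=512, pose_dim=256)
    rgb = torch.randn(4, 512)   # e.g. globally pooled 3D-CNN output
    pose = torch.randn(4, 256)  # e.g. pooled spatio-temporal self-attention output
    fused = bpa(rgb, pose)      # (4, 512) RGB features reweighted by pose cues

The recalibrated features would then feed a classifier; the key design choice is that the attention weights are computed from the joint bilinear interaction of both modalities rather than from the RGB stream alone.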


Data Availability

All data analysed during this study are available from the NTU RGB+D dataset [49].

References

  1. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308 (2017)

  2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016)

  3. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500 (2017)

  4. Donahue, J., Hendricks, A. L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634 (2015)

  5. Veeriah, V., Zhuang, N., Qi, G.J.: Differential recurrent neural networks for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision, 4041–4049 (2015)

  6. Kim, J.H., Hong, G.S., Kim, B.G., Dogra, D.P.: DeepGesture: Deep learning-based gesture recognition scheme using motion sensors. Displays 55, 38–45 (2018)

  7. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv:1409.2329 (2014)

  8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., ..., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  9. Faugeras, F., Naccache, L.: Dissociating temporal attention from spatial attention and motor response preparation: a high-density eeg study. NeuroImage 124, 947–957 (2016)

  10. Qiu, S., Zhao, H., Jiang, N., Wang, Z., Liu, L., An, Y., Zhao, H., Miao, X., Liu, R., Fortino, G.: Multi-sensor information fusion based on machine learning for real applications in human activity recognition: State-of-the-art and research challenges. Inf. Fusion 80, 241–265 (2022)

  11. Li, Y., Yang, G., Su, Z., Li, S., Wang, Y.: Human activity recognition based on multienvironment sensor data. Inf. Fusion. 91, 47–63 (2023)

  12. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, 3551–3558 (2013)

  13. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 31, 3551–3558 (2017)

  14. Chéron, G., Laptev, I., Schmid, C.: P-cnn: Pose-based cnn features for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision, 3218–3226 (2015)

  15. Basly, H., Ouarda, W., Sayadi, F.E., Ouni, B., Alimi, A.M.: Cnn-svm learning approach based human activity recognition. In: Proceedings of the International Conference on Image and Signal Processing, Springer, 271–281 (2020)

  16. Basly, H., Ouarda, W., Sayadi, F.E., Ouni, B., Alimi, A.M.: Lahar-cnn: human activity recognition from one image using convolutional neural network learning approach. Int J Biomet 13(4), 385–408 (2021)

  17. Basly, H., Ouarda, W., Sayadi, F.E., Ouni, B., Alimi, A.M.: Dtr-har: deep temporal residual representation for human activity recognition. Vis Comput 38(3), 993–1013 (2022)

  18. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 27 (2014)

  19. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 53(1), 221–231 (2012)

  20. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, 4489–4497 (2015)

  21. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009). IEEE

  22. Sigurdsson, G.A., Russakovsky, O., Gupta, A.: What actions are needed for understanding human actions in videos? In: Proceedings of the IEEE International Conference on Computer Vision, 2137–2146 (2017)

  23. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ..., Zisserman, A.: The kinetics human action video dataset. arXiv:1705.06950 (2017)

  24. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv:1609.02907 (2016)

  25. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI Conference on Artificial Intelligence (2018)

  26. Chaolong, L., Zhen, C., Wenming, Z., Chunyan, X., Jian, Y.: Spatio-temporal graph convolution for skeleton based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

  27. Li, B., Li, X., Zhang, Z., Wu, F.: Spatio-temporal graph routing for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 8561–8568 (2019)

  28. Gao, X., Hu, W., Tang, J., Liu, J., Guo, Z.: Optimized skeleton-based action recognition via sparsified graph regression. In: Proceedings of the 27th ACM International Conference on Multimedia, 601–610 (2019)

  29. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12026–12035 (2019)

  30. Li, M., Chen, S., Chen, Y., Zhang, X., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3595–3603 (2019)

  31. Tang, Y., Tian, Y., Lu, J., Li, P., Zhou, J.: Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5323–5332 (2018)

  32. Peng, W., Hong, X., Chen, H., Zhao, G.: Learning graph convolutional network for skeleton-based human action recognition by neural searching. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2669–2676 (2020)

  33. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv:1611.01578 (2016)

  34. Yang, G., Liu, S., Li, Y., He, L.: Short-term prediction method of blood glucose based on temporal multi-head attention mechanism for diabetic patients. Biomed. Signal Process. Control 82, 104552 (2023)

  35. Wang, Y., Yang, G., Li, S., Li, Y., He, L., Liu, D.: Arrhythmia classification algorithm based on multi-head self-attention mechanism. Biomed. Signal Process. Control 79, 104206 (2023)

  36. Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)

  37. Sharma, S., Kiros, R., Salakhutdinov, R.: Action recognition using visual attention arXiv:1511.04119 (2015)

  38. Girdhar, R., Ramanan, D.: Attentional pooling for action recognition. Adv. Neural Inf. Process. Syst. 30 (2017)

  39. Long, X., Gan, C., De Melo, G., Wu, J., Liu, X., Wen, S.: Attention clusters: Purely attention based local feature integration for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7834–7843 (2018)

  40. Baradel, F., Wolf, C., Mille, J., Taylor, G.W.: Glimpse clouds: Human activity recognition from unstructured feature points. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 469–478 (2018)

  41. Chen, K., Yao, L., Zhang, D., Wang, X., Chang, X., Nie, F.: A semisupervised recurrent convolutional attention model for human activity recognition. IEEE Trans. Neural Netw. Learn. Syst. 31(5), 1747–1756 (2019)

  42. Araei, S., Nadian-Ghomsheh, A.: Spatio-temporal 3d action recognition with hierarchical self-attention mechanism. In: 26th International Computer Conference, Computer Society of Iran (CSICC), 1–5 (2021). IEEE

  43. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803 (2018)

  44. Girdhar, R., Carreira, J., Doersch, C., Zisserman, A.: Video action transformer network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 244–253 (2019)

  45. Plizzari, C., Cannici, M., Matteucci, M.: Spatial temporal transformer network for skeleton-based action recognition. In: International Conference on Pattern Recognition, Springer 694–701 (2021)

  46. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, PMLR 448–456 (2015)

  47. Nguyen, T.Q., Salazar, J.: Transformers without tears: Improving the normalization of self-attention arXiv:1910.05895 (2019)

  48. Weiyao, X., Muqing, W., Min, Z., Ting, X.: Fusion of skeleton and rgb features for rgb-d human action recognition. IEEE Sens J 21(17), 19157–19164 (2021)

  49. Joze, H.R.V., Shaban, A., Iuzzolino, M.L., Koishida, K.: Mmtm: Multimodal transfer module for cnn fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13289–13299 (2020)

  50. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization arXiv:1412.6980 (2014)

  51. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7912–7921 (2019)

  52. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6546–6555 (2018)

  53. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 143–152 (2020)

  54. Lee, I., Kim, D., Kang, S., Lee, S.: Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1012–1020 (2017)

  55. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, 2117–2126 (2017)

  56. Baradel, F., Wolf, C., Mille, J.: Human action recognition: Pose-based attention draws focus to hands. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, 604–613 (2017)

  57. Baradel, F., Wolf, C., Mille, J.: Human activity recognition with pose-driven attention to rgb. In: BMVC 2018-29th British Machine Vision Conference, 1–14 (2018)

  58. Liu, G., Qian, J., Wen, F., Zhu, X., Ying, R., Liu, P.: Action recognition based on 3d skeleton and rgb frame fusion. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE 258–264 (2019)

  59. Baradel, F., Wolf, C., Mille, J., Taylor, G.W.: Glimpse clouds: Human activity recognition from unstructured feature points. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 469–478 (2018)

  60. Shi, F., Lee, C., Qiu, L., Zhao, Y., Shen, T., Muralidhar, S., ..., Narayanan, V.: Star: Sparse transformer-based action recognition. arXiv:2107.07089 (2021)

  61. Li, C., Zhong, Q., Xie, D., Pu, S.: Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation arXiv:1804.06055 (2018)

  62. Cho, S., Maqbool, M., Liu, F., Foroosh, H.: Self-attention network for skeleton-based human action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 635–644 (2020)

  63. Sun, Y., Shen, Y., Ma, L.: Msst-rt: Multi-stream spatial-temporal relative transformer for skeleton-based action recognition. Sensors 21(16), 5339 (2021)

  64. Zhang, Z., Wang, Z., Zhuang, S., Huang, F.: Structure-feature fusion adaptive graph convolutional networks for skeleton-based action recognition. IEEE Access 8, 228108–228117 (2020)

  65. Liu, M., Yuan, J.: Recognizing human actions as the evolution of pose estimation maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1159–1168 (2018)

  66. Das, S., Dai, R., Koperski, M., Minciullo, L., Garattoni, L., Bremond, F., Francesca, G.: Toyota smarthome: Real-world activities of daily living. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 833–842 (2019)

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Hend Basly, Mohamed Amine Zayene and Fatma Ezzahra Sayadi. The first draft of the manuscript was written by Hend Basly and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Hend Basly or Mohamed Amine Zayene.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Consent to Participate

Informed consent was obtained from all individual participants included in the study.

Consent for Publication

The participants have consented to the submission of the research manuscript to the journal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Basly, H., Zayene, M.A. & Sayadi, F.E. Spatiotemporal Self-Attention Mechanism Driven by 3D Pose to Guide RGB Cues for Daily Living Human Activity Recognition. J Intell Robot Syst 109, 2 (2023). https://doi.org/10.1007/s10846-023-01926-y

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10846-023-01926-y

Keywords
