
Spatiotemporal Self-Attention Mechanism Driven by 3D Pose to Guide RGB Cues for Daily Living Human Activity Recognition

  • Regular paper
  • Published:
Journal of Intelligent & Robotic Systems

Abstract

The field of human activity recognition is evolving at a rapid pace. Over the last two decades, numerous approaches have been proposed to recognize human activities in generic videos, yet they remain limited for daily living videos, whose characteristics make them considerably more complex to handle. Such videos present several challenges, including camera view variations, temporal information representation, low inter-class variation between similar actions, fine-grained action representation and high intra-class variation. In general, recognizing an action requires extracting spatial and temporal information from the videos. To extract temporal information, many works based on LSTM networks have been published; although they have proven their potential in this field, they fail to model long-range temporal dependencies in very long video sequences. We therefore turn to Transformer networks and propose a new pose-guided self-attention mechanism combined with a 3D convolutional neural network (3D CNN) through a Bilinear Pooling Attention (BPA) module, which allows the spatio-temporal skeleton features to recalibrate the RGB features for Daily Living Activity (DLA) recognition. In addition, most existing datasets are static and show little variation in motion over time, so we evaluate our approach on the large-scale NTU RGB+D dataset, which contains RGB-D human actions that evolve much more over time. Experimental results demonstrate that our Spatial-Temporal Self-Attention mechanism combined with a 3D CNN through the BPA module (ST-SA-BPA) outperforms state-of-the-art methods.
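
A minimal sketch of the fusion idea described above, assuming pooled feature vectors from the two streams; the layer sizes, tensor shapes and sigmoid gating below are illustrative assumptions (PyTorch), not the authors' exact BPA module:

    # Illustrative sketch only: bilinear-pooling-attention style fusion in which
    # spatio-temporal skeleton features recalibrate 3D-CNN RGB features channel-wise.
    import torch
    import torch.nn as nn

    class BilinearPoolingAttention(nn.Module):
        def __init__(self, rgb_dim: int, pose_dim: int, hidden_dim: int = 256):
            super().__init__()
            self.proj_rgb = nn.Linear(rgb_dim, hidden_dim)    # project pooled 3D-CNN features
            self.proj_pose = nn.Linear(pose_dim, hidden_dim)  # project pooled pose-attention features
            self.excite = nn.Linear(hidden_dim, rgb_dim)      # one gate per RGB feature channel

        def forward(self, rgb_feat, pose_feat):
            # rgb_feat: (B, rgb_dim), pose_feat: (B, pose_dim)
            joint = self.proj_rgb(rgb_feat) * self.proj_pose(pose_feat)  # low-rank bilinear interaction
            gates = torch.sigmoid(self.excite(joint))                    # attention weights in (0, 1)
            return rgb_feat * gates                                      # pose-guided recalibration of RGB cues

    # Example: recalibrate pooled RGB features with pooled skeleton features.
    bpa = BilinearPoolingAttention(rgb_dim=512, pose_dim=256)
    rgb = torch.randn(4, 512)   # e.g. globally pooled 3D-CNN output
    pose = torch.randn(4, 256)  # e.g. pooled spatio-temporal self-attention output
    fused = bpa(rgb, pose)      # (4, 512) RGB features reweighted by pose cues

The recalibrated features would then feed a classifier; the key design choice is that the attention weights are computed from the joint bilinear interaction of both modalities rather than from the RGB stream alone.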


Data Availability

All data analysed during this study are available from the NTU RGB+D dataset [49].

References

  1. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308 (2017)

  2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016)

  3. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500 (2017)

  4. Donahue, J., Hendricks, A. L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634 (2015)

  5. Veeriah, V., Zhuang, N., Qi, G.J.: Differential recurrent neural networks for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision, 4041–4049 (2015)

  6. Kim, J.H., Hong, G.S., Kim, B.G., Dogra, D.P.: DeepGesture: Deep learning-based gesture recognition scheme using motion sensors. Displays 55, 38–45 (2018)

  7. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv:1409.2329 (2014)

  8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., ..., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  9. Faugeras, F., Naccache, L.: Dissociating temporal attention from spatial attention and motor response preparation: a high-density eeg study. NeuroImage 124, 947–957 (2016)

  10. Qiu, S., Zhao, H., Jiang, N., Wang, Z., Liu, L., An, Y., Zhao, H., Miao, X., Liu, R., Fortino, G.: Multi-sensor information fusion based on machine learning for real applications in human activity recognition: State-of-the-art and research challenges. Inf. Fusion 80, 241–265 (2022)

  11. Li, Y., Yang, G., Su, Z., Li, S., Wang, Y.: Human activity recognition based on multienvironment sensor data. Inf. Fusion. 91, 47–63 (2023)

  12. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, 3551–3558 (2013)

  13. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 31, 3551–3558 (2017)

  14. Chéron, G., Laptev, I., Schmid, C.: P-cnn: Pose-based cnn features for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision, 3218–3226 (2015)

  15. Basly, H., Ouarda, W., Sayadi, F.E., Ouni, B., Alimi, A.M.: Cnn-svm learning approach based human activity recognition. In: Proceedings of the International Conference on Image and Signal Processing, Springer, 271–281 (2020)

  16. Basly, H., Ouarda, W., Sayadi, F.E., Ouni, B., Alimi, A.M.: Lahar-cnn: human activity recognition from one image using convolutional neural network learning approach. Int J Biomet 13(4), 385–408 (2021)

  17. Basly, H., Ouarda, W., Sayadi, F.E., Ouni, B., Alimi, A.M.: Dtr-har: deep temporal residual representation for human activity recognition. Vis Comput 38(3), 993–1013 (2022)

  18. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 27 (2014)

  19. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 53(1), 221–231 (2012)

  20. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, 4489–4497 (2015)

  21. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009). IEEE

  22. Sigurdsson, G.A., Russakovsky, O., Gupta, A.: What actions are needed for understanding human actions in videos? In: Proceedings of the IEEE International Conference on Computer Vision, 2137–2146 (2017)

  23. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ..., Zisserman, A.: The kinetics human action video dataset. arXiv:1705.06950 (2017)

  24. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv:1609.02907 (2016)

  25. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI Conference on Artificial Intelligence (2018)

  26. Chaolong, L., Zhen, C., Wenming, Z., Chunyan, X., Jian, Y.: Spatio-temporal graph convolution for skeleton based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

  27. Li, B., Li, X., Zhang, Z., Wu, F.: Spatio-temporal graph routing for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 8561–8568 (2019)

  28. Gao, X., Hu, W., Tang, J., Liu, J., Guo, Z.: Optimized skeleton-based action recognition via sparsified graph regression. In: Proceedings of the 27th ACM International Conference on Multimedia, 601–610 (2019)

  29. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12026–12035 (2019)

  30. Li, M., Chen, S., Chen, Y., Zhang, X., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3595–3603 (2019)

  31. Tang, Y., Tian, Y., Lu, J., Li, P., Zhou, J.: Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5323–5332 (2018)

  32. Peng, W., Hong, X., Chen, H., Zhao, G.: Learning graph convolutional network for skeleton-based human action recognition by neural searching. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2669–2676 (2020)

  33. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv:1611.01578 (2016)

  34. Yang, G., Liu, S., Li, Y., He, L.: Short-term prediction method of blood glucose based on temporal multi-head attention mechanism for diabetic patients. Biomed. Signal Process. Control 82, 104552 (2023)

  35. Wang, Y., Yang, G., Li, S., Li, Y., He, L., Liu, D.: Arrhythmia classification algorithm based on multi-head self-attention mechanism. Biomed. Signal Process. Control 79, 104206 (2023)

  36. Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)

  37. Sharma, S., Kiros, R., Salakhutdinov, R.: Action recognition using visual attention arXiv:1511.04119 (2015)

  38. Girdhar, R., Ramanan, D.: Attentional pooling for action recognition. Adv. Neural Inf. Process. Syst. 30 (2017)

  39. Long, X., Gan, C., De Melo, G., Wu, J., Liu, X., Wen, S.: Attention clusters: Purely attention based local feature integration for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7834–7843 (2018)

  40. Baradel, F., Wolf, C., Mille, J., Taylor, G.W.: Glimpse clouds: Human activity recognition from unstructured feature points. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 469–478 (2018)

  41. Chen, K., Yao, L., Zhang, D., Wang, X., Chang, X., Nie, F.: A semisupervised recurrent convolutional attention model for human activity recognition. IEEE Trans. Neural Netw. Learn. Syst. 31(5), 1747–1756 (2019)

  42. Araei, S., Nadian-Ghomsheh, A.: Spatio-temporal 3d action recognition with hierarchical self-attention mechanism. In: 26th International Computer Conference, Computer Society of Iran (CSICC), 1–5 (2021). IEEE

  43. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803 (2018)

  44. Girdhar, R., Carreira, J., Doersch, C., Zisserman, A.: Video action transformer network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 244–253 (2019)

  45. Plizzari, C., Cannici, M., Matteucci, M.: Spatial temporal transformer network for skeleton-based action recognition. In: International Conference on Pattern Recognition, Springer 694–701 (2021)

  46. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, PMLR 448–456 (2015)

  47. Nguyen, T.Q., Salazar, J.: Transformers without tears: Improving the normalization of self-attention arXiv:1910.05895 (2019)

  48. Weiyao, X., Muqing, W., Min, Z., Ting, X.: Fusion of skeleton and rgb features for rgb-d human action recognition. IEEE Sens J 21(17), 19157–19164 (2021)

  49. Joze, H.R.V., Shaban, A., Iuzzolino, M.L., Koishida, K.: Mmtm: Multimodal transfer module for cnn fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13289–13299 (2020)

  50. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization arXiv:1412.6980 (2014)

  51. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7912–7921 (2019)

  52. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6546–6555 (2018)

  53. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 143–152 (2020)

  54. Lee, I., Kim, D., Kang, S., Lee, S.: Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1012–1020 (2017)

  55. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, 2117–2126 (2017)

  56. Baradel, F., Wolf, C., Mille, J.: Human action recognition: Pose-based attention draws focus to hands. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, 604–613 (2017)

  57. Baradel, F., Wolf, C., Mille, J.: Human activity recognition with pose-driven attention to rgb. In: BMVC 2018-29th British Machine Vision Conference, 1–14 (2018)

  58. Liu, G., Qian, J., Wen, F., Zhu, X., Ying, R., Liu, P.: Action recognition based on 3d skeleton and rgb frame fusion. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE 258–264 (2019)

  59. Baradel, F., Wolf, C., Mille, J., Taylor, G.W.: Glimpse clouds: Human activity recognition from unstructured feature points. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 469–478 (2018)

  60. Shi, F., Lee, C., Qiu, L., Zhao, Y., Shen, T., Muralidhar, S., ..., Narayanan, V.: Star: Sparse transformer-based action recognition. arXiv:2107.07089 (2021)

  61. Li, C., Zhong, Q., Xie, D., Pu, S.: Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation arXiv:1804.06055 (2018)

  62. Cho, S., Maqbool, M., Liu, F., Foroosh, H.: Self-attention network for skeleton-based human action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 635–644 (2020)

  63. Sun, Y., Shen, Y., Ma, L.: Msst-rt: Multi-stream spatial-temporal relative transformer for skeleton-based action recognition. Sensors 21(16), 5339 (2021)

  64. Zhang, Z., Wang, Z., Zhuang, S., Huang, F.: Structure-feature fusion adaptive graph convolutional networks for skeleton-based action recognition. IEEE Access 8, 228108–228117 (2020)

  65. Liu, M., Yuan, J.: Recognizing human actions as the evolution of pose estimation maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1159–1168 (2018)

  66. Das, S., Dai, R., Koperski, M., Minciullo, L., Garattoni, L., Bremond, F., Francesca, G.: Toyota smarthome: Real-world activities of daily living. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 833–842 (2019)

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Hend Basly, Mohamed Amine Zayene and Fatma Ezzahra Sayadi. The first draft of the manuscript was written by Hend Basly and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Hend Basly or Mohamed Amine Zayene.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Consent to Participate

Informed consent was obtained from all individual participants included in the study.

Consent for Publication

The participants have consented to the submission of the research manuscript to the journal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Basly, H., Zayene, M.A. & Sayadi, F.E. Spatiotemporal Self-Attention Mechanism Driven by 3D Pose to Guide RGB Cues for Daily Living Human Activity Recognition. J Intell Robot Syst 109, 2 (2023). https://doi.org/10.1007/s10846-023-01926-y

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10846-023-01926-y

Keywords
