
Residual deep gated recurrent unit-based attention framework for human activity recognition by exploiting dilated features

  • Original article
  • Published in The Visual Computer

Abstract

Human activity recognition (HAR) in video streams has become a thriving research area in computer vision and pattern recognition. Recognizing activities in real-world video is demanding due to variations in motion and style and cluttered backgrounds. Current HAR approaches primarily apply pre-trained weights of various deep learning (DL) models for the visual description of frames during the learning phase, which impairs the assessment of feature discrepancies such as the separation between temporal and visual cues. To address this issue, this article introduces a residual deep gated recurrent unit (RD-GRU)-enabled attention framework built on a dilated convolutional neural network (DiCNN). The approach targets the salient information in each input video frame to recognize distinct activities. The DiCNN captures the crucial, unique features; within it, a skip connection updates the information so that deeper layers retain more knowledge than a shallow layer alone. These features are then fed into an attention module to capture additional high-level discriminative patterns and cues associated with actions. The attention mechanism is followed by an RD-GRU that learns long video sequences to further enhance performance. Accuracy, precision, recall, and F1-score are used to evaluate the introduced model on four diverse benchmark datasets: UCF11, UCF Sports, JHMDB, and THUMOS, on which it achieves accuracies of 98.54%, 99.31%, 82.47%, and 95.23%, respectively. These results demonstrate the validity of the proposed work compared with state-of-the-art (SOTA) methods.
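The abstract describes a three-stage pipeline: frame-wise feature extraction with a dilated CNN whose skip connection preserves shallow-layer information, an attention module that re-weights the frame features, and a residual GRU that models the long video sequence before classification. The following is a minimal PyTorch sketch of that data flow; every module name, layer size, and hyperparameter here is an illustrative assumption, not the authors' implementation.

# Minimal sketch of the pipeline described in the abstract: a dilated CNN with a
# residual skip connection per frame, an attention module over the frame features,
# and a residual GRU over the video sequence. All sizes are illustrative assumptions.
import torch
import torch.nn as nn


class DilatedBlock(nn.Module):
    """Dilated conv block with a skip connection, applied to a single frame."""

    def __init__(self, in_ch: int, out_ch: int, dilation: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # 1x1 projection so the skip path matches the output channels.
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.conv(x) + self.skip(x)


class HARSketch(nn.Module):
    """Frame-wise DiCNN -> temporal attention -> residual GRU -> classifier."""

    def __init__(self, num_classes: int, feat_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            DilatedBlock(3, 32, dilation=1),
            DilatedBlock(32, 64, dilation=2),
            DilatedBlock(64, feat_dim, dilation=4),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, feat_dim, 1, 1)
        )
        # Simple additive attention scoring each time step.
        self.attn = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        # Projection so the GRU output can be added residually to its input.
        self.res_proj = nn.Linear(feat_dim, hidden)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, video):  # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                       # (B*T, 3, H, W)
        feats = self.backbone(frames).flatten(1)           # (B*T, feat_dim)
        feats = feats.view(b, t, -1)                       # (B, T, feat_dim)
        weights = torch.softmax(self.attn(feats), dim=1)   # (B, T, 1)
        attended = feats * weights                         # re-weight frame features
        out, _ = self.gru(attended)                        # (B, T, hidden)
        out = out + self.res_proj(attended)                # residual connection
        return self.fc(out.mean(dim=1))                    # average over time


# Usage example: a batch of 2 clips, 8 frames each, 112x112 RGB.
model = HARSketch(num_classes=11)
logits = model(torch.randn(2, 8, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 11])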


Data availability

The data supporting the findings of this manuscript are available at https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php for UCF11, https://www.crcv.ucf.edu/data/UCF_Sports_Action.php for UCF Sports Action, and http://jhmdb.is.tue.mpg.de/dataset for JHMDB.


Funding

No funding was received from any source for this research.

Author information


Contributions

AP designed the experimental setup, conducted the experiments, and performed the statistical analysis, whereas PiK wrote the abstract and literature survey. AP and PiK wrote the first draft of the manuscript and contributed to the investigation and framing of the results. PiK edited the first draft of this paper. Both authors participated in reviewing and approving the final version of the manuscript. Ajeet Pandey and Piyush Kumar contributed equally to this work.

Corresponding author

Correspondence to Ajeet Pandey.

Ethics declarations

Conflict of interest

Both authors declare that they have no conflict of interest concerning the research, authorship, and/or publication of this article.

Ethical approval

This article does not contain any studies with human participants performed by the authors. Therefore, this section does not apply to this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Pandey, A., Kumar, P. Residual deep gated recurrent unit-based attention framework for human activity recognition by exploiting dilated features. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03266-w

