A human activity recognition framework in videos using segmented human subject focus

  • Original article
  • Published in: The Visual Computer

Abstract

Automating tasks through human activity recognition in video data has become increasingly vital. Deep learning has yielded versatile activity recognition systems applicable to surveillance, healthcare analysis, sports, and human–computer interaction. Although many video-based activity recognition techniques have been proposed over the years, approaches that rely on RGB frames alone often prove less effective than multimodal methods that additionally exploit joint locations and depth maps. In response to this challenge, our paper introduces a competitive approach for identifying human activity in video frames. Leveraging a Convolutional Long Short-Term Memory (Conv-LSTM) network and a novel pre-processing step based on a human segmentation network, our method accentuates the human subjects in each frame using segmentation maps. These highlighted frames are then processed by Convolutional Neural Networks (CNNs) to learn feature vectors, without directly incorporating any modality beyond the RGB frames. The learned features are passed to Long Short-Term Memory (LSTM) units, which model the sequential structure of the video and draw meaningful inferences. The proposed methodology undergoes rigorous testing on three publicly available datasets: KARD, MSR Daily Activity, and SBU-Interactions. Our approach outperforms comparable state-of-the-art methods, achieving accuracy scores exceeding 98% on MSR Daily Activity and 99% on the KARD and SBU-Interactions datasets. In essence, our method not only provides a competitive solution for human activity recognition in video frames but also advances the field by integrating Conv-LSTM networks with an innovative pre-processing technique. The comprehensive evaluation on multiple datasets underlines the robustness and superior performance of the proposed approach.
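To make the pipeline described in the abstract concrete, the sketch below (a minimal illustration, not the authors' released code) wires the three stages together in Keras: frames are blended with human-segmentation masks so the subject is highlighted, a per-frame CNN backbone extracts feature vectors, and an LSTM models the frame sequence. The backbone (EfficientNetB0), clip length, input resolution, class count, and background-dimming factor are all illustrative assumptions rather than values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative hyperparameters; the paper's actual settings may differ.
NUM_FRAMES, H, W, NUM_CLASSES = 16, 224, 224, 18

def highlight_subjects(frames, masks, background_weight=0.3):
    """Emphasize human subjects: keep masked pixels, dim the background.

    frames: float32 tensor (T, H, W, 3); masks: segmentation maps (T, H, W, 1).
    """
    masks = tf.cast(masks > 0.5, tf.float32)          # binarize segmentation maps
    return frames * masks + background_weight * frames * (1.0 - masks)

def build_activity_model():
    # Per-frame feature extractor; an ImageNet-pretrained backbone is assumed.
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, pooling="avg", input_shape=(H, W, 3))

    video = layers.Input(shape=(NUM_FRAMES, H, W, 3), name="highlighted_frames")
    feats = layers.TimeDistributed(backbone)(video)   # -> (batch, T, feature_dim)
    temporal = layers.LSTM(256)(feats)                # sequence modelling over frames
    probs = layers.Dense(NUM_CLASSES, activation="softmax")(temporal)
    return models.Model(video, probs)

model = build_activity_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

In this reading of the method, the segmentation network serves purely as a pre-processing step: its masks are applied with highlight_subjects before clips reach the CNN–LSTM model, so the recognition network itself still consumes nothing but (highlighted) RGB frames.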

Data availability statement

Datasets used in this work are available in public repositories:

  • https://data.mendeley.com/datasets/k28dtm7tr6/1
  • https://sites.google.com/view/wanqingli/data-sets/msr-dailyactivity3d
  • https://cove.thecvf.com/datasets/57

Author information

Corresponding author

Correspondence to Dinesh Kumar Vishwakarma.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, S., Vishwakarma, D.K. & Puri, N.K. A human activity recognition framework in videos using segmented human subject focus. Vis Comput (2024). https://doi.org/10.1007/s00371-023-03256-4

