A human activity recognition framework in videos using segmented human subject focus

  • Original article
  • Published in: The Visual Computer

Abstract

Automating tasks through human activity recognition in video data has become increasingly vital. Deep learning has yielded versatile activity recognition systems applicable to surveillance, healthcare analysis, sports, and human–computer interaction. Although many video-based activity recognition techniques have been proposed over the years, approaches that rely on RGB frames alone often prove less effective than multimodal methods that additionally exploit joint locations and depth maps. In response to this challenge, our paper introduces a competitive approach for identifying human activity in video frames. Leveraging a Convolutional Long Short-Term Memory (Conv-LSTM) network and a novel pre-processing step based on a human segmentation network, our method accentuates the human subjects in each frame using segmentation maps. These highlighted frames are then processed by Convolutional Neural Networks (CNNs) to learn feature vectors, without directly incorporating any modality beyond the RGB frames. The learned features are passed to Long Short-Term Memory (LSTM) units, which model the sequential structure of the video and draw meaningful inferences. The proposed methodology undergoes rigorous testing on three publicly available datasets: KARD, MSR Daily Activity, and SBU-Interactions. Our approach outperforms comparable state-of-the-art methods, achieving accuracy scores exceeding 98% on MSR Daily Activity and 99% on the KARD and SBU-Interactions datasets. In essence, our method not only provides a competitive solution for human activity recognition in video frames but also advances the field by integrating Conv-LSTM networks with an innovative pre-processing technique. The comprehensive evaluation on multiple datasets underlines the robustness and superior performance of the proposed approach.
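To make the pipeline described in the abstract concrete, the sketch below (a minimal illustration, not the authors' released code) wires the three stages together in Keras: frames are blended with human-segmentation masks so the subject is highlighted, a per-frame CNN backbone extracts feature vectors, and an LSTM models the frame sequence. The backbone (EfficientNetB0), clip length, input resolution, class count, and background-dimming factor are all illustrative assumptions rather than values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative hyperparameters; the paper's actual settings may differ.
NUM_FRAMES, H, W, NUM_CLASSES = 16, 224, 224, 18

def highlight_subjects(frames, masks, background_weight=0.3):
    """Emphasize human subjects: keep masked pixels, dim the background.

    frames: float32 tensor (T, H, W, 3); masks: segmentation maps (T, H, W, 1).
    """
    masks = tf.cast(masks > 0.5, tf.float32)          # binarize segmentation maps
    return frames * masks + background_weight * frames * (1.0 - masks)

def build_activity_model():
    # Per-frame feature extractor; an ImageNet-pretrained backbone is assumed.
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, pooling="avg", input_shape=(H, W, 3))

    video = layers.Input(shape=(NUM_FRAMES, H, W, 3), name="highlighted_frames")
    feats = layers.TimeDistributed(backbone)(video)   # -> (batch, T, feature_dim)
    temporal = layers.LSTM(256)(feats)                # sequence modelling over frames
    probs = layers.Dense(NUM_CLASSES, activation="softmax")(temporal)
    return models.Model(video, probs)

model = build_activity_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

In this reading of the method, the segmentation network serves purely as a pre-processing step: its masks are applied with highlight_subjects before clips reach the CNN–LSTM model, so the recognition network itself still consumes nothing but (highlighted) RGB frames.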

Data availability statement

Datasets used in this work are available in public repositories:

  • https://data.mendeley.com/datasets/k28dtm7tr6/1
  • https://sites.google.com/view/wanqingli/data-sets/msr-dailyactivity3d
  • https://cove.thecvf.com/datasets/57

Author information

Corresponding author

Correspondence to Dinesh Kumar Vishwakarma.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, S., Vishwakarma, D.K. & Puri, N.K. A human activity recognition framework in videos using segmented human subject focus. Vis Comput (2024). https://doi.org/10.1007/s00371-023-03256-4

