Abstract
Human segmentation and tracking (HS-T) in video often rely on person detection results. In turn, 3D human pose estimation (3D-HPE) and human activity recognition (HAR) often use human segmentation results to reduce data storage and computation time. Recent advances in deep learning, especially Convolutional Neural Networks (CNNs), have produced excellent results on these related tasks. Consequently, they can be used to build many practical applications, such as sports analysis, sports scoring, health protection, teaching, and the preservation of traditional martial arts. In this paper, we survey the relevant studies, methods, datasets, and results for HS-T, 3D-HPE, and HAR. We also analyze person detection results in depth, since they affect the results of human segmentation and human tracking. The survey is detailed down to the source code paths of the methods. The MADS (Martial Arts, Dancing, and Sports) dataset comprises fast and complex activities and was published for the task of human pose estimation. However, before the human pose can be determined, the person must be detected as a segment in the video, particularly because the 3D human pose annotation data differ from the point cloud data generated from RGB-D images. Therefore, we have also prepared 2D human pose annotation data for 28k images, which are used to create 3D human pose annotation and action labeling data. Moreover, we evaluated many recently published deep learning methods on the MADS dataset for human segmentation (Mask R-CNN, PointRend, TridentNet, TensorMask, and CenterMask) and tracking, 3D-HPE (RepNet, MediaPipe Pose, Lifting from the Deep, and V2V-PoseNet), and HAR (ST-GCN, DD-Net, and PA-GesGCN) in video. All data and published results are available.
Notes
https://neurohive.io/en/popular-networks/vgg16/, [accessed on 20 May 2021]
https://github.com/rbgirshick/fast-rcnn, [accessed on 25 May 2021]
https://github.com/AlexeyAB/darknet, [accessed in June 2021]
https://github.com/weiliu89/caffe/tree/ssd, [accessed on 12 June 2021]
https://github.com/matterport/Mask_RCNN, [accessed on 14 June 2021]
https://github.com/facebookresearch/detectron2, [accessed on 14 June 2021]
https://github.com/facebookresearch/detectron2/tree/master/projects/DeepLab, [accessed on 12 June 2021]
https://github.com/facebookresearch/detectron2/tree/master/projects/DensePose, [accessed on 12 June 2021]
https://github.com/facebookresearch/detectron2/tree/master/projects/Panoptic-DeepLab, [accessed on 14 June 2021]
https://github.com/facebookresearch/detectron2/tree/master/projects/PointRend, [accessed on 14 June 2021]
https://github.com/facebookresearch/detectron2/tree/master/projects/TensorMask, [accessed on 20 June 2021]
https://github.com/facebookresearch/detectron2/tree/master/projects/TridentNet, [accessed on 15 June 2021]
https://github.com/youngwanLEE/CenterMask, [accessed on 16 June 2021]
https://github.com/scnuhealthy/Tensorflow_PersonLab, [accessed on 16 June 2021]
http://host.robots.ox.ac.uk/pascal/VOC/voc2007/, [accessed on 19 June 2021]
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/, [accessed on 18 June 2021]
https://github.com/JaviLaplaza/Pytorch-Siamese, [accessed on 20 June 2021]
http://web.archive.org/web/20110827170646/http://kspace.cdvp.dcu.ie/public/interactive-segmentation/index.html, [accessed on 18 April 2021]
https://drive.google.com/file/d/1Ssob496MJMUy3vAiXkC_ChKbp4gx7OGL/view?usp=sharing, [accessed on 18 July 2021]
https://github.com/duonglong289/detectron2, [accessed on 10 June 2021]
https://github.com/duonglong289/detectron2/tree/master/projects/PointRend, [accessed on 15 June 2021]
https://github.com/duonglong289/detectron2/tree/master/projects/TridentNet, [accessed on 16 June 2021]
https://github.com/duonglong289/detectron2/tree/master/projects/TensorMask, [accessed on 16 June 2021]
https://github.com/duonglong289/centermask2, [accessed on 16 June 2021]
References
Allaya N, Khabir A, Sallemi-Boudawara T, Sellami N, Daoud J, Ghorbel A, Frikha M, Gargouri A, Mokdad-Gargouri R, Ayadi W (2010) Action recognition based on a bag of 3D point. In: 2010 IEEE computer society conference on computer vision and pattern recognition - workshops, vol 36, pp 3807–3814. https://doi.org/10.1007/s13277-014-3022-6
Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2d human pose estimation new benchmark and state-of-the-art analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Bazarevsky V, Zhang F (2020) BlazePose : on-device real-time body pose tracking. arXiv:2006.10204
Bewley A, Ge Z, Ott L, Ramos F, Upcroft B (2016) Simple online and realtime tracking. In: 2016 IEEE international conference on image processing (ICIP), pp 3464–3468. https://doi.org/10.1109/ICIP.2016.7533003https://doi.org/10.1109/ICIP.2016.7533003
Bochkovskiy A, Wang CY, Liao HYM (2020) YOLOv4: optimal speed and accuracy of object detection
Burrus N (2011) Kinect calibration. http://nicolas.burrus.name/index.php/Research/KinectCalibration. Accessed 05 April 2021
Chahyati D, Fanany MI, Arymurthy AM (2017) Tracking people by detection using cnn features. In: Procedia computer science, vol 124, pp 167–172. Elsevier BV, https://doi.org/10.1016/j.procs.2017.12.143https://doi.org/10.1016/j.procs.2017.12.143
Chen X, Girshick R, He K, Dollár P (2019) Tensormask: a foundation for dense object segmentation
Chen W, Jiang Z, Ni HG, Fall X (2020) Detection based on key points of of human-skeleton using openpose. Symmetry
Chen X, Lin KY, Liu W, Qian C, Lin L (2019) Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 10,887–10,896. https://doi.org/10.1109/CVPR.2019.01115
Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587
Chen CH, Ramanan D (2017) 3D human pose estimation = 2D pose estimation + matching. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 5759–5767. https://doi.org/10.1109/CVPR.2017.610
Chen Y, Zhang Z, Yuan C, Li B, Deng Y, Hu W (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 13,339–13,348. https://doi.org/10.1109/ICCV48922.2021.01311
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2019) Panoptic-deeplab. In: ICCV COCO + Mapillary joint recognition challenge workshop
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2020) Panoptic-deeplab: a simple, strong, and fast baseline for bottom-up panoptic segmentation. In: CVPR
Ciaparrone G, Luque sánchez F, Tabik S, Troiano L, Tagliaferri R, Herrera F (2020) Deep learning in video multi-object tracking: a survey. Neurocomputing 381:61–88. https://doi.org/10.1016/j.neucom.2019.11.023https://doi.org/10.1016/j.neucom.2019.11.023
Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. Adv Neural Inf Process Syst:379–387
Dang Q, Yin J, Wang B, Zheng W (2021) Deep learning based 2D human pose estimation: a survey. IEEE Trans Pattern Anal Mach Intell 24(6):663–676. https://doi.org/10.26599/TST.2018.9010100
Das S, Sharma S, Dai R, Brémond F, Thonnat M (2020) VPN: learning video-pose embedding for activities of daily living. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 12354 LNCS, pp 72–90. https://doi.org/10.1007/978-3-030-58545-7_5
Ding Z, Wang P, Ogunbona PO, Li W (2017) Investigation of different skeleton features for CNN-based 3D action recognition. In: 2017 IEEE international conference on multimedia and expo workshops, ICMEW 2017, pp 617–622. https://doi.org/10.1109/ICMEW.2017.8026286
Ding X, Yang K, Chen W (2019) An attention-enhanced recurrent graph convolutional network for skeleton-based action recognition. ACM Int Conf Proc Series:79–84, https://doi.org/10.1145/3372806.3372814
Duan H, Wang J, Chen K, Lin D (2022) PYSKL: towards good practices for skeleton action recognition. arXiv:2205.09443
Duan H, Zhao Y, Chen K, Lin D, Dai B (2021) Revisiting skeleton-based action recognition. arXiv:2104.13586, (1)
Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A (2015) The pascal visual object classes challenge: a retrospective. Int J Comput Vis 111(1):98–136
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2007) The pascal visual object classes challenge 2007 results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. Accessed 05 April 2021
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes challenge 2010 results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html. Accessed 05 April 2021
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2012) The pascal visual object classes challenge 2012 results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. Accessed 05 April 2021
Fang HS, Xu Y, Wang W, Liu X, Zhu SC (2018) Learning pose grammar to encode human body configuration for 3D pose estimation. In: Thirty-second AAAI conference on artificial intelligence
Georgakis G, Li R, Karanam S, Chen T, Košecká J, Wu Z (2020) Hierarchical kinematic human mesh recovery. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 12362 LNCS, pp 768–784. https://doi.org/10.1007/978-3-030-58520-4_45
(2019). Geeks forgeeks: linear regression (python implementation). https://www.geeksforgeeks.org/linear-regression-python-implementation/,. Accessed 4 April 2019
(2019). Geometric: geometric transformations. https://pages.mtu.edu/~shene/COURSES/cs3621/NOTES/geometry/geo-tran.html. Accessed 4 April 2019
Girshick R (2015) fast r-CNN. In: Proceedings of the IEEE international conference on computer vision, vol 2015 Inter, pp 1440–1448. https://doi.org/10.1109/ICCV.2015.169
Girshick R, Donahue J, Darrell T, Berkeley UC, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE conference on computer vision and pattern recognition, vol 1, p 5000. https://doi.org/10.1109/CVPR.2014.81
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 580–587. https://doi.org/10.1109/CVPR.2014.81
Gruosso M, Capece N, Erra U (2020) Human segmentation in surveillance video with deep learning. Multimed Tools Appl
Haq EU, Jianjun H, Li K, Haq HU (2020) Human detection and tracking with deep convolutional neural networks under the constrained of noise and occluded scenes. Multimed Tools Appl 79(41-42):30,685–30,708. https://doi.org/10.1007/s11042-020-09579-x
Haque MF, Lim HY, Kang DS (2019) Object detection based on vgg with resnet network. In: 2019 International conference on electronics, information, and communication (ICEIC). Institute of electronics and information engineers (IEIE), pp 1–3
Harshall L (2019) Understanding semantic segmentation with unet, https://towardsdatascience.com/understanding-semantic-segmentation-with/-unet-6be4f42d4b47. Accessed 4 January 2021
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-CNN. In: ICCV
He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916. https://doi.org/10.1109/TPAMI.2015.2389824
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, 27-30 June 2016. IEEE computer society, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Helten T, Baak A, Bharaj G, Muller M, Seidel HP, Theobalt C (2013) Personalization and evaluation of a real-time depth-based full body tracker. In: Proceedings - 2013 international conference on 3D vision, 3DV 2013, pp 279–286. https://doi.org/10.1109/3DV.2013.44
Hossain MRI, Little JJ (2018) Exploiting temporal information for 3D human pose estimation. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 11214 LNCS, pp 69–86. https://doi.org/10.1007/978-3-030-01249-6_5
Hu G, Cui B, Yu S (2019) Skeleton-based action recognition with synchronous local and non-local spatio-temporal learning and frequency attention. In: Proceedings - IEEE international conference on multimedia and expo, vol 2019-July, pp 1216–1221. https://doi.org/10.1109/ICME.2019.00212
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, Murphy K (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 3296–3305. https://doi.org/10.1109/CVPR.2017.351
Hung GL, Sahimi MSB, Samma H, Almohamad TA, Lahasan B (2020) Faster R-CNN deep learning model for pedestrian detection from drone images. In: SN computer science. Springer Singapore, vol 1, pp 1–9. https://doi.org/10.1007/s42979-020-00125-y
Ionescu C, Papava D, Olaru V, Sminchisescu C (2014) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell 36(7):1325–1339
Iskakov K, Burkov E, Lempitsky VS, Malkov Y (2019) Learnable triangulation of human pose. CoRR arXiv:1905.05754
Jen-Kai T, Chen-Chien H, Wei-Yen W, Shao-Kang H (2020) Deep learning-based real-time multiple-person action recognition system sensors. https://doi.org/10.3390/s20174758
Ji X, Fang Q, Dong J, Shuai Q, Jiang W, Zhou X (2020) A survey on monocular 3D human pose estimation. Virtual Reality and Intelligent Hardware 2(6):471–500. https://doi.org/10.1016/j.vrih.2020.04.005
Jocher G (2021) Head and person detection model, https://github.com/deepakcrk/yolov5-crowdhuman. Accessed 6 Dec 2021
Jonathan L, Evan S, Trevor D (2015) Fully convolutional networks for semantic segmentation. In: Inproceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Khan G, Tariq Z, Usman Ghani Khan M (2019) Multi-Person tracking based on faster R-CNN and deep appearance features. Vis Object Tracking Deep Neural Netw:1–23, https://doi.org/10.5772/intechopen.85215https://doi.org/10.5772/intechopen.85215
Kim BG, Park DJ (2004) Unsupervised video object segmentation and tracking based on new edge features. Pattern Recognit Lett (Elsevier) 25:1731–1742. https://doi.org/10.1016/j.patrec.2004.07.009
Kirillov A, Wu Y, He K, Girshick R (2019) Pointrend: image segmentation as rendering
Kocabas M, Karagoz S, Akbas E (2019) Self-supervised learning of 3D human pose using multi-view geometry. In: IEEE computer vision and pattern recognition, arXiv:1903.02330
Kong Y, Fu Y (2022) Human action recognition and prediction: a survey. Int J Comput Vis 130(5):1366–1401. https://doi.org/10.1007/s11263-022-01594-9
Krizhevsky A, Sutskever I, Hinton GE (2012) Handbook of approximation algorithms and metaheuristics. In: NIPS’12: proceedings of the 25th international conference on neural information processing systems, pp 1–1432. https://doi.org/10.1201/9781420010749
Kundu JN, Seth S, Rahul MV, Rakesh M, Babu RV, Chakraborty A (2020) Kinematic-structure-preserved representation for unsupervised 3d human pose estimation. In: AAAI 2020 - 34Th AAAI conference on artificial intelligence, pp 11,312–11,319. https://doi.org/10.1609/aaai.v34i07.6792
Laplaza Galindo J (2018) Tracking and approaching people using deep learning techniques. In: A thesis presented for the degree of master universitari en enginyeria industrial, september
Leal-Taixe L, Milan A, Reid I, Roth S, Schindler K (2015) MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942 pp 1–15
Lee Y, Hwang JW, Lee S, Bae Y, Park J (2019) An energy and gpu-computation efficient backbone network for real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops
Lee K, Lee I, Lee S (2018) Propagating LSTM: 3D pose estimation based on joint interdependency. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 11211 LNCS, pp 123–141. https://doi.org/10.1007/978-3-030-01234-2_8
Lee Y, Park J (2020) Centermask: real-time anchor-free instance segmentation. In: CVPR
Li S, Chan AB (2014) 3D human pose estimation from monocular images with deep convolutional neural network. In: Asian conference on computer vision. https://doi.org/10.1007/978-3-319-16808-1_23
Li Y, Chen Y, Wang N, Zhang Z (2019) Scale-aware trident networks for object detection
Li C, Hee Lee G (2019) Generating multiple hypotheses for 3d human pose estimation with mixture density network. In: The IEEE conference on computer vision and pattern recognition (CVPR)
Li C, Lee GH (2019) Generating multiple hypotheses for 3D human pose estimation with mixture density network. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). arXiv:1904.05547
Li W, Liu H, Ding R, Liu M, Wang P, Yang W (2022) exploiting temporal contexts with strided transformer for 3D human pose estimation. IEEE Trans Multimed:1–13, https://doi.org/10.1109/TMM.2022.3141231
Li Y, Xia R, Liu X, Huang Q (2019) Learning shape-motion representations from geometric algebra spatio-temporal model for skeleton-based action recognition. In: Proceedings - IEEE international conference on multimedia and expo, vol 2019-July, pp 1066–1071. https://doi.org/10.1109/ICME.2019.00187
Li C, Xie C, Zhang B, Han J, Zhen X, Chen J (2021) Memory attention networks for skeleton-based action recognition. IEEE Trans Neural Netw Learn Syst:1639–1645, https://doi.org/10.1109/TNNLS.2021.3061115
Li M, Yu C, Wang X (2020) Skeleton-based action recognition with a triple-stream graph convolutional network. In: ACM international conference proceeding series, pp 524–528. https://doi.org/10.1145/3443467.3443809
Li S, Zhang W, Chan AB (2017) Maximum-margin structured learning with deep networks for 3D human pose estimation. Int J Comput Vis 122 (1):149–168. https://doi.org/10.1007/s11263-016-0962-x
Liang D, Fan G, Lin G, Chen W, Pan X, Zhu H (2019) Three-stream convolutional neural network with multi-task and ensemble learning for 3d action recognition. In: IEEE Computer society conference on computer vision and pattern recognition workshops, vol 2019-june, pp 934–940. https://doi.org/10.1109/CVPRW.2019.00123
Liefeng B, Cristian S (2010) Twin gaussian processes for structured prediction. Int J Comput Vis, vol 87
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 8693 LNCS, pp 740–755
(2019). Linear: linear regression, https://machinelearningcoban.com/2016/12/28/linearregression/. Accessed 4 April 2019
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: European conference on computer vision, vol 9905 LNCS, pp 21–37. https://doi.org/10.1007/978-3-319-46448-0_2
Liu F, Dai Q, Wang S, Zhao L, Shi X, Qiao J (2020) Multi-relational graph convolutional networks for skeleton-based action recognition. In: Proceedings - 2020 IEEE international symposium on parallel and distributed processing with applications, pp 474–480. https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00085
Liu J, Shahroudy A, Perez M, Wang G, Duan LY, Kot AC (2020) NTU RGB+d 120: a large-scale benchmark for 3D human activity understanding. In: IEEE transactions on pattern analysis and machine intelligence, vol 42, pp 2684–2701. https://doi.org/10.1109/TPAMI.2019.2916873
Liu Z, Zhang H, Chen Z, Wang Z, Ouyang W (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 140–149. https://doi.org/10.1109/CVPR42600.2020.00022
Martinez J, Hossain R, Romero J, Little JJ (2017) A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 2659–2668. https://doi.org/10.1109/ICCV.2017.288
Mehta D, Rhodin H, Casas D, Fua P, Sotnychenko O, Xu W, Theobalt C (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In: 2017 fifth international conference on 3D vision (3DV)
Mehta D, Sridhar S, Sotnychenko O, Rhodin H, Shafiei M, Seidel HP, Xu W, Casas D, Theobalt C (2017) Vnect: real-time 3d human pose estimation with a single rgb camera. http://gvv.mpi-inf.mpg.de/projects/VNect/. Accessed 05 April 2021
Moon G, Chang JY, Lee KM (2019) Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In: Proceedings of the IEEE international conference on computer vision, vol 2019-October, pp 10,132–10,141. https://doi.org/10.1109/ICCV.2019.01023
Neverova N, Novotny D, Vedaldi A (2019) Correlated uncertainty for learning dense correspondences from noisy labels
Nibali A, He Z, Morgan S, Prendergast L (2019) 3D human pose estimation with 2D marginal heatmaps. In: Proceedings - 2019 IEEE winter conference on applications of computer vision, WACV 2019, Figure 1, pp 1477–1485. https://doi.org/10.1109/WACV.2019.00162
Nie Q, Liu Z, Liu Y (2020) Unsupervised 3D human pose representation with viewpoint and pose disentanglement. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 12364 LNCS, pp 102–118. https://doi.org/10.1007/978-3-030-58529-7_7
Nie BX, Wei P, Zhu SC (2017) Monocular 3D human pose estimation by predicting depth on joints. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 3467–3475. https://doi.org/10.1109/ICCV.2017.373
Omran M, Lassner C, Pons-Moll G, Gehler P, Schiele B (2018) Neural body fitting: unifying deep learning and model based human pose and shape estimation. In: Proceedings - 2018 international conference on 3D vision, 3DV 2018, pp 484–494. https://doi.org/10.1109/3DV.2018.00062
Oreifej O, Liu Z (2013) HON4d: histogram of oriented 4D normals for activity recognition from depth sequences. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 716–723. https://doi.org/10.1109/CVPR.2013.98
Papandreou G, Zhu T, Chen LC, Gidaris S, Tompson J, Murphy K (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: ECCV
Pavlakos G, Zhou X, Derpanis KG, Daniilidis K (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 1263–1272. https://doi.org/10.1109/CVPR.2017.139
Pavllo D, Feichtenhofer C, Grangier D, Auli M (2019) 3d Human pose estimation in video with temporal convolutions and semi-supervised training. In: Conference on computer vision and pattern recognition (CVPR)
Pavllo D, Grangier D, Auli M (2018) Quaternet: a quaternion-based recurrent model for human motion. In: British machine vision conference (BMVC)
Qin Z, Liu Y, Ji P, Kim D, Wang L, McKay B, Anwar S, Gedeon T (2021) Fusing higher-order features in graph neural networks for skeleton-based action recognition. arXiv:2105.01563 pp 1–15
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Computer vision and pattern recognition
Redmon J, Farhadi A (2016) Yolo9000: better, faster, stronger. arXiv:1612.08242
Redmon J, Farhadi A (2018) Yolov3: an incremental improvement
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems 28, pp 91–99
Ren B, Liu M, Ding R, Liu H (2020) A survey on 3d skeleton-based action recognition using learning method. arXiv:2002.05907, pp 1–8
Renuka J (2021) Accuracy, precision, recall and f1 score: interpretation of performance measures. Accessed 4 January 2016
Rhodin H, Constantin V, Katircioglu I, Salzmann M, Fua P (2019) Neural scene decomposition for multi-person motion capture. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 7695–7705. https://doi.org/10.1109/CVPR.2019.00789
Rhodin H, Salzmann M, Fua P (2018) Unsupervised geometry-aware representation for 3D human pose estimation. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 11214 LNCS, pp 765–782. https://doi.org/10.1007/978-3-030-01249-6_46
Riza Alp Guler Natalia Neverova IK (2018) Densepose: dense human pose estimation in the wild
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis (IJCV) 115(3):211–252. https://doi.org/10.1007/s11263-015-0816-y
Sanchez S, Romero H, Morales A (2020) A review: comparison of performance metrics of pretrained models for object detection using the tensorflow framework. In: IOP Conference series materials science and engineering
Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In: CVPR
Shahroudy A, Liu J, Ng TT, Wang G (2016) NTU RGB+d: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2016-December, pp 1010–1019. https://doi.org/10.1109/CVPR.2016.115
Shao S, Zhao Z, Li B, Xiao T, Yu G, Zhang X, Sun J (2018) CrowdHuman: a benchmark for detecting human in a crowd. arXiv:1805.00123, pp 1–9
Shi L, Zhang Y, Cheng J, Lu H (2019) Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 7904–7913. https://doi.org/10.1109/CVPR.2019.00810
Sigal L, Balan AO, Black MJ (2010) HUMAN EVA : synchronized video and motion capture dataset human motion. Int J Comput Vis 87(1):4–27. https://doi.org/10.1007/s11263-009-0273-6
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd International conference on learning representations, ICLR 2015 - conference track proceedings, pp 1–14
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations
Singh M, Basu A, Mandal MK (2008) Human activity recognition based on silhouette directionality. IEEE Trans Circuits Syst Video Technol 18 (9):1280–1292. https://doi.org/10.1109/TCSVT.2008.928888
Singh M, Mandai M, Basu A (2005) Pose recognition using the radon transform. Midwest Symposium on Circuits Syst 2005:1091–1094. https://doi.org/10.1109/MWSCAS.2005.1594295
Song L, Yu G, Yuan J, Liu Z (2021) Journal of visual communication and image representation human pose estimation and its application to action recognition : a survey. J Vis Commun Image Representation 76:103,055. https://doi.org/10.1016/j.jvcir.2021.103055
Song YF, Zhang Z, Shan C, Wang L (2020) Stronger, faster and more explainable: a graph convolutional baseline for skeleton-based action recognition. In: MM 2020 - proceedings of the 28th ACM international conference on multimedia, pp 1625–1633. https://doi.org/10.1145/3394171.3413802
Song YF, Zhang Z, Wang L (2019) Richly activated graph convolutional network for action recognition with incomplete skeletons. Proc Int Conf Image Process ICIP 2019:1–5. https://doi.org/10.1109/ICIP.2019.8802917
Sun X, Xiao B, Wei F, Liang S, Wei Y (2018) Integral human pose regression. In: Eccv
Tekin B, Katircioglu I, Salzmann M, Lepetit V, Fua P (2016) Structured prediction of 3D human pose with deep neural networks. In: British machine vision conference 2016, BMVC 2016, vol 2016-september, pp 130.1–130.11. https://doi.org/10.5244/C.30.130
Tekin B, Marquez-Neila P, Salzmann M, Fua P (2017) learning to fuse 2D and 3D image cues for monocular body pose estimation. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 3961–3970. https://doi.org/10.1109/ICCV.2017.425
Thanh NT, Húng LV, Công PT (2019) An evaluation of pose estimation in video of traditional martial arts presentation. J Res Develop Inf Commun Technol 2019(2):114–126. https://doi.org/10.32913/mic-ict-research.v2019.n2.864
Tian Z, Shen C, Chen H, He T (2019) FCOS: fully convolutional one-stage object detection. In: Proceeding international conference computer vision (ICCV)
Tian Z, Shen C, Chen H, He T (2021) FCOS: a simple and strong anchor-free object detector
Tome D, Russell C, Agapito L (2017) Lifting from the deep: convolutional 3d pose estimation from a single image. In: The IEEE conference on computer vision and pattern recognition (CVPR)
Tome D, Russell C, Agapito L (2017) Lifting from the deep: convolutional 3D pose estimation from a single image. In: Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January, pp 5689–5698. https://doi.org/10.1109/CVPR.2017.603
Véges M, Varga V, Lő rincz A (2018) 3d human pose estimation with siamese equivariant embedding. arXiv:1809.07217
Wandt B, Rosenhahn B (2019) Repnet: weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In: Computer vision and pattern recognition (CVPR)
Wandt B, Rosenhahn B (2019) Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. CoRR arXiv:1902.09868
Wang H (2017) Detection of humans in video streams using convolutional neural networks. Degree Project Compu Sci Eng
Wang L, Chen Y, Guo Z, Qian K, Lin M, Li H, Ren JS (2019) Generalizing monocular 3d human pose estimation in the wild. arXiv:1904.05512
Wang J, Huang S, Wang X, Tao D (2019) Not all parts are created equal: 3D pose estimation by modeling bi-directional dependencies of body parts. In: Proceedings of the IEEE international conference on computer vision, vol 2019-Octob, pp 7770–7779. https://doi.org/10.1109/ICCV.2019.00786
Wang K, Lin L, Jiang C, Qian C, Wei P (2019) 3d Human pose machines with self-supervised learning. IEEE Trans Pattern Anal Mach Intell
Wang J, Liu Z, Wu Y, Yuan J (2012) Mining actionlet ensemble for action recognition with depth cameras. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 1290–1297. https://doi.org/10.1109/CVPR.2012.6247813
Wang J, Tan S, Zhen X, Xu S, Zheng F, He Z, Shao L (2021) Deep 3d human pose estimation: a review. Comput Vis Image Understand, p 103225
Wang Y, Wang T (2020) Cycle fusion network for multi-person pose estimation. J Phys Conf Series, vol 1550(3)
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 9912 LNCS, pp 20–36. https://doi.org/10.1007/978-3-319-46484-8_2
Wang X, Zhong Y, Jin L, Xiao Y (2019) Scale adaptive graph convolutional network for skeleton-based action recognition. In: CVPR19, vol 55, pp 306–312. https://doi.org/10.11784/tdxbz202012073
Watada J, Musa Z, Jain LC, Fulcher J (2010) Human tracking: a state-of-art survey. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 6277 LNAI, pp 454–463. https://doi.org/10.1007/978-3-642-15390-7_47
Willett NS, Shin HV, Jin Z, Li W, Finkelstein A (2020) Pose2Pose: pose selection and transfer for 2d character animation. In: International conference on intelligent user interfaces, proceedings IUI, pp 88–99. https://doi.org/10.1145/3377325.3377505
Wojke N, Bewley A (2018) Deep cosine metric learning for person re-identification. In: 2018 IEEE Winter conference on applications of computer vision (WACV). IEEE, pp 748–756. https://doi.org/10.1109/WACV.2018.00087
Wojke N, Bewley A, Paulus D (2017) Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International conference on image processing (ICIP). IEEE, pp 3645–3649. https://doi.org/10.1109/ICIP.2017.8296962
Wu Y, Kirillov A, Massa F, Lo WY, Girshick R (2019) Detectron2. https://github.com/facebookresearch/detectron2. Accessed 05 April 2021
Xu Y, Cheng J, Wang L, Xia H, Liu F, Tao D (2018) Ensemble one-dimensional convolution neural networks for skeleton-based action recognition. IEEE Signal Process Lett 25(7):1044–1048. https://doi.org/10.1109/LSP.2018.2841649
Xu J, Wang R, Rakheja V (2019) Literature review: human segmentation with static camera. arXiv:1910.12945v1, pp 1–11
Xu J, Yu Z, Ni B, Yang J, Yang X, Zhang W (2020) Deep kinematics analysis for monocular 3D human pose estimation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 896–905. https://doi.org/10.1109/CVPR42600.2020.00098
Xu Y, Zhou X, Chen S, Li F (2019) Deep learning for multiple object tracking: a survey. IET Comput Vis 13(4):411–419. https://doi.org/10.1049/iet-cvi.2018.5598
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. 32nd AAAI Conf Artif Intell AAAI vol 2018, pp 7444–7452
Yang F, Wu Y, Sakti S, Nakamura S (2019) Make skeleton-based action recognition model smaller, faster and better. In: 1st ACM international conference on multimedia in asia, MMAsia 2019, vol 15, pp 1–6. https://doi.org/10.1145/3338533.3366569
Yao R, Lin G, Xia S, Zhao J, Zhou Y (2019) Video object segmentation and tracking: a survey vol 1(1)
Ye M, Shen Y, Du C, Pan Z, Yang R (2016) Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. IEEE Trans Pattern Anal Mach Intell 38(8):1517–1532. https://doi.org/10.1109/TPAMI.2016.2557783
Yuan Y, Chu J, Leng L, Miao J, Kim BG (2020) A scale-adaptive object-tracking algorithm with occlusion detection. EURASIP J Image Video Process (Springer)
Zeng A, Sun X, Yang L, Zhao N, Liu M, Xu Q (2021) Learning skeletal graph neural networks for hard 3D pose estimation. In: Proceedings of the IEEE international conference on computer vision, pp 11416–11425. https://doi.org/10.1109/ICCV48922.2021.01124
Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N (2019) View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans Pattern Anal Mach Intell 41(8):1963–1978. https://doi.org/10.1109/TPAMI.2019.2896631
Zhang P, Lan C, Zeng W, Xing J, Xue J, Zheng N (2020) Semantics-guided neural networks for efficient skeleton-based human action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 1109–1118. https://doi.org/10.1109/CVPR42600.2020.00119
Zhang SH, Li R, Dong X, Rosin P, Cai Z, Han X, Yang D, Huang H, Hu SM (2019) Pose2Seg: detection free human instance segmentation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 889–898. https://doi.org/10.1109/CVPR.2019.00098
Zhang Z, Liu S, Liu S, Han L, Shao Y, Zhou W (2015) Human action recognition using salient region detection in complex scenes. Lecture Notes Electr Eng 322:565–572. https://doi.org/10.1007/978-3-319-08991-1_58
Zhang W, Liu Z, Zhou L, Leung H, Chan AB (2017) Martial arts, dancing and sports dataset: a challenging stereo and multi-view dataset for 3D human pose estimation. Image Vis Comput, vol 61. https://doi.org/10.1016/j.imavis.2017.02.002
Zhang H, Sciutto C, Agrawala M, Fatahalian K (2021) Vid2Player: controllable video sprites that behave and appear like professional tennis players. ACM Trans Graph 40(3):1–16. https://doi.org/10.1145/3448978
Zhang W, Shang L, Chan AB (2014) A robust likelihood function for 3D human pose tracking. IEEE Trans Image Process 23(12):5374–5389
Zhang HB, Zhang YX, Zhong B, Lei Q, Yang L, Du JX, Chen DS (2019) A comprehensive survey of vision-based human action recognition methods. Sensors (Switzerland) 19(5):1–20. https://doi.org/10.3390/s19051005
Zhang X, Zou J, He K, Sun J (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Trans Pattern Anal Mach Intell 38(10):1943–1955. https://doi.org/10.1109/TPAMI.2015.2502579
Zhao L, Peng X, Tian Y, Kapadia M, Metaxas DN (2019) Semantic graph convolutional networks for 3D human pose regression. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2019-June, pp 3420–3430. https://doi.org/10.1109/CVPR.2019.00354
Zheng C, Wu W, Chen C, Yang T, Zhu S, Shen J, Kehtarnavaz N, Shah M (2018) Deep learning-based human pose estimation: a survey. J ACM, vol 37(4)
Zheng C, Zhu S, Mendieta M, Yang T, Chen C, Ding Z (2021) 3D human pose estimation with spatial and temporal transformers. In: Proceedings of the IEEE international conference on computer vision (ICCV), vol 1. arXiv:2103.10455
Zhou K, Han X, Jiang N, Jia K, Lu J (2019) HEMlets pose: learning part-centric heatmap triplets for accurate 3D human pose estimation. In: Proceedings of the IEEE international conference on computer vision, vol 2019-October, pp 2344–2353. https://doi.org/10.1109/ICCV.2019.00243
Zhou X, Huang Q, Sun X, Xue X, Wei Y (2017) Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: Proceedings of the IEEE international conference on computer vision, vol 2017-October, pp 398–407. https://doi.org/10.1109/ICCV.2017.51
Zhu J, Zou W, Xu L, Hu Y, Zhu Z, Chang M, Huang J, Huang G, Du D (2018) Action machine: rethinking action recognition in trimmed videos. arXiv:1812.05770
Funding
This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.01-2019.315.
Ethics declarations
Conflict of Interests
This article is the author’s own survey and is not related to any organization or individual. It is part of a series of studies on 3D human pose estimation and human activity recognition in 3D space.
About this article
Cite this article
Le, VH. Deep learning-based for human segmentation and tracking, 3D human pose estimation and action recognition on monocular video of MADS dataset. Multimed Tools Appl 82, 20771–20818 (2023). https://doi.org/10.1007/s11042-022-13921-w
DOI: https://doi.org/10.1007/s11042-022-13921-w