Abstract
Recent research suggests that reconstructing high-precision 3D human body shape and pose with neural networks requires not only large datasets with ground-truth 3D annotations but also sophisticated network structures that exploit spatial and temporal information. These strategies make training more difficult and time-consuming. We propose SimpleMeshNet, the simplest frame-based model to date, to estimate the 3D human body mesh from in-the-wild images. SimpleMeshNet extracts features with a pre-trained ResNet and uses just one fully connected layer as the regressor that outputs the SMPL model parameters, yet it performs well and runs fast. To reduce overfitting when ground-truth SMPL annotations are missing, SimpleMeshNet uses two different training strategies, one for data with ground-truth SMPL parameter annotations and one for data without. Without bells and whistles, the network is easy to train and the results are convincing. For comparison with other methods, SimpleMeshNet's speed is measured on a video with five persons using an RTX 3090 GPU. SimpleMeshNet alone reaches 107 frames per second, and the whole system reaches 45 frames per second when YOLOv3-416 is used as the tracker. Its performance rivals that of the leading algorithms and is sometimes better. Moreover, SimpleMeshNet can process in-the-wild images captured by a variety of devices, such as cell phones, monitors, and cameras.
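The single-layer regressor described in the abstract can be illustrated with a minimal sketch. The abstract does not give the dimensions, so the values here are assumptions: a 2048-dimensional feature vector (the pooled output size of a ResNet-50) and an 85-dimensional output split into 72 SMPL pose values, 10 SMPL shape values, and 3 weak-perspective camera values, a layout borrowed from HMR-style regressors. The helper names are hypothetical, the weights are random rather than trained, and a random vector stands in for real ResNet features.

```python
import random

# Assumed sizes; the abstract does not state them.
FEATURE_DIM = 2048                         # pooled ResNet-50 feature size
POSE_DIM, SHAPE_DIM, CAM_DIM = 72, 10, 3   # HMR-style output layout (assumption)
OUT_DIM = POSE_DIM + SHAPE_DIM + CAM_DIM   # 85 values in total


def make_fc_layer(in_dim, out_dim, seed=0):
    """Create randomly initialised weights and zero biases for one FC layer."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases


def fc_forward(features, weights, biases):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def split_smpl_params(output):
    """Slice the flat output into pose, shape, and camera parameters."""
    pose = output[:POSE_DIM]
    shape = output[POSE_DIM:POSE_DIM + SHAPE_DIM]
    camera = output[POSE_DIM + SHAPE_DIM:]
    return pose, shape, camera


# Stand-in for the feature vector a pre-trained ResNet would produce.
feature_rng = random.Random(1)
features = [feature_rng.random() for _ in range(FEATURE_DIM)]

weights, biases = make_fc_layer(FEATURE_DIM, OUT_DIM)
pose, shape, camera = split_smpl_params(fc_forward(features, weights, biases))
print(len(pose), len(shape), len(camera))
```

In a real pipeline the layer would be trained and would typically be written with a deep learning framework; the pure-Python version only shows how one linear map can produce all SMPL parameters at once.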
Funding
This work was funded by the National Natural Science Foundation of China (11772053, Qinwei Ma; 11727801, Shaopeng Ma).
Cite this article
Sun, W., Ma, S., He, X. et al. SimpleMeshNet: end to end recovery of 3d body mesh with one fully connected layer. J Real-Time Image Proc 19, 703–713 (2022). https://doi.org/10.1007/s11554-022-01214-2
DOI: https://doi.org/10.1007/s11554-022-01214-2