SimpleMeshNet: end to end recovery of 3d body mesh with one fully connected layer

Sun, Wenzhang; Ma, Shaopeng; He, Xuanfang; Ma, Qinwei

doi:10.1007/s11554-022-01214-2

Original Research Paper
Published: 28 April 2022

Journal of Real-Time Image Processing, volume 19, pages 703–713 (2022)

Wenzhang Sun¹, Shaopeng Ma², Xuanfang He¹ & Qinwei Ma¹ (ORCID: orcid.org/0000-0003-2792-8389)

Abstract

Recent research suggests that reconstructing high-precision 3D human body shape and pose with neural networks requires not only large datasets with ground-truth 3D annotations but also sophisticated network structures that exploit spatial and temporal information. These strategies also make training more difficult and time-consuming. We propose SimpleMeshNet, the simplest frame-based model to date, to estimate 3D human body meshes from in-the-wild images. On the one hand, SimpleMeshNet extracts features with a pre-trained ResNet and then uses just one fully connected layer as the regressor that outputs the SMPL model parameters; on the other hand, it performs well and runs fast. To reduce overfitting when ground-truth SMPL annotations are missing, SimpleMeshNet uses two different training strategies, one for training with ground-truth SMPL parameter annotations and one for training without them. Without bells and whistles, the network is easy to train and the results are convincing. For comparison with other methods, SimpleMeshNet's performance is measured on a video containing five people using an RTX 3090 GPU: SimpleMeshNet alone reaches 107 frames per second, whereas the whole system reaches 45 frames per second when YOLOv3-416 is used as the tracker. Its performance rivals that of the leading algorithms and is sometimes better. Moreover, SimpleMeshNet can process in-the-wild images captured by a variety of devices, including cell phones, monitors, and cameras.
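The regression step described in the abstract, one fully connected layer mapping image features to SMPL parameters, can be illustrated with a minimal sketch. It is not the authors' implementation. The 2048-dimensional feature size (typical of a pooled ResNet-50 output), the three weak-perspective camera values, the output ordering, and all function names are assumptions made for illustration. The 72 pose values (24 joints × 3 axis-angle components) and 10 shape coefficients follow the standard SMPL parameterization.

```python
import random

# Assumed sizes; the abstract does not state them.
FEATURE_DIM = 2048  # assumption: pooled ResNet-50 feature vector
NUM_POSE = 72       # SMPL pose: 24 joints x 3 axis-angle values
NUM_SHAPE = 10      # SMPL shape coefficients (betas)
NUM_CAM = 3         # assumption: weak-perspective camera (scale, tx, ty)
OUT_DIM = NUM_POSE + NUM_SHAPE + NUM_CAM


def init_linear(in_dim, out_dim, seed=0):
    """Create weights and a zero bias for one fully connected layer."""
    rng = random.Random(seed)
    bound = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-bound, bound) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(features, weights, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def split_params(output):
    """Split the flat output into (pose, shape, camera) in the assumed order."""
    pose = output[:NUM_POSE]
    shape = output[NUM_POSE:NUM_POSE + NUM_SHAPE]
    cam = output[NUM_POSE + NUM_SHAPE:]
    return pose, shape, cam


def regress_smpl(features, weights, bias):
    """Map one image feature vector to SMPL pose, shape, and camera values."""
    return split_params(linear(features, weights, bias))
```

A call such as `regress_smpl(features, *init_linear(FEATURE_DIM, OUT_DIM))` returns three lists of lengths 72, 10, and 3. In a trained system the weights would be learned and the features would come from the ResNet backbone rather than being supplied by hand.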

Funding

This work was funded by the National Natural Science Foundation of China (11772053, Qinwei Ma; 11727801, Shaopeng Ma).

Author information

Authors and Affiliations

School of Aerospace Engineering, Beijing Institute of Technology, Beijing, 100081, China
Wenzhang Sun, Xuanfang He & Qinwei Ma
Department of Engineering Mechanics, Shanghai Jiaotong University, Shanghai, 200240, China
Shaopeng Ma

Corresponding author

Correspondence to Qinwei Ma.

About this article

Cite this article

Sun, W., Ma, S., He, X. et al. SimpleMeshNet: end to end recovery of 3d body mesh with one fully connected layer. J Real-Time Image Proc 19, 703–713 (2022). https://doi.org/10.1007/s11554-022-01214-2

Received: 09 August 2021
Accepted: 13 March 2022
Published: 28 April 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s11554-022-01214-2
