SimpleMeshNet: end to end recovery of 3d body mesh with one fully connected layer

  • Original Research Paper
  • Published in: Journal of Real-Time Image Processing (2022)

Abstract

Recent research indicates that reconstructing high-precision 3D human body shape and pose with neural networks requires not only large datasets with ground-truth 3D annotations but also sophisticated network architectures that exploit spatial and temporal information, and that these strategies make training more difficult and time-consuming. We propose SimpleMeshNet, the simplest frame-based model presented to date, for estimating the 3D human body mesh from in-the-wild images. On the one hand, SimpleMeshNet extracts features with a pre-trained ResNet and regresses the SMPL model parameters with just one fully connected layer; on the other hand, it performs well and runs fast. To reduce the risk of overfitting when ground-truth SMPL annotations are missing, SimpleMeshNet adopts two different training strategies, depending on whether ground-truth SMPL parameter annotations are available. Without bells and whistles, the network is easy to train and produces convincing results. For comparison with other methods, SimpleMeshNet's performance is measured on a video containing five persons using an RTX 3090 GPU: SimpleMeshNet alone achieves 107 frames per second, while the whole system reaches 45 frames per second with YOLOv3-416 as the tracker. Its accuracy rivals that of leading algorithms and sometimes surpasses them. Moreover, SimpleMeshNet can process in-the-wild images captured by a variety of devices, including cell phones, monitors, and cameras.
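
The abstract describes the core architecture: a pre-trained ResNet backbone for feature extraction followed by a single fully connected layer that regresses the SMPL model parameters. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the class name, the choice of ResNet-50, and the output split into 72 pose, 10 shape, and 3 weak-perspective camera values are assumptions made for illustration.

```python
# Minimal sketch, NOT the authors' code: a frame-based SMPL-parameter regressor
# in the spirit of SimpleMeshNet. Assumptions: torchvision ResNet-50 backbone,
# 72-D axis-angle pose, 10-D shape, 3-D weak-perspective camera.
import torch
import torch.nn as nn
from torchvision import models


class SimpleMeshRegressor(nn.Module):  # hypothetical name
    def __init__(self, num_pose=72, num_shape=10, num_cam=3):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to the global average pool; drop the classifier head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # The single fully connected layer that regresses all parameters at once.
        self.fc = nn.Linear(2048, num_pose + num_shape + num_cam)
        self.num_pose, self.num_shape = num_pose, num_shape

    def forward(self, img):
        feat = self.backbone(img).flatten(1)                # (B, 2048)
        out = self.fc(feat)                                  # (B, 85)
        pose = out[:, : self.num_pose]                       # SMPL pose (axis-angle)
        shape = out[:, self.num_pose : self.num_pose + self.num_shape]  # SMPL betas
        cam = out[:, self.num_pose + self.num_shape :]       # weak-perspective camera
        return pose, shape, cam


if __name__ == "__main__":
    model = SimpleMeshRegressor().eval()
    with torch.no_grad():
        pose, shape, cam = model(torch.randn(1, 3, 224, 224))
    print(pose.shape, shape.shape, cam.shape)  # (1, 72), (1, 10), (1, 3)
```

In a full pipeline, the predicted pose and shape vectors would be passed through the SMPL model to produce the body mesh, and the camera parameters would project it back onto the image for supervision.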

Funding

This work was funded by the National Natural Science Foundation of China (Grant No. 11772053, Qinwei Ma; Grant No. 11727801, Shaopeng Ma).

Author information

Corresponding author

Correspondence to Qinwei Ma.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sun, W., Ma, S., He, X. et al. SimpleMeshNet: end to end recovery of 3d body mesh with one fully connected layer. J Real-Time Image Proc 19, 703–713 (2022). https://doi.org/10.1007/s11554-022-01214-2
