Abstract
Accurate and temporally consistent modeling of human bodies is essential for a wide range of applications, including character animation, understanding human social behavior, and AR/VR interfaces. Capturing human motion accurately from a monocular image sequence remains challenging; modeling quality is strongly influenced by the temporal consistency of the captured body motion. Our work presents an elegant solution for integrating temporal constraints during fitting, which increases both the temporal consistency and the robustness of the optimization. In detail, we derive the parameters of a sequence of body models representing the shape and motion of a person. We optimize these parameters over the complete image sequence, fitting a single consistent body shape while imposing temporal consistency on the body motion, assuming body joint trajectories to be locally linear over short time spans. Our approach enables the derivation of realistic 3D body models from image sequences, including jaw pose, facial expression, and articulated hands. Our experiments show that our approach accurately estimates body shape and motion, even for challenging movements and poses. Furthermore, we apply it to the particular application of sign language analysis, where accurate and temporally consistent motion modeling is essential, and show that the approach is well suited to this kind of application.
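The paper's actual energy terms are not reproduced here; as a minimal sketch of the locally-linear-trajectory assumption described above, a second-order finite-difference penalty on 3D joint positions vanishes exactly when each joint moves along a straight line at constant speed, and grows with jitter. The function name and array layout below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def temporal_consistency_loss(joints):
    """Penalize deviation from locally linear joint trajectories.

    joints: array of shape (T, J, 3) holding 3D positions of J body
    joints over T uniformly spaced frames.
    """
    # Second temporal difference j[t-1] - 2*j[t] + j[t+1] is the
    # discrete acceleration; it is zero for linear motion.
    accel = joints[:-2] - 2.0 * joints[1:-1] + joints[2:]
    return float(np.sum(accel ** 2))
```

In a sequence-level fitting scheme such as the one the abstract describes, a term like this would be added (with some weight) to the per-frame data terms, so the optimizer trades off image evidence against smooth motion.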
![](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs41095-022-0272-x/MediaObjects/41095_2022_272_Fig1_HTML.jpg)
Acknowledgements
This work was partly funded by the European Union’s Horizon 2020 Research and Innovation Programme under Agreement No. 952147 (Invictus) as well as the German Federal Ministry of Education and Research (BMBF) through the Research Program MoDL under Contract No. 01 IS 20044.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
The authors have no competing interests relevant to the content of this article to declare.
Additional information
Alexandra Zimmer is a student assistant in the Computer Vision & Graphics Group at Fraunhofer HHI. She is a master's student in computer science at Technische Universität Berlin and received her bachelor's degree in computer science from Humboldt-Universität zu Berlin (HU Berlin) in 2020. Her research interests are body motion capture, 3D image and video inference, deep learning, and computer vision in robotics.
Anna Hilsmann heads the Vision & Imaging Technologies Department as well as the Computer Vision & Graphics Group at Fraunhofer HHI. She received her Dipl.-Ing. degree in electrical engineering and information technology from RWTH Aachen in 2006 and her Dr.-Ing. degree in computer science from HU Berlin in 2014. Her main research interests cover 3D image and video analysis, model-based deformable tracking and 3D reconstruction, as well as image- and video-based rendering, animation, and editing.
Wieland Morgenstern joined Fraunhofer HHI as a research associate in 2018. He is performing research in mesh sequence registration, human body model registration, and animation of volumetric video. Before coming to Fraunhofer HHI, Wieland worked as a computer vision engineer at VideoStitch/Orah from 2014 to 2018. He received his B.Sc. degree in 2012 and his M.Sc. degree in 2015 from Ilmenau Technical University.
Peter Eisert is a professor for visual computing at HU Berlin and heads the Vision & Imaging Technologies Department of the Fraunhofer HHI, Berlin. He received his Dr.-Ing. degree in 2000 from the University of Erlangen and worked as postdoctoral fellow at Stanford University. He has published more than 250 conference and journal papers and is an associate editor of the International Journal of Image and Video Processing as well as on the editorial board of the Journal of Visual Communication and Image Representation. His research interests include 3D image analysis and synthesis, face processing, image-based rendering, deep learning, computer vision, and computer graphics.
Electronic supplementary material
Supplementary material, approximately 38.6 MB.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zimmer, A., Hilsmann, A., Morgenstern, W. et al. Imposing temporal consistency on deep monocular body shape and pose estimation. Comp. Visual Media 9, 123–139 (2023). https://doi.org/10.1007/s41095-022-0272-x