Abstract
Accurate and temporally consistent modeling of human bodies is essential for a wide range of applications, including character animation, understanding human social behavior, and AR/VR interfaces. Capturing human motion accurately from a monocular image sequence remains challenging; modeling quality is strongly influenced by the temporal consistency of the captured body motion. Our work presents an elegant solution for integrating temporal constraints during fitting, which increases both the temporal consistency and the robustness of the optimization. In detail, we derive the parameters of a sequence of body models representing the shape and motion of a person. We optimize these parameters over the complete image sequence, fitting a single consistent body shape while imposing temporal consistency on the body motion, assuming body joint trajectories to be locally linear over short time spans. Our approach enables the derivation of realistic 3D body models from image sequences, including jaw pose, facial expression, and articulated hands. Our experiments show that our approach accurately estimates body shape and motion, even for challenging movements and poses. Furthermore, we apply it to the particular application of sign language analysis, where accurate and temporally consistent motion modeling is essential, and show that the approach is well suited to this kind of application.
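The paper's actual energy terms are not reproduced here; as a minimal sketch of the locally-linear-trajectory assumption described above, a second-order finite-difference penalty on 3D joint positions vanishes exactly when each joint moves along a straight line at constant speed, and grows with jitter. The function name and array layout below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def temporal_consistency_loss(joints):
    """Penalize deviation from locally linear joint trajectories.

    joints: array of shape (T, J, 3) holding 3D positions of J body
    joints over T uniformly spaced frames.
    """
    # Second temporal difference j[t-1] - 2*j[t] + j[t+1] is the
    # discrete acceleration; it is zero for linear motion.
    accel = joints[:-2] - 2.0 * joints[1:-1] + joints[2:]
    return float(np.sum(accel ** 2))
```

In a sequence-level fitting scheme such as the one the abstract describes, a term like this would be added (with some weight) to the per-frame data terms, so the optimizer trades off image evidence against smooth motion.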
![](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs41095-022-0272-x/MediaObjects/41095_2022_272_Fig1_HTML.jpg)
Acknowledgements
This work was partly funded by the European Union’s Horizon 2020 Research and Innovation Programme under Agreement No. 952147 (Invictus) as well as the German Federal Ministry of Education and Research (BMBF) through the Research Program MoDL under Contract No. 01 IS 20044.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
The authors have no competing interests relevant to the content of this article to declare.
Additional information
Alexandra Zimmer is a student assistant in the Computer Vision & Graphics Group at Fraunhofer HHI. She is a master's student in computer science at Technische Universität Berlin and received her bachelor's degree in computer science from Humboldt-Universität zu Berlin (HU Berlin) in 2020. Her research interests are body motion capture, 3D image and video inference, deep learning, and computer vision in robotics.
Anna Hilsmann heads the Vision & Imaging Technologies Department as well as the Computer Vision & Graphics Group at Fraunhofer HHI. She received her Dipl.-Ing. degree in electrical engineering and information technology from RWTH Aachen in 2006 and her Dr.-Ing. degree in computer science from HU Berlin in 2014. Her main research interests cover 3D image and video analysis, model-based deformable tracking and 3D reconstruction, as well as image- and video-based rendering, animation, and editing.
Wieland Morgenstern joined Fraunhofer HHI as a research associate in 2018. He is performing research in mesh sequence registration, human body model registration, and animation of volumetric video. Before coming to Fraunhofer HHI, Wieland worked as a computer vision engineer at VideoStitch/Orah from 2014 to 2018. He received his B.Sc. degree in 2012 and his M.Sc. degree in 2015 from Ilmenau Technical University.
Peter Eisert is a professor for visual computing at HU Berlin and heads the Vision & Imaging Technologies Department of the Fraunhofer HHI, Berlin. He received his Dr.-Ing. degree in 2000 from the University of Erlangen and worked as postdoctoral fellow at Stanford University. He has published more than 250 conference and journal papers and is an associate editor of the International Journal of Image and Video Processing as well as on the editorial board of the Journal of Visual Communication and Image Representation. His research interests include 3D image analysis and synthesis, face processing, image-based rendering, deep learning, computer vision, and computer graphics.
Electronic supplementary material
Supplementary material, approximately 38.6 MB.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zimmer, A., Hilsmann, A., Morgenstern, W. et al. Imposing temporal consistency on deep monocular body shape and pose estimation. Comp. Visual Media 9, 123–139 (2023). https://doi.org/10.1007/s41095-022-0272-x