VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data

Su, Jiajun; Wang, Chunyu; Ma, Xiaoxuan; Zeng, Wenjun; Wang, Yizhou

doi:10.1007/978-3-031-20068-7_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13666)

Included in the following conference series:

European Conference on Computer Vision

Abstract

While monocular 3D pose estimation seems to have achieved very accurate results on public datasets, the generalization ability of these methods is largely overlooked. In this work, we perform a systematic evaluation of existing methods and find that they incur notably larger errors when tested on different cameras, human poses, and appearances. To address the problem, we introduce VirtualPose, a two-stage learning framework that exploits the hidden “free lunch” specific to this task, i.e., generating an infinite number of poses and cameras for training models at no cost. To that end, the first stage transforms images into abstract geometry representations (AGR), and the second stage maps them to 3D poses. The framework addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses. VirtualPose outperforms the SOTA methods without using any paired images and 3D poses from the benchmarks, which paves the way for practical applications. Code is available at https://github.com/wkom/VirtualPose.
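
The idea of generating training inputs from virtual cameras and poses can be illustrated with a standard pinhole projection. The sketch below is not taken from the VirtualPose code base and does not reproduce the AGR, which the abstract does not specify. The function names `random_virtual_camera` and `project`, the focal-length and distance ranges, and the image size are all hypothetical choices made for illustration only.

```python
import numpy as np


def random_virtual_camera(rng, focal_range=(800.0, 1500.0), image_size=(1920, 1080)):
    """Sample a hypothetical virtual camera (intrinsics K, rotation R, translation t)."""
    f = rng.uniform(*focal_range)
    w, h = image_size
    # Pinhole intrinsics with the principal point at the image center.
    K = np.array([[f, 0.0, w / 2.0],
                  [0.0, f, h / 2.0],
                  [0.0, 0.0, 1.0]])
    # Random rotation about the vertical (y) axis.
    yaw = rng.uniform(-np.pi, np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    # Place the subject 3 to 8 meters in front of the camera (assumed range).
    t = np.array([0.0, 0.0, rng.uniform(3.0, 8.0)])
    return K, R, t


def project(pose_3d, K, R, t):
    """Project (J, 3) world-space joints to (J, 2) pixels; also return camera-space joints."""
    cam = pose_3d @ R.T + t          # world -> camera coordinates
    uvw = cam @ K.T                  # camera -> homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]    # perspective divide
    return uv, cam


rng = np.random.default_rng(0)
# Stand-in "pose": 15 random joints within a 1 m cube centered at the origin.
pose = rng.uniform(-0.5, 0.5, size=(15, 3))
K, R, t = random_virtual_camera(rng)
uv, cam = project(pose, K, R, t)
```

Each sampled camera yields a new paired example (`uv`, `cam`) from the same 3D pose at no annotation cost, which is the general sense in which virtual cameras and poses provide a "free lunch". A real pipeline would draw poses from a motion-capture source and convert the projections into whatever intermediate representation the second stage consumes.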

Acknowledgement

This work was supported in part by MOST-2018AAA0102004 and NSFC-62061136001.

Author information

Authors and Affiliations

Center for Data Science, Peking University, Beijing, China
Jiajun Su
Center on Frontiers of Computing Studies, Peking University, Beijing, China
Jiajun Su, Xiaoxuan Ma & Yizhou Wang
Department of Computer Science, Peking University, Beijing, China
Xiaoxuan Ma & Yizhou Wang
Institute for Artificial Intelligence, Peking University, Beijing, China
Yizhou Wang
Microsoft Research Asia, Beijing, China
Chunyu Wang
EIT Institute for Advanced Study, Ningbo, China
Wenjun Zeng

Corresponding author

Correspondence to Chunyu Wang.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

Cite this paper

Su, J., Wang, C., Ma, X., Zeng, W., Wang, Y. (2022). VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_4

DOI: https://doi.org/10.1007/978-3-031-20068-7_4
Published: 11 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20067-0
Online ISBN: 978-3-031-20068-7
eBook Packages: Computer Science, Computer Science (R0)
