Abstract
While monocular 3D pose estimation seems to have achieved very accurate results on the public datasets, their generalization ability is largely overlooked. In this work, we perform a systematic evaluation of the existing methods and find that they get notably larger errors when tested on different cameras, human poses and appearance. To address the problem, we introduce VirtualPose, a two-stage learning framework to exploit the hidden “free lunch” specific to this task, i.e.generating infinite number of poses and cameras for training models at no cost. To that end, the first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses. It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses. It outperforms the SOTA methods without using any paired images and 3D poses from the benchmarks, which paves the way for practical applications. Code is available at https://github.com/wkom/VirtualPose.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Cao, Z., Gao, H., Mangalam, K., Cai, Q.-Z., Vo, M., Malik, J.: Long-term human motion prediction with scene context. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 387–404. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_23
Chang, J.Y., Moon, G., Lee, K.M.: Absposelifter: absolute 3D human pose lifting network from a single noisy 2d human pose. CoRR (2019)
Cheng, Y., Wang, B., Tan, R.: Dual networks based 3d multi-person pose estimation from monocular video. IEEE Trans. Pattern Anal. Mach. Intell. (2022)
Ci, H., Ma, X., Wang, C., Wang, Y.: Locally connected network for monocular 3D human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1429–1442 (2020)
Ci, H., Wang, C., Ma, X., Wang, Y.: Optimizing network structure for 3D human pose estimation. In: ICCV, pp. 2262–2271 (2019)
Dabral, R., Gundavarapu, N.B., Mitra, R., Sharma, A., Ramakrishnan, G., Jain, A.: Multi-person 3D human pose estimation from monocular images. In: 3dv, pp. 405–414. IEEE (2019)
Fabbri, M., Lanzi, F., Calderara, S., Alletto, S., Cucchiara, R.: Compressed volumetric heatmaps for multi-person 3D pose estimation. In: CVPR, pp. 7204–7213 (2020)
Guo, Y., Ma, L., Li, Z., Wang, X., Wang, F.: Monocular 3d multi-person pose estimation via predicting factorised correction factors. In: Computer Vision and Image Understanding (CVIU), p. 103278 (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3. 6m: large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI 36(7), 1325–1339 (2013)
Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y.: Learnable triangulation of human pose. In: ICCV (2019)
Joo, H., et al.: Panoptic studio: a massively multiview system for social motion capture. In: ICCV, pp. 3334–3342 (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Li, S., Ke, L., Pratama, K., Tai, Y.W., Tang, C.K., Cheng, K.T.: Cascaded deep monocular 3d human pose estimation with evolutionary training data. In: CVPR, pp. 6173–6183 (2020)
Lin, J., Lee, G.H.: HDNet: human depth estimation for multi-person camera-space localization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12363, pp. 633–648. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58523-5_37
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Ma, X., Su, J., Wang, C., Ci, H., Wang, Y.: Context modeling in 3D human pose estimation: a unified perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6238–6247 (2021)
von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 614–631. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_37
Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: ICCV, pp. 2640–2649 (2017)
Mehta, D., et al.: Monocular 3d human pose estimation in the wild using improved CNN supervision. In: 3DV, pp. 506–516. IEEE (2017)
Mehta, D., et al.: Xnect: real-time multi-person 3D human pose estimation with a single RGB camera. TOG 39(4) (2020)
Mehta, D., et al.: Single-shot multi-person 3D pose estimation from monocular RGB. In: 3DV, pp. 120–130. IEEE (2018)
Moon, G., Chang, J.Y., Lee, K.M.: Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In: ICCV, pp. 10133–10142 (2019)
Moreno-Noguer, F.: 3D human pose estimation from a single image via distance matrix regression. In: CVPR, pp. 2823–2832 (2017)
Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: CVPR, pp. 7025–7034 (2017)
Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: CVPR, pp. 7753–7762 (2019)
Popa, A.I., Zanfir, M., Sminchisescu, C.: Deep multitask architecture for integrated 2D and 3D human sensing. In: CVPR, pp. 6289–6298 (2017)
Rogez, G., Weinzaepfel, P., Schmid, C.: LCR-net: localization-classification-regression for human pose. In: CVPR, pp. 3433–3441 (2017)
Rogez, G., Weinzaepfel, P., Schmid, C.: LCR-net++: multi-person 2D and 3D pose detection in natural images. PAMI 42(5), 1146–1161 (2019)
Sárándi, I., Linder, T., Arras, K.O., Leibe, B.: Metrabs: metric-scale truncation-robust heatmaps for absolute 3D human pose estimation. IEEE Trans. Biometr. Behav. Ident. Sci. 3(1), 16–30 (2020)
Sigal, L., Balan, A.O., Black, M.J.: Humaneva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV 87(1–2), 4 (2010)
Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 536–553. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_33
Tu, H., Wang, C., Zeng, W.: VoxelPose: towards multi-camera 3d human pose estimation in wild environment. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 197–212. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_12
Véges, M., Lőrincz, A.: Absolute human pose estimation with depth prediction network. In: IJCNN, pp. 1–7. IEEE (2019)
Véges, M., Lőrincz, A.: Multi-person absolute 3D human pose estimation with weak depth supervision. In: Farkaš, I., Masulli, P., Wermter, S. (eds.) ICANN 2020. LNCS, vol. 12396, pp. 258–270. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61609-0_21
Wandt, B., Rosenhahn, B.: RepNet: weakly supervised training of an adversarial reprojection network for 3D human pose estimation. In: CVPR, pp. 7782–7791 (2019)
Wang, C., Li, J., Liu, W., Qian, C., Lu, C.: HMOR: hierarchical multi-person ordinal relations for monocular multi-person 3D pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 242–259. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_15
Wang, C., Wang, Y., Lin, Z., Yuille, A.L.: Robust 3d human pose estimation from single images or video sequences. IEEE Trans. Pattern Anal. Mach. Intell. 41(5), 1227–1241 (2018)
Wang, C., Wang, Y., Lin, Z., Yuille, A.L., Gao, W.: Robust estimation of 3D human poses from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2361–2368 (2014)
Wu, J., et al.: 3D interpreter networks for viewer-centered wireframe modeling. Int. J. Comput. Vision 126(9), 1009–1026 (2018)
Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., Wang, X.: 3D human pose estimation in the wild by adversarial learning. In: CVPR, pp. 5255–5264 (2018)
Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3D pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In: CVPR, pp. 2148–2157 (2018)
Zanfir, A., Marinoiu, E., Zanfir, M., Popa, A.I., Sminchisescu, C.: Deep network for the integrated 3d sensing of multiple people in natural images. NIPS 31, 8410–8419 (2018)
Zhang, Y., Wang, C., Wang, X., Liu, W., Zeng, W.: Voxeltrack: multi-person 3D human pose estimation and tracking in the wild. T-PAMI (2022)
Zhang, Z., Wang, C., Qin, W., Zeng, W.: Fusing wearable Imus with multi-view images for human pose estimation: a geometric approach. In: CVPR, pp. 2200–2209 (2020)
Zhang, Z., Wang, C., Qiu, W., Qin, W., Zeng, W.: Adafuse: adaptive multiview fusion for accurate human pose estimation in the wild. IJCV 129(3), 703–718 (2021)
Zhen, J., et al.: SMAP: single-shot multi-person absolute 3D pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 550–566. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_33
Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: ICCV, pp. 398–407 (2017)
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
Zhu, L., Rematas, K., Curless, B., Seitz, S.M., Kemelmacher-Shlizerman, I.: Reconstructing NBA players. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 177–194. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_11
Acknowledgement
This work was supported in part by MOST-2018AAA0102004 and NSFC-62061136001.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Su, J., Wang, C., Ma, X., Zeng, W., Wang, Y. (2022). VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_4
Download citation
DOI: https://doi.org/10.1007/978-3-031-20068-7_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20067-0
Online ISBN: 978-3-031-20068-7
eBook Packages: Computer ScienceComputer Science (R0)