Abstract
Voxel-based methods have achieved promising results for multi-person 3D pose estimation from multiple cameras, but they carry a heavy computational burden, especially in large scenes. We present Faster VoxelPose, which addresses this challenge by re-projecting the feature volume onto the three two-dimensional coordinate planes and estimating the X, Y, and Z coordinates from them separately. To that end, we first localize each person with a 3D bounding box, obtained by estimating a 2D box and its height from the volume features projected onto the xy-plane and the z-axis, respectively. Then, for each person, we estimate partial joint coordinates from the three coordinate planes separately and fuse them to obtain the final 3D pose. The method is free of costly 3D-CNNs, runs ten times faster than VoxelPose, and achieves accuracy competitive with state-of-the-art methods, demonstrating its potential for real-time applications.
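The projection-and-fusion idea can be illustrated with a toy example. The following is a minimal sketch, not the method itself: it assumes a single-channel volume stored as nested lists, a max projection onto each plane, an argmax read-out, and plain averaging of the two estimates available for each coordinate. The abstract does not specify the projection operator or the fusion rule, so both are assumptions, and the function names are hypothetical.

```python
from itertools import product


def project_to_planes(volume):
    """Max-project a volume indexed volume[x][y][z] onto the xy, xz and yz planes.

    Assumption: max projection; the abstract does not name the operator.
    """
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    xy = [[max(volume[x][y][z] for z in range(nz)) for y in range(ny)] for x in range(nx)]
    xz = [[max(volume[x][y][z] for y in range(ny)) for z in range(nz)] for x in range(nx)]
    yz = [[max(volume[x][y][z] for x in range(nx)) for z in range(nz)] for y in range(ny)]
    return xy, xz, yz


def argmax_2d(plane):
    """Return the (row, col) index of the largest value in a 2D list."""
    rows, cols = len(plane), len(plane[0])
    return max(product(range(rows), range(cols)), key=lambda rc: plane[rc[0]][rc[1]])


def fuse_joint(xy, xz, yz):
    """Fuse partial estimates: each coordinate is observed in two planes.

    Assumption: simple averaging stands in for the fusion step.
    """
    x1, y1 = argmax_2d(xy)
    x2, z1 = argmax_2d(xz)
    y2, z2 = argmax_2d(yz)
    return ((x1 + x2) / 2, (y1 + y2) / 2, (z1 + z2) / 2)


# Toy volume (4 x 3 x 5 voxels) with a single peak at voxel (2, 1, 3).
nx, ny, nz = 4, 3, 5
volume = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
volume[2][1][3] = 1.0

xy, xz, yz = project_to_planes(volume)
joint = fuse_joint(xy, xz, yz)
print(joint)  # (2.0, 1.0, 3.0)
```

The sketch shows why the approach is cheap: the three planes hold on the order of N² cells each, instead of the N³ cells of the full volume, so any further processing on them scales with area rather than volume.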
H. Ye and W. Zhu: Equal contribution.
Acknowledgement
This work was supported in part by MOST-2018AAA0102004 and NSFC-62061136001.
1 Electronic supplementary material
Supplementary material 1 (mp4 14233 KB)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ye, H., Zhu, W., Wang, C., Wu, R., Wang, Y. (2022). Faster VoxelPose: Real-time 3D Human Pose Estimation by Orthographic Projection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_9
DOI: https://doi.org/10.1007/978-3-031-20068-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20067-0
Online ISBN: 978-3-031-20068-7
eBook Packages: Computer Science, Computer Science (R0)