Faster VoxelPose: Real-time 3D Human Pose Estimation by Orthographic Projection

Ye, Hang; Zhu, Wentao; Wang, Chunyu; Wu, Rujie; Wang, Yizhou

doi:10.1007/978-3-031-20068-7_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13666)

Included in the following conference series:

European Conference on Computer Vision

Abstract

While voxel-based methods have achieved promising results for multi-person 3D pose estimation from multiple cameras, they suffer from a heavy computational burden, especially for large scenes. We present Faster VoxelPose to address this challenge by re-projecting the feature volume onto the three two-dimensional coordinate planes and estimating the X, Y, and Z coordinates from them separately. To that end, we first localize each person with a 3D bounding box, estimating a 2D box and its height from the volume features projected onto the xy-plane and the z-axis, respectively. Then, for each person, we estimate partial joint coordinates from the three coordinate planes separately and fuse them to obtain the final 3D pose. The method is free of costly 3D-CNNs, improves the speed of VoxelPose tenfold, and achieves accuracy competitive with state-of-the-art methods, demonstrating its potential for real-time applications.
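The core idea of the abstract, replacing 3D processing with three 2D projections, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`project_orthographic`, `peak_2d`, `fuse_joint`), the use of max-pooling as the projection operator, the argmax peak picking, and the plain averaging used for fusion are all assumptions made for illustration, standing in for the learned 2D networks and fusion step that the paper describes.

```python
import numpy as np


def project_orthographic(volume):
    """Collapse an (X, Y, Z) volume onto the xy-, xz-, and yz-planes.

    Max-pooling along the dropped axis is an assumption made for this sketch.
    """
    return {
        "xy": volume.max(axis=2),  # shape (X, Y)
        "xz": volume.max(axis=1),  # shape (X, Z)
        "yz": volume.max(axis=0),  # shape (Y, Z)
    }


def peak_2d(plane):
    """Return the (row, col) index of the largest value in a 2D map."""
    return np.unravel_index(np.argmax(plane), plane.shape)


def fuse_joint(volume):
    """Estimate one joint's voxel index from the three 2D projections.

    Each plane yields two of the three coordinates, so every coordinate is
    estimated twice; the two estimates are simply averaged here (hypothetical
    stand-in for a learned fusion).
    """
    planes = project_orthographic(volume)
    x_xy, y_xy = peak_2d(planes["xy"])
    x_xz, z_xz = peak_2d(planes["xz"])
    y_yz, z_yz = peak_2d(planes["yz"])
    return np.array([
        (x_xy + x_xz) / 2.0,
        (y_xy + y_yz) / 2.0,
        (z_xz + z_yz) / 2.0,
    ])


# Toy volume with a single confident voxel at index (5, 2, 7).
toy = np.zeros((8, 6, 10))
toy[5, 2, 7] = 1.0
joint = fuse_joint(toy)
```

For this toy 8 × 6 × 10 volume, the three projections hold 48 + 80 + 60 = 188 cells in total, compared with 480 voxels in the full volume. The gap widens as the grid grows, which is the intuition behind avoiding 3D convolutions for large scenes.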

H. Ye and W. Zhu—Equal contribution.

Acknowledgement

This work was supported in part by MOST-2018AAA0102004 and NSFC-62061136001.

Author information

Authors and Affiliations

Yuanpei College, Peking University, Beijing, China
Hang Ye
Center on Frontiers of Computing Studies, Peking University, Beijing, China
Wentao Zhu, Rujie Wu & Yizhou Wang
School of Computer Science, Peking University, Beijing, China
Wentao Zhu, Rujie Wu & Yizhou Wang
Microsoft Research Asia, Beijing, China
Chunyu Wang
Institute for Artificial Intelligence, Peking University, Beijing, China
Yizhou Wang

Corresponding author

Correspondence to Chunyu Wang.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 14233 KB)

Supplementary material 2 (pdf 3407 KB)

About this paper

Cite this paper

Ye, H., Zhu, W., Wang, C., Wu, R., Wang, Y. (2022). Faster VoxelPose: Real-time 3D Human Pose Estimation by Orthographic Projection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_9

DOI: https://doi.org/10.1007/978-3-031-20068-7_9
Published: 11 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20067-0
Online ISBN: 978-3-031-20068-7
eBook Packages: Computer Science, Computer Science (R0)
