VTP: volumetric transformer for multi-view multi-person 3D pose estimation

Chen, Yuxing; Gu, Renshu; Huang, Ouhan; Jia, Gangyong

doi:10.1007/s10489-023-04805-z

Published: 25 August 2023

Applied Intelligence, volume 53, pages 26568–26579 (2023)

Abstract

This paper presents the Volumetric Transformer Pose Estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints in all camera views and directly learns spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions, flattened into sequential embeddings, and fed into a transformer. To reduce the memory cost, which is a major bottleneck for volumetric representations, VTP employs sparse Sinkhorn attention, which also achieves excellent performance. A residual structure further improves performance: the output of the transformer is concatenated with the 3D convolutional features. The proposed framework combines the high performance of the transformer with volumetric representations and can serve as an alternative to convolutional backbones. Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be available.
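The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes the common definitions: MPJPE as the mean Euclidean distance between matched predicted and ground-truth joints, and PCP as the fraction of limbs whose mean endpoint error is at most a fraction `alpha` of the ground-truth limb length (0.5 is a commonly used threshold, assumed here). The function names, the toy skeleton, and the limb list are all hypothetical.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between matched predicted and ground-truth joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def pcp(pred, gt, limbs, alpha=0.5):
    """Fraction of limbs whose mean endpoint error is <= alpha * limb length.

    `limbs` is a list of (joint_index_a, joint_index_b) pairs.
    The threshold alpha=0.5 is an assumption, not taken from the paper.
    """
    correct = 0
    for a, b in limbs:
        endpoint_error = (math.dist(pred[a], gt[a]) + math.dist(pred[b], gt[b])) / 2
        limb_length = math.dist(gt[a], gt[b])
        if endpoint_error <= alpha * limb_length:
            correct += 1
    return correct / len(limbs)


# Hypothetical toy skeleton: three joints (millimetres) connected by two limbs.
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 300.0), (0.0, 0.0, 600.0)]
pred = [(0.0, 30.0, 0.0), (0.0, 0.0, 300.0), (0.0, 400.0, 600.0)]
limbs = [(0, 1), (1, 2)]

error_mm = mpjpe(pred, gt)        # per-joint errors are 30, 0 and 400 mm
pcp_score = pcp(pred, gt, limbs)  # only the first limb passes the threshold
```

In this toy case one badly placed joint raises MPJPE sharply and also fails the limb it belongs to, which is why the two metrics are usually reported together: MPJPE measures average accuracy, while PCP counts how many body parts are acceptably localized.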

Data Availability

The datasets generated during the current study are available in the CMU Panoptic Dataset (domedb.perception.cs.cmu.edu), Campus (https://www.epfl.ch/labs/cvlab/data/data-pom-index-php/) and Shelf (http://campar.in.tum.de/Chair/MultiHumanPose).

Funding

This work was supported by the National Key Research and Development Program under Grant No. 2019YFC0118404, the National Natural Science Foundation of China under Grant No. U20A20386, the Zhejiang Provincial Science and Technology Program in China under Grant No. LQ22F020026, the Zhejiang Natural Science Foundation under Grant No. QY19E050003, the Zhejiang Key Research and Development Program under Grant Nos. 2023C03194 and 2020C01050, the Key Laboratory fund general project under Grant No. 6142110190406, the key open project of 32 CETC under Grant No. 22060207026, the Open project of State Key Laboratory of CAD & CG at Zhejiang University under Grant No. A2212, and the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant No. GK219909299001-028.

Author information

Authors and Affiliations

The School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
Yuxing Chen, Renshu Gu, Ouhan Huang & Gangyong Jia
Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai, China
Ouhan Huang

Corresponding authors

Correspondence to Renshu Gu or Gangyong Jia.

Ethics declarations

Conflicts of interest/Competing interests

Gangyong Jia has received research grants from the National Natural Science Foundation of China, the 32nd Research Institute of China Electronics Technology Group Corporation (CETC), the National University of Defense Technology, and the Zhejiang Provincial Natural Science Foundation of China. Renshu Gu has received research grants from the Zhejiang Provincial Natural Science Foundation of China, the State Key Laboratory of CAD & CG at Zhejiang University, and Hangzhou Dianzi University. The authors declare they have no non-financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Y., Gu, R., Huang, O. et al. VTP: volumetric transformer for multi-view multi-person 3D pose estimation. Appl Intell 53, 26568–26579 (2023). https://doi.org/10.1007/s10489-023-04805-z

Accepted: 16 June 2023
Published: 25 August 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s10489-023-04805-z
