
VTP: volumetric transformer for multi-view multi-person 3D pose estimation


Abstract

This paper presents the Volumetric Transformer Pose Estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints across all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve the performance. In addition, sparse Sinkhorn attention is employed to reduce the memory cost, which is a major bottleneck for volumetric representations, while still achieving excellent performance. The output of the transformer is again combined with the 3D convolutional features through a residual design. The proposed VTP framework integrates the high performance of the transformer with volumetric representations and can serve as a strong alternative to convolutional backbones. Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be made available.
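To make the pipeline described above concrete, the following is a minimal, hypothetical PyTorch sketch of the abstract's processing steps: 3D convolutions over the aggregated voxel features, flattening into sequential embeddings, a transformer over the voxel tokens, and residual fusion with the convolutional features. All module and parameter names are invented for illustration, a standard transformer encoder stands in for the paper's sparse Sinkhorn attention blocks, and the residual fusion is simplified to an addition; this is not the authors' implementation.

```python
# Hypothetical sketch of the VTP pipeline described in the abstract.
# Names, shapes, and the use of a standard transformer encoder (in place of
# sparse Sinkhorn attention) are illustrative assumptions.
import torch
import torch.nn as nn

class VTPSketch(nn.Module):
    def __init__(self, num_joints=15, embed_dim=128, depth=4, heads=8):
        super().__init__()
        # 3D convolutions over features aggregated into the voxel space
        self.conv3d = nn.Sequential(
            nn.Conv3d(num_joints, embed_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(embed_dim),
            nn.ReLU(inplace=True),
        )
        # Standard transformer encoder as a stand-in for the
        # sparse-Sinkhorn-attention blocks used in the paper.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Conv3d(embed_dim, num_joints, kernel_size=1)

    def forward(self, voxel_feats):
        # voxel_feats: (B, J, X, Y, Z) features aggregated from 2D keypoint
        # heatmaps of all camera views into the shared 3D voxel grid.
        x = self.conv3d(voxel_feats)                 # (B, D, X, Y, Z)
        b, d, X, Y, Z = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, X*Y*Z, D) voxel tokens
        tokens = self.transformer(tokens)            # attention across all voxels
        y = tokens.transpose(1, 2).reshape(b, d, X, Y, Z)
        y = y + x                                    # simplified residual fusion
        return self.head(y)                          # per-joint 3D volumes

# Example: 15 joint channels on a 16^3 voxel grid
out = VTPSketch()(torch.randn(1, 15, 16, 16, 16))
```

The quadratic cost of attention over X*Y*Z voxel tokens is exactly the memory bottleneck the abstract mentions, which is why the paper replaces dense attention with sparse Sinkhorn attention.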




Data Availability

The datasets generated during the current study are available from the CMU Panoptic Dataset (domedb.perception.cs.cmu.edu), Campus (https://www.epfl.ch/labs/cvlab/data/data-pom-index-php/) and Shelf (http://campar.in.tum.de/Chair/MultiHumanPose) repositories.


Funding

This work was supported by the National Key Research and Development Program under Grant No. 2019YFC0118404, the National Natural Science Foundation of China under Grant No. U20A20386, the Zhejiang Provincial Science and Technology Program under Grant No. LQ22F020026, the Zhejiang Natural Science Foundation under Grant No. QY19E050003, the Zhejiang Key Research and Development Program under Grants No. 2023C03194 and No. 2020C01050, the Key Laboratory fund general project under Grant No. 6142110190406, the key open project of the 32nd Research Institute of CETC under Grant No. 22060207026, the Open Project of the State Key Laboratory of CAD & CG at Zhejiang University under Grant No. A2212, and the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant No. GK219909299001-028.

Author information


Corresponding authors

Correspondence to Renshu Gu or Gangyong Jia.

Ethics declarations

Conflicts of interest/Competing interests

Gangyong Jia has received research grants from the National Natural Science Foundation of China, the 32nd Research Institute of China Electronics Technology Group Corporation (CETC), the National University of Defense Technology, and the Zhejiang Provincial Natural Science Foundation of China. Renshu Gu has received research grants from the Zhejiang Provincial Natural Science Foundation of China, the State Key Laboratory of CAD & CG at Zhejiang University, and Hangzhou Dianzi University. The authors declare that they have no relevant non-financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, Y., Gu, R., Huang, O. et al. VTP: volumetric transformer for multi-view multi-person 3D pose estimation. Appl Intell 53, 26568–26579 (2023). https://doi.org/10.1007/s10489-023-04805-z
