\(ST^2PE\): Spatial and Temporal Transformer for Pose Estimation

  • Conference paper

Artificial Neural Networks and Machine Learning – ICANN 2022 (ICANN 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13530)

Abstract

Spatial-temporal information and multi-view mutual promotion are two keys to resolving depth uncertainty in 2D-to-3D human pose estimation. However, each key alone requires a large model, so previous methods had to choose one of the two. Thanks to the transformer’s powerful long-range relationship modeling and its latent ability to adapt one view to another, we can exploit both with an acceptable number of model parameters. To the best of our knowledge, we are the first to propose a multi-view, multi-frame method for 2D-to-3D human pose estimation, called \(ST^2PE\). The proposed method combines a Single-view Spatial and Temporal Transformer (\(S{-}\textrm{ST}^2\)) with a Multiple Cross-view Transformer-based Transmission (\(M{-}\textrm{CT}^2\)) module. Inspired by the structure of Transformer in Transformer, the single-view spatial and temporal transformer consists of two nested transformer modules: the inner one extracts spatial information between joints, while the outer one extracts temporal information between frames. The cross-view transformer-based transmission module extends the Transformer’s original encoder-decoder setting to a multi-view setting. Experiments on three mainstream datasets (Human3.6M, HumanEva, MPI-INF-3DHP) demonstrate that \(ST^2PE\) achieves state-of-the-art performance among 2D-to-3D methods.
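
To make the architecture concrete, the following is a minimal PyTorch sketch of the two modules as the abstract describes them. Every name, shape, and hyperparameter here (17 joints, 64-dimensional tokens, 2 encoder layers, the residual fusion rule) is an illustrative assumption, not the authors' implementation.

# S-ST^2 sketch: Transformer-in-Transformer-style nesting. The inner
# (spatial) encoder attends across joints within a frame; the outer
# (temporal) encoder attends across frames within a clip.
import torch
import torch.nn as nn

class SingleViewST2(nn.Module):
    def __init__(self, num_joints=17, dim=64, depth=2, heads=4):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.spatial = make_encoder()    # inner module: tokens are joints
        self.temporal = make_encoder()   # outer module: tokens are frames
        self.joint_embed = nn.Linear(2, dim)                # lift 2D joints to dim
        self.frame_proj = nn.Linear(num_joints * dim, dim)  # pool joints into a frame token

    def forward(self, x):                # x: (B, T, J, 2) 2D pose sequence
        B, T, J, _ = x.shape
        tokens = self.joint_embed(x).view(B * T, J, -1)
        tokens = self.spatial(tokens)    # joint-to-joint attention per frame
        frames = self.frame_proj(tokens.reshape(B, T, -1))
        return self.temporal(frames)     # frame-to-frame attention per clip

The cross-view module can be sketched in the same hedged spirit: the encoder-decoder setting is extended to N views by letting each view's features query every other view's features through cross-attention.

class CrossViewTransmission(nn.Module):
    # M-CT^2 sketch: pairwise cross-attention between views with residual fusion.
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, views):            # views: list of (B, T, dim) tensors
        fused = []
        for i, query in enumerate(views):
            out = query
            for j, memory in enumerate(views):
                if i == j:
                    continue             # a view does not transmit to itself
                attended, _ = self.cross_attn(out, memory, memory)
                out = self.norm(out + attended)   # fuse view j into view i
            fused.append(out)
        return fused

A hypothetical end-to-end use, assuming four synchronized cameras and 27-frame clips of 17 2D joints:

views = [torch.randn(8, 27, 17, 2) for _ in range(4)]         # 4 camera views
encoder = SingleViewST2()
fused = CrossViewTransmission()([encoder(v) for v in views])  # one fused (8, 27, 64) tensor per view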

References

  1. Cai, Y., et al.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2272–2281 (2019)

  2. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112 (2018)

  3. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. arXiv preprint arXiv:2103.00112 (2021)

  4. He, Y., Yan, R., Fragkiadaki, K., Yu, S.I.: Epipolar transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7779–7788 (2020)

  5. Huang, F., Zeng, A., Liu, M., Lai, Q., Xu, Q.: DeepFuse: an IMU-aware network for real-time 3D human pose estimation from multi-view image. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 429–438 (2020)

  6. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2014)

  7. Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y.: Learnable triangulation of human pose. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

  8. Li, W., Liu, H., Ding, R., Liu, M., Wang, P.: Lifting transformer for 3D human pose estimation in video. arXiv preprint arXiv:2103.14304 (2021)

  9. Liu, K., Ding, R., Zou, Z., Wang, L., Tang, W.: A comprehensive study of weight sharing in graph networks for 3D human pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 318–334. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_19

  10. Moon, G., Lee, K.M.: I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 752–768. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_44

  11. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  12. Reddy, N.D., Guigues, L., Pishchulin, L., Eledath, J., Narasimhan, S.G.: TesseTrack: end-to-end learnable multi-person articulated 3D pose tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15190–15200 (2021)

  13. Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017

  14. Tome, D., Toso, M., Agapito, L., Russell, C.: Rethinking pose in 3D: multi-stage refinement and recovery for markerless motion capture. In: 2018 International Conference on 3D Vision (3DV), pp. 474–483. IEEE (2018)

  15. Tsai, Y.H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, p. 6558 (2019)

  16. Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided 3D pose estimation from videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp. 764–780. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0_45

  17. Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic graph convolutional networks for 3D human pose regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3425–3435 (2019)

  18. Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z.: 3D human pose estimation with spatial and temporal transformers. arXiv preprint arXiv:2103.10455 (2021)

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 62176064) and Zhejiang Lab (No. 2019KD0AB06).

Author information

Correspondence to Cheng Jin.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wu, Y., Cai, Y., Feng, R., Jin, C. (2022). \(ST^2PE\): Spatial and Temporal Transformer for Pose Estimation. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13530. Springer, Cham. https://doi.org/10.1007/978-3-031-15931-2_55

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-15931-2_55

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15930-5

  • Online ISBN: 978-3-031-15931-2

  • eBook Packages: Computer Science, Computer Science (R0)
