Abstract
This paper presents a parameter-free method for 3D human pose estimation via a Laplacian decomposition-based transformer. Transformer-based approaches effectively estimate non-local interactions between the 3D mesh vertices of the whole body, and graph models have recently been embedded into transformers to account for neighborhood interactions in the kinematic topology. Although this combination has brought remarkable progress in 3D human pose estimation, scale-aware relationships between body parts have not been sufficiently explored in the literature. To address this gap, we propose applying a Laplacian pyramid module to the transformer, which decomposes encoded features into Laplacian residuals at different scale spaces. Furthermore, we compute self-attention separately for each body part to generate more natural human poses. Experimental results on benchmark datasets show that the proposed method successfully improves the performance of 3D human pose estimation. The code and model are publicly available at: https://github.com/DCVL-3D/Laphormer_release.
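The core operation the abstract describes, decomposing encoded features into Laplacian residuals at different scale spaces, can be illustrated with a generic, parameter-free Laplacian decomposition. The sketch below is an assumption-laden illustration, not the paper's actual module: the pairwise average pooling, nearest-neighbor upsampling, and number of levels are illustrative choices, and the released code at the URL above should be consulted for the authors' exact design.

```python
import numpy as np

def laplacian_pyramid(features, levels=3):
    """Decompose an (N, C) feature matrix into Laplacian residuals at
    progressively coarser scales. Generic sketch only: pooling factor,
    upsampling scheme, and level count are illustrative assumptions.
    Assumes N is divisible by 2**(levels - 1)."""
    residuals = []
    current = features
    for _ in range(levels - 1):
        n = current.shape[0]
        # Coarsen by average-pooling adjacent pairs of tokens/vertices.
        coarse = current.reshape(n // 2, 2, -1).mean(axis=1)
        # Upsample back to the finer resolution by nearest-neighbor repetition.
        up = np.repeat(coarse, 2, axis=0)
        # The Laplacian residual keeps the detail lost by coarsening.
        residuals.append(current - up)
        current = coarse
    residuals.append(current)  # coarsest approximation closes the pyramid
    return residuals

def reconstruct(residuals):
    """Invert the decomposition: upsample and add residuals back, fine to coarse."""
    current = residuals[-1]
    for res in reversed(residuals[:-1]):
        current = np.repeat(current, 2, axis=0) + res
    return current

feats = np.random.randn(16, 8)        # e.g. 16 tokens with 8-dim features
pyr = laplacian_pyramid(feats, levels=3)
recon = reconstruct(pyr)
assert np.allclose(recon, feats)      # the decomposition is exactly invertible
```

Because each residual stores only the detail lost at one coarsening step, the pyramid separates fine-scale structure from coarse-scale structure, which is what makes scale-aware processing of the encoded features possible; the residuals sum back to the original features exactly.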
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-02084, eXtended Reality and Volumetric media generation and transmission technology for immersive experience sharing in noncontact environment with a Korea-EU international cooperative research).
Additional information
Communicated by J. Gao.
Cite this article
Kim, J., Kwon, H., Lim, S.Y. et al. Learning scale-aware relationships via Laplacian decomposition-based transformer for 3D human pose estimation. Multimedia Systems 30, 20 (2024). https://doi.org/10.1007/s00530-023-01216-5