Learning scale-aware relationships via Laplacian decomposition-based transformer for 3D human pose estimation

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

This paper presents a parameter-free method for 3D human pose estimation via a Laplacian decomposition-based transformer. Transformer-based approaches effectively estimate non-local interactions between the 3D mesh vertices of the whole body, and graph models have also begun to be embedded into the transformer to account for neighborhood interactions in the kinematic topology. Although this combination has shown remarkable progress in 3D human pose estimation, scale-aware relationships between body parts have not been sufficiently explored in the literature. To address this, we propose applying a Laplacian pyramid module to the transformer, which decomposes encoded features into Laplacian residuals of different scale spaces. Furthermore, we compute self-attention separately for each body part to generate more natural human poses. Experimental results on benchmark datasets show that the proposed method successfully improves the performance of 3D human pose estimation. The code and model are publicly available at: https://github.com/DCVL-3D/Laphormer_release.
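
To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation: the pooling factors, body-part grouping, and module names are illustrative assumptions. It shows (i) decomposing encoded joint tokens into Laplacian residuals over progressively coarser scale spaces and (ii) computing self-attention separately for each body-part group of tokens.

```python
# Hedged sketch only: layout (B, J, C) joint tokens, pooling factor 2, and the
# part grouping below are assumptions, not details taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def laplacian_residuals(tokens, num_levels=3):
    """Decompose encoded features (B, J, C) into Laplacian residuals of
    progressively coarser scale spaces along the token axis."""
    residuals = []
    current = tokens
    for _ in range(num_levels - 1):
        # Coarser scale: average-pool pairs of tokens in (B, C, J) layout.
        coarse = F.avg_pool1d(current.transpose(1, 2), kernel_size=2, ceil_mode=True)
        # Upsample back to the current resolution and keep the residual.
        up = F.interpolate(coarse, size=current.shape[1], mode="linear", align_corners=False)
        residuals.append(current - up.transpose(1, 2))
        current = coarse.transpose(1, 2)
    residuals.append(current)  # coarsest approximation closes the pyramid
    return residuals


class PartWiseSelfAttention(nn.Module):
    """Self-attention computed separately for each body-part group of joint tokens."""

    def __init__(self, dim, part_indices, num_heads=4):
        super().__init__()
        self.part_indices = part_indices  # e.g., [[0, 1, 2, 3], [4, 5, 6], ...] (assumed grouping)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):  # tokens: (B, J, C)
        out = tokens.clone()
        for idx in self.part_indices:
            part = tokens[:, idx, :]
            attended, _ = self.attn(part, part, part)
            out[:, idx, :] = attended
        return out


if __name__ == "__main__":
    feats = torch.randn(2, 14, 64)                      # 2 samples, 14 joint tokens, 64 channels
    pyramid = laplacian_residuals(feats, num_levels=3)  # fine-to-coarse residuals + coarsest base
    parts = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10], [11, 12, 13]]  # assumed body-part grouping
    attn = PartWiseSelfAttention(dim=64, part_indices=parts)
    refined = attn(pyramid[0])                          # part-wise attention on the finest residual
    print([r.shape for r in pyramid], refined.shape)
```

In this sketch the residuals isolate what each scale adds over the next coarser one, which is the role the abstract assigns to the Laplacian pyramid module; the per-part attention restricts interactions to tokens within the same group before any later cross-part mixing.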

Data availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-02084, eXtended Reality and Volumetric media generation and transmission technology for immersive experience sharing in noncontact environment with a Korea-EU international cooperative research).

Author information

Corresponding author

Correspondence to Wonjun Kim.

Additional information

Communicated by J. Gao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, J., Kwon, H., Lim, S.Y. et al. Learning scale-aware relationships via Laplacian decomposition-based transformer for 3D human pose estimation. Multimedia Systems 30, 20 (2024). https://doi.org/10.1007/s00530-023-01216-5

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-023-01216-5

Keywords
