Abstract
This paper presents a parameter-free method for 3D human pose estimation via a Laplacian decomposition-based transformer. Transformer-based approaches effectively estimate non-local interactions between the 3D mesh vertices of the whole body, and graph models have recently been embedded into transformers to account for neighborhood interactions in the kinematic topology. Although this combination has brought remarkable progress in 3D human pose estimation, scale-aware relationships between body parts have not been sufficiently explored in the literature. To address this gap, we propose applying a Laplacian pyramid module to the transformer, which decomposes encoded features into Laplacian residuals at different scale spaces. Furthermore, we compute self-attention separately for each body part to generate more natural human poses. Experimental results on benchmark datasets show that the proposed method successfully improves the performance of 3D human pose estimation. The code and model are publicly available at: https://github.com/DCVL-3D/Laphormer_release.
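The core operation the abstract describes, decomposing encoded features into Laplacian residuals at different scale spaces, can be illustrated with a generic, parameter-free Laplacian decomposition. The sketch below is an assumption-laden illustration, not the paper's actual module: the pairwise average pooling, nearest-neighbor upsampling, and number of levels are illustrative choices, and the released code at the URL above should be consulted for the authors' exact design.

```python
import numpy as np

def laplacian_pyramid(features, levels=3):
    """Decompose an (N, C) feature matrix into Laplacian residuals at
    progressively coarser scales. Generic sketch only: pooling factor,
    upsampling scheme, and level count are illustrative assumptions.
    Assumes N is divisible by 2**(levels - 1)."""
    residuals = []
    current = features
    for _ in range(levels - 1):
        n = current.shape[0]
        # Coarsen by average-pooling adjacent pairs of tokens/vertices.
        coarse = current.reshape(n // 2, 2, -1).mean(axis=1)
        # Upsample back to the finer resolution by nearest-neighbor repetition.
        up = np.repeat(coarse, 2, axis=0)
        # The Laplacian residual keeps the detail lost by coarsening.
        residuals.append(current - up)
        current = coarse
    residuals.append(current)  # coarsest approximation closes the pyramid
    return residuals

def reconstruct(residuals):
    """Invert the decomposition: upsample and add residuals back, fine to coarse."""
    current = residuals[-1]
    for res in reversed(residuals[:-1]):
        current = np.repeat(current, 2, axis=0) + res
    return current

feats = np.random.randn(16, 8)        # e.g. 16 tokens with 8-dim features
pyr = laplacian_pyramid(feats, levels=3)
recon = reconstruct(pyr)
assert np.allclose(recon, feats)      # the decomposition is exactly invertible
```

Because each residual stores only the detail lost at one coarsening step, the pyramid separates fine-scale structure from coarse-scale structure, which is what makes scale-aware processing of the encoded features possible; the residuals sum back to the original features exactly.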
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-02084, eXtended Reality and Volumetric media generation and transmission technology for immersive experience sharing in noncontact environment with a Korea-EU international cooperative research).
Additional information
Communicated by J. Gao.
Cite this article
Kim, J., Kwon, H., Lim, S.Y. et al. Learning scale-aware relationships via Laplacian decomposition-based transformer for 3D human pose estimation. Multimedia Systems 30, 20 (2024). https://doi.org/10.1007/s00530-023-01216-5