FaSRnet: a feature and semantics refinement network for human pose estimation

  • Research Article
  • Published:
Frontiers of Information Technology & Electronic Engineering

Abstract

Multi-frame human pose estimation is a challenging task due to factors such as motion blur, video defocus, and occlusion. Exploiting the temporal consistency between consecutive frames is an effective way to address this issue. Most current methods exploit temporal consistency by refining the final heatmaps. The heatmaps contain the semantic information of keypoints and can improve detection quality to a certain extent; however, they are generated from features, and feature-level refinement is rarely considered. In this paper, we propose a human pose estimation framework that performs refinement at both the feature and semantic levels. At the feature level, we align auxiliary features with the features of the current frame to reduce the loss caused by different feature distributions, and then use an attention mechanism to fuse the auxiliary features with the current features. At the semantic level, we use the differences between adjacent heatmaps as auxiliary information to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate its effectiveness.
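To make the two refinement stages described above concrete, the following is a minimal PyTorch sketch of the idea. It is written under stated assumptions: the module names (FeatureRefine, SemanticsRefine), the default channel and keypoint counts, and the simple convolution-based alignment and channel-attention fusion are illustrative choices, not the released FaSRnet implementation (available at https://github.com/Elvis-Aron/FaSRnet).

```python
# Hypothetical sketch only: module names, channel/joint counts, and the simple
# convolutional alignment + channel-attention fusion are illustrative assumptions,
# not the authors' released code (https://github.com/Elvis-Aron/FaSRnet).
import torch
import torch.nn as nn


class FeatureRefine(nn.Module):
    """Feature-level refinement: align auxiliary-frame features to the current
    frame, then fuse them with the current features via channel attention."""

    def __init__(self, channels: int = 48):  # example channel count
        super().__init__()
        # Stand-in alignment step: predict aligned auxiliary features from the
        # concatenation of current and auxiliary features.
        self.align = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # Channel-attention weights applied to the aligned auxiliary features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, cur_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        aligned = self.align(torch.cat([cur_feat, aux_feat], dim=1))
        weights = self.attn(torch.cat([cur_feat, aligned], dim=1))
        return self.fuse(torch.cat([cur_feat, weights * aligned], dim=1))


class SemanticsRefine(nn.Module):
    """Semantic-level refinement: use differences between adjacent-frame heatmaps
    as auxiliary cues to correct the current frame's heatmaps."""

    def __init__(self, num_joints: int = 15):  # example keypoint count
        super().__init__()
        self.refine = nn.Conv2d(3 * num_joints, num_joints, kernel_size=3, padding=1)

    def forward(self, cur_hm, prev_hm, next_hm):
        diff_prev = cur_hm - prev_hm   # change relative to the previous frame
        diff_next = next_hm - cur_hm   # change toward the next frame
        residual = self.refine(torch.cat([cur_hm, diff_prev, diff_next], dim=1))
        return cur_hm + residual       # refined heatmaps


if __name__ == "__main__":
    cur, aux = torch.randn(1, 48, 96, 72), torch.randn(1, 48, 96, 72)
    print(FeatureRefine(48)(cur, aux).shape)      # torch.Size([1, 48, 96, 72])
    hm = torch.randn(1, 15, 96, 72)
    print(SemanticsRefine(15)(hm, hm, hm).shape)  # torch.Size([1, 15, 96, 72])
```

In this sketch, the semantic stage adds a learned residual computed from heatmap differences, mirroring the abstract's description of using difference information between adjacent heatmaps; the actual network architecture is described in the full paper.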

Data availability

The code is available at https://github.com/Elvis-Aron/FaSRnet. The other data that support the findings of this study are available from the corresponding author upon reasonable request.

Author information

Contributions

Yuanhong ZHONG designed the research. Qianfeng XU and Daidi ZHONG processed the data. Yuanhong ZHONG, Qianfeng XU, and Daidi ZHONG drafted the paper. Xun YANG and Shanshan WANG helped organize the paper. All the authors revised and finalized the paper.

Corresponding author

Correspondence to Yuanhong Zhong (仲元红).

Ethics declarations

All the authors declare that they have no conflict of interest.

Additional information

Project supported by the National Key Research and Development Program of China (Nos. 2021YFC2009200 and 2023YFC3606100) and the Special Project of Technological Innovation and Application Development of Chongqing, China (No. cstc2019jscx-msxmX0167)

About this article

Cite this article

Zhong, Y., Xu, Q., Zhong, D. et al. FaSRnet: a feature and semantics refinement network for human pose estimation. Front Inform Technol Electron Eng 25, 513–526 (2024). https://doi.org/10.1631/FITEE.2200639

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1631/FITEE.2200639

Key words

CLC number
