Abstract
Existing works on 2D pose estimation mainly focus on a certain category, e.g. human, animal, and vehicle. However, there are lots of application scenarios that require detecting the poses/keypoints of the unseen class of objects. In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition. To achieve this goal, we formulate the pose estimation problem as a keypoint matching problem and design a novel CAPE framework, termed POse Matching Network (POMNet). A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images. We also introduce Multi-category Pose (MP-100) dataset, which is a 2D pose dataset of 100 object categories containing over 20K instances and is well-designed for developing CAPE algorithms. Experiments show that our method outperforms other baseline approaches by a large margin. Codes and data are available at https://github.com/luminxu/Pose-for-Everything.
L. Xu and S. Jin—Equal contribution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Andriluka, M., et al.: PoseTrack: a benchmark for human pose estimation and tracking. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: IEEE Conference on Computer Vision and Pattern Recognition (2014)
Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2D & 3D face alignment problem? In: International Conference on Computer Vision (2017)
Cao, J., Tang, H., Fang, H.S., Shen, X., Lu, C., Tai, Y.W.: Cross-domain adaptation for animal pose estimation. In: International Conference on Computer Vision (2019)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (2020)
Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A.L., Wang, X.: Multi-context attention for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
Contributors, M.: OpenMMlab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose (2020)
Duan, H., Lin, K.Y., Jin, S., Liu, W., Qian, C., Ouyang, W.: TRB: a novel triplet representation for understanding 2D human body. In: International Conference on Computer Vision (2019)
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)
Ge, Y., Zhang, R., Wang, X., Tang, X., Luo, P.: DeepFashion2: a versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Graving, J.M., et al.: DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. Elife 8, e47994 (2019)
Hariharan, B., Girshick, R.: Low-shot visual recognition by shrinking and hallucinating features. In: International Conference on Computer Vision (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: International Conference on Computer Vision (2013)
Jiang, S., Liang, S., Chen, C., Zhu, Y., Li, X.: Class agnostic image common object detection. IEEE Trans. Image Process. 28(6), 2836–2846 (2019)
Jin, S., Liu, W., Ouyang, W., Qian, C.: Multi-person articulated tracking with spatial and temporal embeddings. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Jin, S., et al.: Differentiable hierarchical graph grouping for multi-person pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 718–734. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_42
Jin, S., et al.: Towards multi-person pose tracking: bottom-up and top-down methods. In: International Conference on Computer Vision Workshop (2017)
Jin, S., et al.: Whole-body human pose estimation in the wild. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 196–214. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_12
Khan, M.H., et al.: AnimalWeb: a large-scale hierarchical dataset of annotated animal faces. In: IEEE Conference on Computer Vision and Pattern Recognition (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representions (2015)
Kostinger, M., Wohlhart, P., Roth, P., Bischof, H.: Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In: International Conference on Computer Vision Workshop (2011)
Labuguen, R., et al.: MacaquePose: a novel “in the wild” macaque monkey pose dataset for markerless motion capture. Front. Behav. Neurosci. 14, 581154 (2021)
Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: International Conference on Computer Vision (2021)
Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Li, S., Li, J., Tang, H., Qian, R., Lin, W.: ATRW: a benchmark for amur tiger re-identification in the wild. In: ACM International Conference on Multimedia (2020)
Li, Y., et al.: TokenPose: learning keypoint tokens for human pose estimation. arXiv preprint arXiv:2104.03516 (2021)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lu, E., Xie, W., Zisserman, A.: Class-agnostic counting. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 669–684. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20893-6_42
Lu, J., Gong, P., Ye, J., Zhang, C.: Learning from very few samples: a survey. arXiv preprint arXiv:2009.02653 (2020)
Mao, W., Ge, Y., Shen, C., Tian, Z., Wang, X., Wang, Z.: TFPose: direct human pose estimation with transformers. arXiv preprint arXiv:2103.15320 (2021)
Mathis, A., et al.: Pretraining boosts out-of-domain robustness for pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021)
Moon, G., Yu, S.-I., Wen, H., Shiratori, T., Lee, K.M.: InterHand2.6M: a dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 548–564. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_33
Mueller, F., et al.: Ganerated hands for real-time 3D hand tracking from monocular RGB. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., Theobalt, C.: Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In: International Conference on Computer Vision (2017)
Nakamura, A., Harada, T.: Revisiting fine-tuning for few-shot learning. arXiv preprint arXiv:1910.00216 (2019)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Nie, X., Feng, J., Zhang, J., Yan, S.: Single-stage multi-person pose machines. In: International Conference on Computer Vision (2019)
Parmar, N., et al.: Image transformer. In: ICML (2018)
Pereira, T.D., et al.: Fast animal pose estimation using deep neural networks. Nat. Methods 16, 117–125 (2019)
Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: International Conference on Learning Representions (2017)
Reddy, N.D., Vo, M., Narasimhan, S.G.: CarFusion: combining point tracking and part detection for dynamic 3D reconstruction of vehicles. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: database and results. Image Vision Comput. 47, 3–18 (2016)
Shen, J., Zafeiriou, S., Chrysos, G.G., Kossaifi, J., Tzimiropoulos, G., Pantic, M.: The first facial landmark tracking in-the-wild challenge: Benchmark and results. In: International Conference on Computer Vision Workshop (2015)
Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advance Neural Information and Processing Systems (2017)
Song, X., et al.: ApolloCar3D: a large 3D car instance understanding benchmark for autonomous driving. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: International Conference on Computer Vision (2017)
Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (2014)
Vaswani, A., et al.: Attention is all you need. In: Advance Neural Information and Processing Systems (2017)
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: Advance Neural Information and Processing Systems (2016)
Wang, Y., Peng, C., Liu, Y.: Mask-pose cascaded CNN for 2D hand pose estimation from single color image. IEEE Trans. Circ. Syst. Video Technol. 29, 3258–3268 (2018)
Wang, Y.X., Girshick, R., Hebert, M., Hariharan, B.: Low-shot learning from imaginary data. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Welinder, P., et al.: Caltech-UCSD Birds 200. Technical report CNS-TR-2010-001, California Institute of Technology (2010)
Wu, J., et al.: AI challenger: a large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475 (2017)
Wu, J., et al.: Single image 3D interpreter network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 365–382. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_22
Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., Zhou, Q.: Look at boundary: a boundary-aware face alignment algorithm. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_29
Xu, L., et al.: ViPNAS: efficient video pose estimation via neural architecture search. In: IEEE Conference on Computer Vision and Pattern Recognition (2021)
Yang, F.S.Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: relation network for few-shot learning. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: towards explainable human pose estimation by transformer. arXiv preprint arXiv:2012.14214 (2020)
Yang, S.D., Su, H.T., Hsu, W.H., Chen, W.C.: Class-agnostic few-shot object counting. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021)
Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., Tao, D.: AP-10k: a benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617 (2021)
Yuan, Y., et al.: HRFormer: high-resolution transformer for dense prediction. arXiv preprint arXiv:2110.09408 (2021)
Zafeiriou, S., Trigeorgis, G., Chrysos, G., Deng, J., Shen, J.: The Menpo facial landmark localisation challenge: a step towards the solution. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop (2017)
Zeng, W., et al.: Not all tokens are equal: human-centric visual analysis via token clustering transformer. In: IEEE Conference on Computer Vision and Pattern Recognition (2022)
Zhang, C., Lin, G., Liu, F., Yao, R., Shen, C.: CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5217–5226 (2019)
Zhang, S.H., et al.: Pose2Seg: detection free human instance segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., Feng, J.: Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In: ACM International Conference on Multimedia (2018)
Zhou, X., Karpur, A., Luo, L., Huang, Q.: StarMap for category-agnostic keypoint and viewpoint estimation. In: European Conference on Computer Vision, pp. 318–334 (2018)
Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: International Conference on Computer Vision (2017)
Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In: International Conference on Computer Vision (2019)
Acknowledgement
This work is supported in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants (Nos. 14202217, 14203118, 14208619), in part by Research Impact Fund Grant No. R5001-18. Ping Luo is supported by the General Research Fund of HK No. 27208720, No. 17212120, and No. 17200622. Wanli Ouyang is supported by the Australian Research Council Grant DP200103223, Australian Medical Research Future Fund MRFAI000085, CRC-P Smart Material Recovery Facility (SMRF) - Curby Soft Plastics, and CRC-P ARIA - Bionic Visual-Spatial Prosthesis for the Blind.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, L. et al. (2022). Pose for Everything: Towards Category-Agnostic Pose Estimation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_23
Download citation
DOI: https://doi.org/10.1007/978-3-031-20068-7_23
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20067-0
Online ISBN: 978-3-031-20068-7
eBook Packages: Computer ScienceComputer Science (R0)