Pose for Everything: Towards Category-Agnostic Pose Estimation

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Existing works on 2D pose estimation mainly focus on a specific category, e.g., humans, animals, or vehicles. However, many application scenarios require detecting the poses/keypoints of object classes unseen during training. In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definitions. To achieve this goal, we formulate pose estimation as a keypoint matching problem and design a novel CAPE framework, termed POse Matching Network (POMNet). A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images. We also introduce the Multi-category Pose (MP-100) dataset, a 2D pose dataset covering 100 object categories with over 20K instances, designed specifically for developing CAPE algorithms. Experiments show that our method outperforms baseline approaches by a large margin. Code and data are available at https://github.com/luminxu/Pose-for-Everything.
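Since this page carries only the abstract, the following is a minimal sketch of how the keypoint-matching formulation it describes could be realized in PyTorch. The class names (KeypointInteractionModule, MatchingHead), tensor shapes, and the cosine-similarity head are illustrative assumptions chosen for exposition, not the authors' released implementation; the real code lives in the repository linked above.

```python
# Illustrative sketch of a keypoint-matching pose estimator in the spirit of
# the abstract. All names, shapes, and design details are assumptions; consult
# the authors' repository for the actual POMNet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeypointInteractionModule(nn.Module):
    """Hypothetical transformer block: lets support-keypoint tokens attend to
    each other and to query-image patch tokens (cf. the KIM in the abstract)."""

    def __init__(self, dim=256, depth=3, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, kpt_tokens, patch_tokens):
        # Encode K keypoint tokens jointly with H*W patch tokens so that both
        # keypoint-keypoint and keypoint-image interactions are captured.
        k = kpt_tokens.shape[1]
        x = torch.cat([kpt_tokens, patch_tokens], dim=1)   # (B, K + HW, C)
        x = self.encoder(x)
        return x[:, :k], x[:, k:]                          # refined tokens


class MatchingHead(nn.Module):
    """Scores every query location against every refined keypoint token."""

    def forward(self, kpt_tokens, patch_tokens, hw):
        h, w = hw
        # Cosine similarity between keypoint and patch tokens yields one
        # heatmap per keypoint; the argmax gives the predicted location.
        sim = torch.einsum(
            'bkc,bnc->bkn',
            F.normalize(kpt_tokens, dim=-1),
            F.normalize(patch_tokens, dim=-1))
        return sim.view(sim.shape[0], sim.shape[1], h, w)  # (B, K, H, W)


if __name__ == '__main__':
    B, K, H, W, C = 2, 17, 16, 16, 256
    # In the real task, kpt_tokens would be support-image features sampled at
    # the annotated keypoints; random stand-ins are used here.
    kpt = torch.randn(B, K, C)
    patches = torch.randn(B, H * W, C)

    kim = KeypointInteractionModule(dim=C)
    head = MatchingHead()
    kpt_r, patches_r = kim(kpt, patches)
    heatmaps = head(kpt_r, patches_r, (H, W))
    print(heatmaps.shape)  # torch.Size([2, 17, 16, 16])
```

The appeal of the matching framing, as the abstract argues, is that the keypoint set is not fixed at training time: any number of support-defined keypoint tokens can be scored against the query features, which is what makes the approach category-agnostic.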

L. Xu and S. Jin—Equal contribution.



Acknowledgement

This work is supported in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants Nos. 14202217, 14203118, and 14208619, and in part by Research Impact Fund Grant No. R5001-18. Ping Luo is supported by the General Research Fund of Hong Kong, Nos. 27208720, 17212120, and 17200622. Wanli Ouyang is supported by Australian Research Council Grant DP200103223, Australian Medical Research Future Fund MRFAI000085, CRC-P Smart Material Recovery Facility (SMRF) - Curby Soft Plastics, and CRC-P ARIA - Bionic Visual-Spatial Prosthesis for the Blind.

Author information

Corresponding author

Correspondence to Lumin Xu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1264 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, L. et al. (2022). Pose for Everything: Towards Category-Agnostic Pose Estimation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_23

  • DOI: https://doi.org/10.1007/978-3-031-20068-7_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20067-0

  • Online ISBN: 978-3-031-20068-7

  • eBook Packages: Computer Science, Computer Science (R0)
