View-Invariant Probabilistic Embedding for Human Pose

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12350)


Depictions of similar human body configurations can vary with changing viewpoints. Using only 2D information, we would like to enable vision algorithms to recognize similarity in human body poses across multiple views. This ability is useful for analyzing body movements and human behaviors in images and videos. In this paper, we propose an approach for learning a compact view-invariant embedding space from 2D joint keypoints alone, without explicitly predicting 3D poses. Since 2D poses are projected from 3D space, they have an inherent ambiguity, which is difficult to represent through a deterministic mapping. Hence, we use probabilistic embeddings to model this input uncertainty. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 2D-to-3D pose lifting models. We also demonstrate the effectiveness of applying our embeddings to view-invariant action recognition and video alignment. Our code is available at
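As a rough illustration of what a probabilistic embedding of 2D keypoints can look like, the following minimal sketch maps a 2D pose to a diagonal Gaussian in embedding space and scores a pair of poses by averaging a sigmoid of sampled embedding distances. It is not the authors' implementation: the linear maps `w_mean` and `w_logvar`, the normalization step, the function names, and the `scale` and `offset` parameters are all hypothetical stand-ins, and the weights are random rather than trained. View invariance would have to come from training on multi-view data, which this sketch omits.

```python
import numpy as np


def normalize(keypoints_2d):
    """Center a (J, 2) pose and scale it to unit norm (assumed preprocessing)."""
    centered = keypoints_2d - keypoints_2d.mean(axis=0, keepdims=True)
    scale = np.linalg.norm(centered) + 1e-8
    return (centered / scale).reshape(-1)


def embed_pose(keypoints_2d, w_mean, w_logvar):
    """Map a 2D pose to the mean and std of a diagonal Gaussian embedding.

    A single linear layer stands in for a learned network (assumption).
    """
    x = normalize(keypoints_2d)
    mean = x @ w_mean
    std = np.exp(0.5 * (x @ w_logvar))  # log-variance -> standard deviation
    return mean, std


def matching_probability(emb_a, emb_b, rng, num_samples=20, scale=1.0, offset=0.0):
    """Monte Carlo estimate of a pairwise matching score in (0, 1).

    Samples from both Gaussians, then averages sigmoid(-scale * d + offset)
    over all pairs of sampled embeddings, where d is the Euclidean distance.
    """
    mean_a, std_a = emb_a
    mean_b, std_b = emb_b
    z_a = mean_a + std_a * rng.standard_normal((num_samples, mean_a.size))
    z_b = mean_b + std_b * rng.standard_normal((num_samples, mean_b.size))
    dists = np.linalg.norm(z_a[:, None, :] - z_b[None, :, :], axis=-1)
    return float(np.mean(1.0 / (1.0 + np.exp(scale * dists - offset))))


# Toy usage with random (untrained) weights and a random 13-joint pose.
rng = np.random.default_rng(0)
num_joints, embedding_dim = 13, 16
w_mean = rng.normal(scale=0.1, size=(num_joints * 2, embedding_dim))
w_logvar = rng.normal(scale=0.1, size=(num_joints * 2, embedding_dim))

pose = rng.normal(size=(num_joints, 2))
shifted_pose = pose * 2.0 + 5.0  # same pose, translated and rescaled in the image

emb_pose = embed_pose(pose, w_mean, w_logvar)
emb_shifted = embed_pose(shifted_pose, w_mean, w_logvar)
score = matching_probability(emb_pose, emb_shifted, np.random.default_rng(1))
```

Because the embedding is a distribution rather than a point, an ambiguous 2D pose can be given a larger variance, which lowers its matching score against any single candidate. The normalization here only removes image translation and scale; it does not by itself handle changes in camera viewpoint.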


Keywords: Human pose embedding · Probabilistic embedding · View-invariant pose retrieval



We thank Yuxiao Wang, Debidatta Dwibedi, and Liangzhe Yuan from Google Research, Long Zhao from Rutgers University, and Xiao Zhang from University of Chicago for helpful discussions. We appreciate the support of Pietro Perona, Yisong Yue, and the Computational Vision Lab at Caltech for making this collaboration possible. The author Jennifer J. Sun is supported by NSERC (funding number PGSD3-532647-2019) and Caltech.

Supplementary material

Supplementary material 1: 504441_1_En_4_MOESM1_ESM.pdf (PDF, 2333 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. California Institute of Technology, Pasadena, USA
  2. Google Research, Los Angeles, USA
