Deeply Learned Compositional Models for Human Pose Estimation

  • Wei Tang
  • Pei Yu
  • Ying Wu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)

Abstract

Compositional models represent patterns with hierarchies of meaningful parts and subparts. Their ability to characterize high-order relationships among body parts helps resolve low-level ambiguities in human pose estimation (HPE). However, prior compositional models make unrealistic assumptions about subpart-part relationships, making them incapable of characterizing complex compositional patterns. Moreover, the state spaces of their higher-level parts can be exponentially large, complicating both inference and learning. To address these issues, this paper introduces a novel framework, termed the Deeply Learned Compositional Model (DLCM), for HPE. It exploits deep neural networks to learn the compositionality of human bodies. This results in a novel network with a hierarchical compositional architecture and bottom-up/top-down inference stages. In addition, we propose a novel bone-based part representation. It not only compactly encodes the orientations, scales and shapes of parts, but also avoids their potentially large state spaces. With significantly lower complexity, our approach outperforms state-of-the-art methods on three benchmark datasets.
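The paper's model learns these operations with deep networks; purely as an illustrative sketch, the toy below shows the two ideas the abstract names: a bone-based representation that encodes a part's position, orientation and length in a single soft map, plus bottom-up composition of subparts into parts and top-down refinement of subparts by their parent. The function names and the max/product composition rules here are our own assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def bone_map(p, q, size=64, sigma=1.5):
    """Render a bone (line segment between joints p and q, in (x, y) pixels)
    as a soft 2D map whose values decay with distance to the segment.
    One map compactly encodes the bone's orientation, scale and shape."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    (px, py), (qx, qy) = p, q
    vx, vy = qx - px, qy - py
    seg_len2 = max(vx * vx + vy * vy, 1e-8)
    # Project each pixel onto the segment, clamped to its endpoints.
    t = np.clip(((xs - px) * vx + (ys - py) * vy) / seg_len2, 0.0, 1.0)
    dx, dy = xs - (px + t * vx), ys - (py + t * vy)
    return np.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))

# Bottom-up stage: a higher-level part's map is composed from its subparts.
lower_arm = bone_map((10, 10), (25, 20))
upper_arm = bone_map((25, 20), (40, 40))
whole_arm = np.maximum(lower_arm, upper_arm)  # compose subparts into a part

# Top-down stage: the parent's map gates its subparts, suppressing responses
# inconsistent with the higher-level hypothesis.
refined_lower_arm = lower_arm * whole_arm
```

Because each bone is one dense map rather than a discrete (location, orientation, scale) tuple, the higher-level part never needs an explicit, exponentially large state space; composition stays a cheap pixel-wise operation.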

Notes

Acknowledgement

This work was supported in part by National Science Foundation grants IIS-1217302 and IIS-1619078, and by Army Research Office grant W911NF-16-1-0138.

Supplementary material

Supplementary material 1: 474178_1_En_12_MOESM1_ESM.pdf (3.4 MB)

Supplementary material 2: video (AVI, 20.5 MB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Northwestern University, Evanston, USA