Pose Proposal Networks

  • Taiki Sekii
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11217)


We propose a novel method to detect an unknown number of articulated 2D poses in real time. To decouple the runtime complexity of pixel-wise body part detectors from their convolutional neural network (CNN) feature map resolutions, our approach, called pose proposal networks, introduces a state-of-the-art single-shot object detection paradigm using grid-wise image feature maps in a bottom-up pose detection scenario. Body part proposals, which are represented as region proposals, and limbs are detected directly via a single-shot CNN. Tailored to such detections, a bottom-up greedy parsing step is probabilistically redesigned to take the global context into account. Experimental results on the MPII Multi-Person benchmark confirm that our method achieves 72.8% mAP, comparable to state-of-the-art bottom-up approaches, while its total runtime on a GeForce GTX 1080 Ti card reaches 5.6 ms (up to 180 FPS), which is faster than the bottleneck runtimes observed in state-of-the-art approaches.


Human pose estimation, Object detection

Supplementary material

Supplementary material 1: 474201_1_En_21_MOESM1_ESM.pdf (PDF, 87 KB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Konica Minolta, Inc., Osaka, Japan
