Key Frame Proposal Network for Efficient Pose Estimation in Videos

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12362)

Abstract

Human pose estimation in video typically relies on local information, either estimating each frame independently or tracking poses across neighboring frames. In this paper, we propose a novel method that combines these local approaches with global context. We introduce a lightweight, unsupervised key frame proposal network (K-FPN) to select informative frames, together with a learned dictionary to recover the entire pose sequence from these key frames. The K-FPN speeds up pose estimation and provides robustness to bad frames affected by occlusion, motion blur, and illumination changes, while the learned dictionary supplies global dynamic context. Experiments on the Penn Action and sub-JHMDB datasets show that the proposed method achieves state-of-the-art accuracy with a substantial speed-up.
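To make the recovery idea concrete, the sketch below reconstructs a full pose sequence from poses observed at a few key frames by solving a least-squares fit over the atoms of a temporal dictionary. This is an illustrative assumption rather than the paper's implementation: the function and variable names are invented, the dictionary here is random, and the actual method learns both the dictionary and the K-FPN frame selection end-to-end.

```python
import numpy as np

def recover_sequence(key_poses, key_idx, dictionary):
    """Recover a full pose sequence from poses observed at key frames.

    key_poses  : (k, d) poses at the selected key frames (d = flattened joint coords)
    key_idx    : (k,)   indices of the key frames within the T-frame sequence
    dictionary : (T, m) temporal dictionary whose m atoms span pose trajectories

    Returns a (T, d) estimate of the whole sequence.
    """
    D_key = dictionary[key_idx]  # (k, m) dictionary rows at the key frames
    # Coefficients that best explain the key-frame poses in a least-squares sense
    coeffs, *_ = np.linalg.lstsq(D_key, key_poses, rcond=None)
    # Extrapolate to every frame by applying the same coefficients to all atoms
    return dictionary @ coeffs   # (T, d)


# Toy usage: 40 frames, 26-dim poses (13 joints x 2), a random smooth-ish dictionary
T, d, m = 40, 26, 8
rng = np.random.default_rng(0)
dictionary = np.cumsum(rng.standard_normal((T, m)), axis=0)
key_idx = np.array([0, 10, 20, 30, 39])
key_poses = rng.standard_normal((len(key_idx), d))
full_seq = recover_sequence(key_poses, key_idx, dictionary)
print(full_seq.shape)  # (40, 26)
```

The point of the sketch is only the shape of the computation: once informative key frames are chosen, the remaining frames cost no per-frame pose inference, since they are interpolated through the dictionary.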

Keywords

Fast human pose estimation in videos · Key frame proposal network (K-FPN) · Unsupervised learning

Acknowledgements

This work was supported by NSF grants IIS-1814631 and ECCS-1808381, and by the ALERT DHS Center of Excellence under Award Number 2013-ST-061-ED0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Electrical and Computer Engineering, Northeastern University, Boston, USA
  2. Motorola Solutions, Inc., Somerville, USA
