Self-supervision on Unlabelled Data for Multi-person 2D/3D Human Pose Estimation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12261)

Abstract

2D/3D human pose estimation is needed to develop novel intelligent tools for the operating room that can analyze and support the clinical activities. The lack of annotated data and the complexity of state-of-the-art pose estimation approaches limit, however, the deployment of such techniques inside the OR. In this work, we propose to use knowledge distillation in a teacher/student framework to harness the knowledge present in a large-scale non-annotated dataset and in an accurate but complex multi-stage teacher network to train a lightweight network for joint 2D/3D pose estimation. The teacher network also exploits the unlabeled data to generate both hard and soft labels useful in improving the student predictions. The easily deployable network trained using this effective self-supervision strategy performs on par with the teacher network on MVOR+, an extension of the public MVOR dataset where all persons have been fully annotated, thus providing a viable solution for real-time 2D/3D human pose estimation in the OR.
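The abstract describes a teacher/student distillation setup in which the teacher produces both soft labels (its full output distributions) and hard labels (its discretized predictions used as pseudo-annotations) on unlabeled data. As a rough illustration of how such a combined objective is typically formed, the following is a minimal NumPy sketch; the function name `distillation_loss`, the weighting parameter `alpha`, and the use of mean squared error for both terms are assumptions for illustration, not the paper's actual loss.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def distillation_loss(student_out, soft_target, hard_target, alpha=0.5):
    """Combine a soft-label term (student output vs. teacher's raw heatmaps)
    with a hard-label term (student output vs. heatmaps rendered from the
    teacher's discretized pseudo-annotations), weighted by alpha."""
    return alpha * mse(student_out, soft_target) + (1.0 - alpha) * mse(student_out, hard_target)

# Toy example with 2x2 "heatmaps":
student = np.zeros((2, 2))
soft = np.ones((2, 2))        # teacher's soft output
hard = np.full((2, 2), 2.0)  # rendered from teacher's hard pseudo-labels
loss = distillation_loss(student, soft, hard, alpha=0.5)
```

In practice the soft term lets the student learn from the teacher's full confidence map, while the hard term anchors it to concrete keypoint locations; the balance between the two is a design choice.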

Keywords

Human pose estimation · Knowledge distillation · Data distillation · Operating room · Low-resolution images

Notes

Acknowledgements

This work was supported by French state funds managed by the ANR within the Investissements d’Avenir program under reference ANR-16-CE33-0009 (DeepSurg). The authors would also like to thank the members of the Interventional Radiology Department at University Hospital of Strasbourg for their help in generating the dataset.

Supplementary material

Supplementary material 1: 505204_1_En_74_MOESM1_ESM.pdf (PDF, 3.2 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. ICube, University of Strasbourg, CNRS, IHU Strasbourg, Strasbourg, France
  2. Radiology Department, University Hospital of Strasbourg, Strasbourg, France
