Fusing information from multiple 2D depth cameras for 3D human pose estimation in the operating room

Abstract

Purpose

For many years, deep convolutional neural networks have achieved state-of-the-art results on a wide variety of computer vision tasks. 3D human pose estimation is no exception, and results on public benchmarks are impressive. However, specialized domains such as operating rooms pose additional challenges: clinical settings involve severe occlusions, clutter and difficult lighting conditions, and privacy concerns of patients and staff make it necessary to use unidentifiable data. In this work, we aim to bring robust human pose estimation to the clinical domain.

Methods

We propose a 2D–3D information fusion framework that makes use of a network of multiple depth cameras and strong pose priors. In a first step, 2D joint probabilities are predicted from single depth images. This information is then fused in a shared voxel space, yielding a rough estimate of the 3D pose. Final joint positions are obtained by regressing into the latent pose space of a pre-trained convolutional autoencoder.
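The fusion step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes calibrated pinhole cameras given as hypothetical `(K, R, t)` tuples, nearest-pixel sampling of the 2D joint probability maps, and plain averaging across views as the fusion rule. The function names (`project_points`, `fuse_heatmaps`, `rough_pose`) are illustrative only, and the per-joint argmax stands in for the rough 3D estimate; the regression into the autoencoder's latent pose space is not shown.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project (N, 3) world points into a pinhole camera.

    Returns (N, 2) pixel coordinates and the (N,) depth in camera space.
    """
    cam = points_world @ R.T + t              # world -> camera coordinates
    z = cam[:, 2]
    safe_z = np.where(z > 0, z, 1.0)          # avoid dividing by zero; masked later
    uv = (cam @ K.T)[:, :2] / safe_z[:, None]
    return uv, z

def fuse_heatmaps(heatmaps, cameras, grid_min, grid_max, resolution):
    """Average per-view 2D joint probabilities over a shared voxel grid.

    heatmaps: list of (J, H, W) arrays, one per camera (assumed input format)
    cameras:  list of (K, R, t) tuples, one per camera (assumed calibration)
    Returns the fused (J, res, res, res) volume and the (res**3, 3) voxel centers.
    """
    axes = [np.linspace(grid_min[i], grid_max[i], resolution) for i in range(3)]
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    num_joints = heatmaps[0].shape[0]
    volume = np.zeros((num_joints, centers.shape[0]))
    for hm, (K, R, t) in zip(heatmaps, cameras):
        uv, z = project_points(centers, K, R, t)
        _, height, width = hm.shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        # Keep voxels in front of the camera that fall inside the image.
        valid = (z > 0) & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        volume[:, valid] += hm[:, v[valid], u[valid]]
    volume /= len(heatmaps)
    shape = (num_joints, resolution, resolution, resolution)
    return volume.reshape(shape), centers

def rough_pose(volume, centers):
    """Take the voxel center with the highest fused score for each joint."""
    num_joints = volume.shape[0]
    best = volume.reshape(num_joints, -1).argmax(axis=1)
    return centers[best]
```

Averaging is only one possible fusion rule; a product of per-view probabilities would penalize voxels that any single view rejects, at the cost of being less tolerant of occlusions.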

Results

We evaluate our approach against several baselines on the challenging MVOR dataset. Best results are obtained when fusing 2D information from multiple views and constraining the predictions with learned pose priors.

Conclusions

We present a robust 3D human pose estimation framework based on a multi-depth camera network in the operating room. Because depth images are the only input modality, the approach preserves the anonymity of patients and staff, which makes it especially attractive for clinical applications.

Notes

  1. [38]: https://github.com/Microsoft/human-pose-estimation.pytorch; [25]: https://github.com/dragonbook/V2V-PoseNet-pytorch.

References

  1. Achilles F, Ichim AE, Coskun H, Tombari F, Noachtar S, Navab N (2016) Patient mocap: human pose estimation under blanket occlusion for hospital monitoring applications. In: Proceedings of the international conference on medical image computing and computer-assisted intervention (MICCAI). Springer, pp 491–499

  2. Andriluka M, Iqbal U, Insafutdinov E, Pishchulin L, Milan A, Gall J, Schiele B (2018) PoseTrack: a benchmark for human pose estimation and tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5167–5176

  3. Andriluka M, Pishchulin L, Gehler P, Schiele B (2014) 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the conference on computer vision and pattern recognition (CVPR), pp 3686–3693

  4. Andriluka M, Roth S, Schiele B (2009) Pictorial structures revisited: people detection and articulated pose estimation. In: Proceedings of the conference on computer vision and pattern recognition (CVPR). IEEE, pp 1014–1021

  5. Belagiannis V, Wang X, Shitrit HBB, Hashimoto K, Stauder R, Aoki Y, Kranzfelder M, Schneider A, Fua P, Ilic S, Feussner H, Navab N (2016) Parsing human skeletons in an operating room. Mach Vis Appl (MVA) 27(7):1035–1046

  6. Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the conference on computer vision and pattern recognition (CVPR), pp 7291–7299

  7. Chen K, Gabriel P, Alasfour A, Gong C, Doyle WK, Devinsky O, Friedman D, Dugan P, Melloni L, Thesen T, Gonda D, Sattar S, Wang S, Gilja V (2018) Patient-specific pose estimation in clinical environments. J Transl Eng Health Med (JTEHM) 6:1–11

  8. Chen Y, Wang Z, Peng Y, Zhang Z, Yu G, Sun J (2018) Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the conference on computer vision and pattern recognition (CVPR), pp 7103–7112

  9. Dietz A, Schröder S, Pösch A, Frank K, Reithmeier E (2016) Contactless surgery light control based on 3D gesture recognition. In: GCAI, pp 138–146

  10. Felzenszwalb P, McAllester D, Ramanan D (2008) A discriminatively trained, multiscale, deformable part model. In: Proceedings of the conference on computer vision and pattern recognition (CVPR). IEEE, pp 1–8

  11. Girshick R, Shotton J, Kohli P, Criminisi A, Fitzgibbon A (2011) Efficient regression of general-activity human poses from depth images. In: Proceedings of the international conference on computer vision (ICCV). IEEE, pp 415–422

  12. Hansen L, Diesel J, Heinrich MP (2019) Regularized landmark detection with CAEs for human pose estimation in the operating room. In: Bildverarbeitung für die Medizin (BVM). Springer, pp 178–183

  13. Haque A, Peng B, Luo Z, Alahi A, Yeung S, Fei-Fei L (2016) Towards viewpoint invariant 3D human pose estimation. In: Proceedings of the European conference on computer vision (ECCV). Springer, pp 160–177

  14. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the conference on computer vision and pattern recognition (CVPR), pp 770–778

  15. Ionescu C, Papava D, Olaru V, Sminchisescu C (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. Trans Pattern Anal Mach Intell (TPAMI) 36(7):1325–1339

  16. Jacob MG, Li YT, Akingba GA, Wachs JP (2013) Collaboration with a robotic scrub nurse. Commun ACM 56(5):68–75

  17. Jung HY, Suh Y, Moon G, Lee KM (2016) A sequential approach to 3D human pose estimation: separation of localization and identification of body joints. In: Proceedings of the European conference on computer vision (ECCV). Springer, pp 747–761

  18. Kadkhodamohammadi A, Gangi A, de Mathelin M, Padoy N (2017) A multi-view RGB-D approach for human pose estimation in operating rooms. In: Proceedings of the winter conference on applications of computer vision (WACV). IEEE, pp 363–372

  19. Kadkhodamohammadi A, Padoy N (2018) A generalizable approach for multi-view 3D human pose regression. arXiv:1804.10462

  20. Katircioglu I, Tekin B, Salzmann M, Lepetit V, Fua P (2018) Learning latent representations of 3D human pose with deep neural networks. Int J Comput Vis (IJCV) 126:1–16

  21. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980

  22. Liu S, Yin Y, Ostadabbas S (2019) In-bed pose estimation: deep learning with shallow dataset. IEEE J Transl Eng Health Med 7:1–12. https://doi.org/10.1109/JTEHM.2019.2892970

  23. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: Proceedings of the European conference on computer vision (ECCV). Springer, pp 21–37

  24. McCoy TH, Perlis RH (2018) Temporal trends and characteristics of reportable health data breaches, 2010–2017. JAMA 320(12):1282–1284

  25. Moon G, Yong Chang J, Mu Lee K (2018) V2V-PoseNet: voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In: Proceedings of the conference on computer vision and pattern recognition (CVPR), pp 5079–5088

  26. Mori G, Ren X, Efros AA, Malik J (2004) Recovering human body configurations: combining segmentation and recognition. In: Proceedings of the conference on computer vision and pattern recognition (CVPR), vol 2. IEEE

  27. Newell A, Huang Z, Deng J (2017) Associative embedding: end-to-end learning for joint detection and grouping. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30, Curran Associates, Inc., pp 2277–2287. http://papers.nips.cc/paper/6822-associative-embedding-end-to-end-learning-for-joint-detection-and-grouping.pdf

  28. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: Proceedings of the European conference on computer vision (ECCV). Springer, pp 483–499

  29. Padoy N, Blum T, Ahmadi SA, Feussner H, Berger MO, Navab N (2012) Statistical modeling and recognition of surgical workflow. Med Image Anal 16(3):632–641

  30. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch. In: Advances in neural information processing systems workshop (NIPS-W)

  31. Pavlakos G, Zhou X, Derpanis KG, Daniilidis K (2017) Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings of the conference on computer vision and pattern recognition (CVPR). IEEE, pp 1263–1272

  32. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems (NIPS), pp 91–99

  33. Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from single depth images. In: Proceedings of the conference on computer vision and pattern recognition (CVPR). IEEE, pp 1297–1304

  34. Silas MR, Grassia P, Langerman A (2015) Video recording of the operating room-is anonymity possible? J Surg Res 197(2):272–276

  35. Srivastav V, Issenhuth T, Kadkhodamohammadi A, de Mathelin M, Gangi A, Padoy N (2018) MVOR: a multi-view RGB-D operating room dataset for 2D and 3D human pose estimation. arXiv:1808.08180

  36. Toshev A, Szegedy C (2014) DeepPose: human pose estimation via deep neural networks. In: Proceedings of the conference on computer vision and pattern recognition (CVPR), pp 1653–1660

  37. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res (JMLR) 11(Dec):3371–3408

  38. Xiao B, Wu H, Wei Y (2018) Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV)

  39. Yao A, Gall J, Van Gool L (2012) Coupled action recognition and pose estimation from multiple views. Int J Comput Vis 100(1):16–37

  40. Yusoff YA, Basori AH, Mohamed F (2013) Interactive hand and arm gesture control for 2D medical image and 3D volumetric medical visualization. Proc Soc Behav Sci 97:723–729

Acknowledgements

We would like to thank the reviewers for their many insightful comments and suggestions helping to improve our paper. We gratefully acknowledge the support of the NVIDIA Corporation with their GPU donations for this research.

Author information

Corresponding author

Correspondence to Lasse Hansen.

Ethics declarations

Conflict of interest

The authors declare that they have no relevant conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Statement of informed consent was not applicable since the manuscript does not contain any participants’ data.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hansen, L., Siebert, M., Diesel, J. et al. Fusing information from multiple 2D depth cameras for 3D human pose estimation in the operating room. Int J CARS 14, 1871–1879 (2019). https://doi.org/10.1007/s11548-019-02044-7

  • DOI: https://doi.org/10.1007/s11548-019-02044-7

Keywords

  • Human pose estimation
  • Deep learning
  • 2D–3D information fusion
  • Convolutional autoencoder
  • Operating room