
Fusing information from multiple 2D depth cameras for 3D human pose estimation in the operating room

  • Lasse Hansen
  • Marlin Siebert
  • Jasper Diesel
  • Mattias P. Heinrich
Original Article

Abstract

Purpose

For many years, deep convolutional neural networks have achieved state-of-the-art results on a wide variety of computer vision tasks. 3D human pose estimation is no exception, and results on public benchmarks are impressive. However, specialized domains, such as operating rooms, pose additional challenges: clinical settings involve severe occlusions, clutter and difficult lighting conditions, and privacy concerns of patients and staff make it necessary to use unidentifiable data. In this work, we aim to bring robust human pose estimation to the clinical domain.

Methods

We propose a 2D–3D information fusion framework that makes use of a network of multiple depth cameras and strong pose priors. In a first step, probabilities of 2D joint locations are predicted from single depth images. This information is fused in a shared voxel space, yielding a rough estimate of the 3D pose. Final joint positions are obtained by regressing into the latent pose space of a pre-trained convolutional autoencoder.
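To make the pipeline concrete, the following is a minimal PyTorch sketch of the two fusion stages, assuming calibrated pinhole cameras and precomputed per-view joint heatmaps: each voxel of a shared grid is projected into every view to collect 2D joint probabilities, the per-joint argmax over the fused volume gives the rough 3D pose, and a small autoencoder (untrained here; pre-trained in the actual method) stands in for the learned pose prior. All function and class names, grid sizes, and the toy calibration are illustrative assumptions, not the authors' code.

```python
# A minimal, self-contained sketch of the fusion pipeline (PyTorch).
# All names, shapes and the toy calibration below are illustrative
# assumptions, not the authors' implementation.
import torch

N_VIEWS, N_JOINTS = 3, 10
GRID = 32  # voxels per axis of the shared 3D space


def project(points_3d, K, R, t):
    """Project world points (M, 3) into pixel coordinates (M, 2)."""
    cam = points_3d @ R.T + t            # world -> camera frame
    uv = cam[:, :2] / cam[:, 2:3]        # perspective division
    return uv @ K[:2, :2].T + K[:2, 2]   # apply intrinsics


def fuse_heatmaps(heatmaps, cams, grid_pts):
    """Average per-view 2D joint probabilities over a shared voxel grid.

    heatmaps: list of (N_JOINTS, H, W) tensors, one per view.
    cams:     list of (K, R, t) calibration tuples.
    grid_pts: (GRID**3, 3) world coordinates of the voxel centers.
    Returns a (N_JOINTS, GRID**3) fused probability volume.
    """
    fused = torch.zeros(N_JOINTS, grid_pts.shape[0])
    for hm, (K, R, t) in zip(heatmaps, cams):
        uv = project(grid_pts, K, R, t).round().long()
        h, w = hm.shape[1:]
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        view = torch.zeros(N_JOINTS, grid_pts.shape[0])
        view[:, ok] = hm[:, uv[ok, 1], uv[ok, 0]]  # sample heatmaps at projections
        fused += view
    return fused / len(heatmaps)


class PoseAE(torch.nn.Module):
    """Toy stand-in for the pre-trained convolutional autoencoder prior."""

    def __init__(self, dim=N_JOINTS * 3, latent=8):
        super().__init__()
        self.enc = torch.nn.Sequential(
            torch.nn.Linear(dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, latent))
        self.dec = torch.nn.Sequential(
            torch.nn.Linear(latent, 64), torch.nn.ReLU(), torch.nn.Linear(64, dim))

    def forward(self, x):  # encode the rough pose, decode a regularized one
        return self.dec(self.enc(x))


# Dummy data: identical toy cameras 3 m in front of a [-1, 1]^3 volume.
K = torch.tensor([[50.0, 0.0, 32.0], [0.0, 50.0, 32.0], [0.0, 0.0, 1.0]])
cams = [(K, torch.eye(3), torch.tensor([0.0, 0.0, 3.0]))] * N_VIEWS
heatmaps = [torch.rand(N_JOINTS, 64, 64) for _ in range(N_VIEWS)]  # 2D stage output
ax = torch.linspace(-1, 1, GRID)
grid_pts = torch.stack(torch.meshgrid(ax, ax, ax, indexing="ij"), -1).reshape(-1, 3)

fused = fuse_heatmaps(heatmaps, cams, grid_pts)
rough = grid_pts[fused.argmax(dim=1)]  # (N_JOINTS, 3) rough pose via per-joint argmax
refined = PoseAE()(rough.reshape(1, -1)).reshape(N_JOINTS, 3)  # AE would be pre-trained
```

Averaging the back-projected probabilities is one simple fusion rule; multiplying them instead would enforce agreement across views more strictly. Either way, the autoencoder refinement pulls the per-joint argmax result toward anatomically plausible skeletons.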

Results

We evaluate our approach against several baselines on the challenging MVOR dataset. Best results are obtained when fusing 2D information from multiple views and constraining the predictions with learned pose priors.

Conclusions

We present a robust 3D human pose estimation framework based on a network of multiple depth cameras in the operating room. Using depth images as the only input modality makes our approach especially interesting for clinical applications, as it preserves the anonymity of patients and staff.

Keywords

Human pose estimation · Deep learning · 2D–3D information fusion · Convolutional autoencoder · Operating room

Notes

Acknowledgements

We would like to thank the reviewers for their many insightful comments and suggestions, which helped to improve this paper. We gratefully acknowledge the support of NVIDIA Corporation with their GPU donations for this research.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no relevant conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

A statement of informed consent was not applicable, since the manuscript does not contain any participants’ data.


Copyright information

© CARS 2019

Authors and Affiliations

  1. Institute of Medical Informatics, University of Lübeck, Lübeck, Germany
  2. Drägerwerk AG & Co. KGaA, Lübeck, Germany
