Abstract
Multiple human pose estimation is an important yet challenging problem. In an operating room (OR) environment, the 3D body poses of surgeons and medical staff can provide important clues for surgical workflow analysis. For that purpose, we propose an algorithm for localizing and recovering body poses of multiple human in an OR environment under a multi-camera setup. Our model builds on 3D Pictorial Structures and 2D body part localization across all camera views, using convolutional neural networks (ConvNets). To evaluate our algorithm, we introduce a dataset captured in a real OR environment. Our dataset is unique, challenging and publicly available with annotated ground truths. Our proposed algorithm yields to promising pose estimation results on this dataset.
Similar content being viewed by others
References
Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. Pattern Anal. Mach. Intell. IEEE Trans. 28(1), 44–58 (2006)
Alahari, K., Seguin, G., Sivic, J., Laptev, I.: Pose estimation and segmentation of people in 3d movies. In: Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 2112–2119. IEEE (2013)
Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8. IEEE (2008)
Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: people detection and articulated pose estimation. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1014–1021. IEEE (2009)
Andriluka, M., Roth, S., Schiele, B.: Monocular 3d pose estimation and tracking by detection. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 623–630. IEEE (2010)
Belagiannis, V., Amann, C., Navab, N., Ilic, S.: Holistic human pose estimation with regression forests. In: Perales, F.J., Santos-Victor, J. (eds.) Articulated Motion and Deformable Objects, pp. 20–30. Springer (2014)
Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., Ilic, S.: 3D pictorial structures for multiple human pose estimation. In: Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on. IEEE (2014)
Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., Ilic, S.: 3D pictorial structures revisited: multiple human pose estimation. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99), pp. 1–1 (2015). doi:10.1109/TPAMI.2015.2509986
Belagiannis, V., Rupprecht, C., Carneiro, G., Navab, N.: Robust optimization for deep regression. In: Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE (2015)
Belagiannis, V., Wang, X., Schiele, B., Fua, P., Ilic, S., Navab, N.: Multiple human pose estimation with temporally consistent 3D pictorial structures. In: Computer Vision—ECCV 2014, ChaLearn Looking at People Workshop. Springer (2014)
Berclaz, J., Fleuret, F., Turetken, E., Fua, P.: Multiple object tracking using k-shortest paths optimization. Pattern Anal. Mach. Intell. IEEE Trans. 33(9), 1806–1819 (2011)
Bishop, C.M., et al.: Pattern Recognition and Machine Learning, vol. 1. Springer, New York (2006)
Burenius, M., Sullivan, J., Carlsson, S.: 3D pictorial structures for multiple view articulated pose estimation. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3618–3625. IEEE (2013)
Chen, X., Yuille, A.: Articulated pose estimation by a graphical model with image dependent pairwise relations. In: Advances in Neural Information Processing Systems (2014)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893. IEEE (2005)
Eichner, M., Ferrari, V.: We are family: joint pose estimation of multiple persons. In: Computer Vision—ECCV 2010, pp. 228–242. Springer (2010)
Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. Int. J. Comput. Vis. 61(1), 55–79 (2005)
Finley, T., Joachims, T.: Training structural svms when exact inference is intractable. In: Proceedings of the 25th International Conference on Machine Learning, pp. 304–311. ACM (2008)
Fischler, M.A., Elschlager, R.A.: The representation and matching of pictorial structures. IEEE Trans. Comput. 22(1), 67–92 (1973)
Gammeter, S., Ess, A., Jäggli, T., Schindler, K., Leibe, B., Van Gool, L.: Articulated multi-body tracking under egomotion. In: Computer Vision–ECCV 2008, pp. 816–830. Springer (2008)
Girshick, R., Shotton, J., Kohli, P., Criminisi, A., Fitzgibbon, A.: Efficient regression of general-activity human poses from depth images. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 415–422. IEEE (2011)
Grauman, K., Shakhnarovich, G., Darrell, T.: Inferring 3D structure with a statistical image-based shape model. In: Computer Vision, 2003. Proceedings of the Ninth IEEE International Conference on, pp. 641–647. IEEE (2003)
Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003)
Hofmann, M., Gavrila, D.M.: Multi-view 3D human pose estimation in complex environment. Int. J. Comput. Vis. 96(1), 103–124 (2012)
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 3192–3199. IEEE (2013)
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh, Y.: Panoptic studio: a massively multiview system for social motion capture. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3334–3342 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
Lallemand, J., Pauly, O., Schwarz, L., Tan, D., Ilic, S.: Multi-task forest for human pose estimation in depth images. In: 3DTV-Conference, 2013 International Conference on, pp. 271–278. IEEE (2013)
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
Lee, M.W., Nevatia, R.: Human pose tracking using multi-level structured models. In: Computer Vision–ECCV 2006, pp. 368–381. Springer (2006)
Li, S., Chan, A.B.: 3D human pose estimation from monocular images with deep convolutional neural network. In: Asian Conference on Computer Vision—ACCV 2014 (2014)
Liu, Y., Gall, J., Stoll, C., Dai, Q., Seidel, H.P., Theobalt, C.: Markerless motion capture of multiple characters using multiview image segmentation. Pattern Anal. Mach. Intell. IEEE Trans. 35(11), 2720–2735 (2013)
Luo, X., Berendsen, B., Tan, R.T., Veltkamp, R.C.: Human pose estimation for multiple persons based on volume reconstruction. In: Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 3591–3594. IEEE (2010)
Mitchelson, J.R., Hilton, A.: Simultaneous pose estimation of multiple people using multiple-view cues with hierarchical sampling. In: BMVC, pp. 1–10 (2003)
Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104(2), 90–126 (2006)
Padoy, N., Blum, T., Feussner, H., Berger, M.O., Navab, N.: On-line recognition of surgical activity for monitoring in the operating room. In: AAAI, pp. 1718–1724 (2008)
Pfister, T., Simonyan, K., Charles, J., Zisserman, A.: Deep convolutional neural networks for efficient pose estimation in gesture videos. In: Asian Conference on Computer Vision (2014)
Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Poselet conditioned pictorial structures. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 588–595. IEEE (2013)
Plankers, R., Fua, P.: Articulated soft objects for multi-view shape and motion capture. IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 63–83 (2003)
Ramanan, D., Forsyth, D.A.: Finding and tracking people from the bottom up. In: Computer Vision and Pattern Recognition, 2003. Proceedings of the 2003 IEEE Computer Society Conference on, vol. 2, pp. II–467. IEEE (2003)
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., Moore, R.: Real-time human pose recognition in parts from single depth images. Commun. ACM 56(1), 116–124 (2013)
Shitrit, H.B., Berclaz, J., Fleuret, F., Fua, P.: Multi-commodity network flow for tracking multiple people. Pattern Anal. Mach. Intell. IEEE Trans. 36(8), 1614–1627 (2014)
Sigal, L., Black, M.J.: Guest editorial: state of the art in image-and video-based human pose and motion estimation. Int. J. Comput. Vis. 87(1), 1–3 (2010)
Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density propagation for 3D human motion estimation. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 390–397. IEEE (2005)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Stauder, R., Okur, A., Peter, L., Schneider, A., Kranzfelder, M., Feussner, H., Navab, N.: Random forests for phase detection in surgical workflow analysis. In: Information Processing in Computer-Assisted Interventions, pp. 148–157. Springer (2014)
Taylor, G.W., Sigal, L., Fleet, D.J., Hinton, G.E.: Dynamical binary latent variable models for 3D human pose tracking. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 631–638. IEEE (2010)
Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in Neural Information Processing Systems (2014)
Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. In: Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on. IEEE (2014)
Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484 (2005)
Turetken, E., Wang, X., Becker, C., Fua, P.: Detecting and tracking cells using network flow programming. arXiv preprint arXiv:1501.05499 (2015)
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1, pp. I–511. IEEE (2001)
Wang, X.: Tracking interacting objects in image sequences. Ph.D. thesis, EPFL (2015)
Wang, X., Ablavsky, V., Shitrit, H.B., Fua, P.: Take your eyes off the ball: improving ball-tracking by focusing on team play. Comput. Vis. Image Underst. 119, 102–115 (2014)
Wang, X., Turetken, E., Fleuret, F., Fua, P.: Tracking interacting objects optimally using integer programming. In: ECCV, pp. 17–32 (2014)
Wang, X., Turetken, E., Fleuret, F., Fua, P.: Tracking interacting objects using intertwined flows. IEEE Trans. Pattern Anal. Mach. Intell. (99), 1–1. doi:10.1109/TPAMI.2015.2513406
Weede, O., Dittrich, F., Worn, H., Jensen, B., Knoll, A., Wilhelm, D., Kranzfelder, M., Schneider, A., Feussner, H.: Workflow analysis and surgical phase recognition in minimally invasive surgery. In: Robotics and Biomimetics (ROBIO), 2012 IEEE International Conference on, pp. 1080–1074. IEEE (2012)
Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1385–1392. IEEE (2011)
Yao, A., Gall, J., Gool, L.V., Urtasun, R.: Learning probabilistic non-linear latent variable models for tracking complex activities. In: Advances in Neural Information Processing Systems, pp. 1359–1367 (2011)
Zhao, T., Nevatia, R.: Tracking multiple humans in complex situations. Pattern Anal. Mach. Intell. IEEE Trans. 26(9), 1208–1221 (2004)
Acknowledgments
This work was supported in part by the Swiss National Science Foundation and by DFG - Deutsche Forschungsgemeinschaft under the project “Advanced Learning for Tracking and Detection in Medical Workflow Analysis”. The authors would like to thank Iro Laina for helping with the data preparation.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Belagiannis, V., Wang, X., Shitrit, H.B.B. et al. Parsing human skeletons in an operating room. Machine Vision and Applications 27, 1035–1046 (2016). https://doi.org/10.1007/s00138-016-0792-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00138-016-0792-4