Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms

Original Article
International Journal of Computer Assisted Radiology and Surgery

An Erratum to this article was published on 12 June 2015

Abstract

Purpose

Context-aware systems for the operating room (OR) could significantly improve surgical workflow through applications such as efficient OR scheduling, context-sensitive user interfaces, and automatic transcription of medical procedures. Surgical action recognition is an essential element of such systems and is therefore an important research area. In this paper, we tackle the problem of classifying surgical actions from video clips that capture the activities taking place in the OR.

Methods

We acquire recordings using a multi-view RGBD camera system mounted on the ceiling of a hybrid OR dedicated to X-ray-based procedures and annotate clips of the recordings with the corresponding actions. To recognize the surgical actions from the video clips, we use a classification pipeline based on the bag-of-words (BoW) approach. We propose a novel feature encoding method that extends the classical BoW approach. Instead of using the typical rigid grid layout to divide the space of the feature locations, we propose to learn the layout from the actual 4D spatio-temporal locations of the visual features. This results in a data-driven, non-rigid layout that retains more spatio-temporal information than its rigid counterpart.
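The contrast between a rigid and a learned layout can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the layout is learned, so k-means clustering of the 4D feature locations is used here purely as an assumption. The data are synthetic, and the helper names (`kmeans`, `assign`, `rigid_cells`, `cell_histograms`) as well as the choice of 8 visual words and 16 cells are hypothetical.

```python
import numpy as np

def assign(points, centers):
    """Index of the nearest center (Euclidean) for every point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the layout learning step)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def rigid_cells(locs, bins=2):
    """Rigid layout: split each axis of [0, 1) into `bins` equal parts."""
    idx = np.minimum((locs * bins).astype(int), bins - 1)
    return np.ravel_multi_index(tuple(idx.T), (bins,) * locs.shape[1])

def cell_histograms(words, cells, n_words, n_cells):
    """Concatenate one visual-word histogram per layout cell, L1-normalised."""
    hist = np.zeros((n_cells, n_words))
    np.add.at(hist, (cells, words), 1.0)  # count each (cell, word) pair
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
n_words, n_cells = 8, 16                   # hypothetical sizes

# Synthetic stand-ins for real data: (x, y, z, t) locations scaled to [0, 1).
train_locs = rng.random((500, 4))          # locations used to learn the layout
clip_locs = rng.random((60, 4))            # locations of one clip's features
clip_words = rng.integers(0, n_words, 60)  # visual-word index per feature

layout = kmeans(train_locs, n_cells)       # learned, non-rigid layout
h_rigid = cell_histograms(clip_words, rigid_cells(clip_locs), n_words, n_cells)
h_learned = cell_histograms(clip_words, assign(clip_locs, layout), n_words, n_cells)
print(h_rigid.shape, h_learned.shape)
```

Both encodings produce a vector of the same length (cells times visual words), so either can feed the same classifier. The only difference is how a feature's 4D location is mapped to a cell: by fixed axis-aligned bins, or by the nearest learned center.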

Results

We classify multi-view video clips from a new dataset generated from 11 days of recordings of real operations. This dataset is composed of 1734 video clips of 15 actions. These include generic actions (e.g., moving patient to the OR bed) and actions specific to the vertebroplasty procedure (e.g., hammering). The experiments show that the proposed non-rigid feature encoding method outperforms the rigid encoding. The classifier’s accuracy increases by over 4 percentage points, from 81.08 % to 85.53 %.
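As a quick check on the reported figures, the short calculation below expresses the gain both as an absolute difference and as a relative reduction in error rate. Only the two accuracies come from the text; the variable names are illustrative, and the relative error reduction is a derived quantity that the abstract itself does not report.

```python
# Accuracies reported in the abstract, in percent.
acc_rigid = 81.08
acc_learned = 85.53

# Absolute gain in percentage points.
gain_points = acc_learned - acc_rigid

# Relative reduction of the error rate (100 - accuracy).
err_rigid = 100.0 - acc_rigid
err_learned = 100.0 - acc_learned
relative_error_reduction = (err_rigid - err_learned) / err_rigid

print(f"gain: {gain_points:.2f} percentage points")
print(f"relative error reduction: {relative_error_reduction:.1%}")
```

The absolute gain is about 4.45 percentage points, which corresponds to removing a little under a quarter of the errors made with the rigid encoding.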

Conclusion

Combining intensity and depth information from the RGBD data provides more discriminative power for surgical action recognition than using either one alone. Furthermore, the proposed non-rigid spatio-temporal feature encoding scheme provides more discriminative histogram representations than its rigid counterpart. To the best of our knowledge, this is also the first work that presents action recognition results on multi-view RGBD data recorded in the OR.

Acknowledgments

This work was supported by the French ANR within the Investissements d’Avenir program under references ANR-11-LABX-0004 (Labex CAMI), ANR-10-IDEX-0002-02 (IdEx Unistra), and ANR-10-IAHU-02 (IHU Strasbourg). The authors would like to thank the medical staff of the Interventional Radiology Department at Nouvel Hôpital Civil Strasbourg for their collaboration during the data acquisition process.

Author information

Corresponding author

Correspondence to Andru P. Twinanda.

About this article

Cite this article

Twinanda, A.P., Alkan, E.O., Gangi, A. et al. Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms. Int J CARS 10, 737–747 (2015). https://doi.org/10.1007/s11548-015-1186-1

  • DOI: https://doi.org/10.1007/s11548-015-1186-1
