
Trans-SVNet: hybrid embedding aggregation Transformer for surgical workflow analysis

  • Original Article
International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

Real-time surgical workflow analysis has been a key component of computer-assisted intervention systems for improving cognitive assistance. Most existing methods rely solely on conventional temporal models and encode features with a successive spatial–temporal arrangement, so the supportive benefits of intermediate features are partially lost from both the visual and temporal aspects. In this paper, we rethink feature encoding to attend to and preserve the critical information needed for accurate workflow recognition and anticipation.

Methods

We introduce the Transformer to surgical workflow analysis in order to reconsider the complementary effects of spatial and temporal representations. We propose a hybrid embedding aggregation Transformer, named Trans-SVNet, that lets the designed spatial and temporal embeddings interact effectively by employing the spatial embedding to query the temporal embedding sequence. The model is jointly optimized with loss objectives from both analysis tasks to leverage their high correlation.
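
The aggregation described above can be summarized in a short sketch. The following PyTorch snippet is a minimal illustration, not the authors' released implementation: the per-frame spatial embedding serves as the attention query, the temporal embedding sequence supplies the keys and values, and the fused feature feeds two heads whose losses (phase recognition and anticipation) are summed for joint optimization. The embedding dimension, head count, output sizes, and equal loss weighting are illustrative assumptions.

import torch
import torch.nn as nn

class HybridAggregationSketch(nn.Module):
    """Sketch of hybrid embedding aggregation: the spatial embedding queries
    the temporal embedding sequence. Dimensions and heads are assumptions."""

    def __init__(self, dim=512, num_heads=8, num_phases=7, num_targets=5):
        super().__init__()
        # Cross-attention: query = spatial embedding, key/value = temporal sequence.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.recognition_head = nn.Linear(dim, num_phases)    # phase logits
        self.anticipation_head = nn.Linear(dim, num_targets)  # e.g. remaining-time regression

    def forward(self, spatial_emb, temporal_seq):
        # spatial_emb:  (B, 1, dim)  embedding of the current frame
        # temporal_seq: (B, T, dim)  temporal embeddings of preceding frames
        fused, _ = self.cross_attn(spatial_emb, temporal_seq, temporal_seq)
        fused = self.norm(fused + spatial_emb).squeeze(1)     # residual + normalize
        return self.recognition_head(fused), self.anticipation_head(fused)

# Joint optimization with both task losses (equal weighting is an assumption).
model = HybridAggregationSketch()
ce, reg = nn.CrossEntropyLoss(), nn.SmoothL1Loss()
phase_logits, anticipation = model(torch.randn(4, 1, 512), torch.randn(4, 30, 512))
loss = ce(phase_logits, torch.randint(0, 7, (4,))) + reg(anticipation, torch.rand(4, 5))
loss.backward()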

Results

We extensively evaluate our method on three large surgical video datasets. Our method consistently outperforms state-of-the-art methods on the workflow recognition task across all three datasets. When jointly learned with anticipation, recognition results gain a large improvement. Our approach also shows promising performance on the anticipation task. Our model achieves a real-time inference speed of 0.0134 seconds per frame (approximately 75 frames per second).

Conclusion

Experimental results demonstrate the efficacy of our hybrid embedding integration, which rediscovers crucial cues from complementary spatial–temporal embeddings. The improved performance under multi-task learning indicates that the anticipation task brings additional knowledge to the recognition task. The effectiveness and efficiency of our method also indicate its potential for use in the operating room.



Acknowledgements

This work is supported, in whole or in part, by the CUHK Shun Hing Institute of Advanced Engineering (project MMT-p5-20), the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) [203145/Z/16/Z], and Horizon 2020 FET (863146). For the purpose of open access, the author has applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission.

Author information


Corresponding author

Correspondence to Qi Dou.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

This article does not contain patient data.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jin, Y., Long, Y., Gao, X. et al. Trans-SVNet: hybrid embedding aggregation Transformer for surgical workflow analysis. Int J CARS 17, 2193–2202 (2022). https://doi.org/10.1007/s11548-022-02743-8

