Abstract
Purpose
Real-time surgical workflow analysis is a key component of computer-assisted intervention systems for improving cognitive assistance. Most existing methods rely solely on conventional temporal models and encode features in a successive spatial–temporal arrangement, so the supportive benefits of intermediate features are partially lost from both the visual and the temporal perspective. In this paper, we rethink feature encoding to attend to and preserve the information critical for accurate workflow recognition and anticipation.
Methods
We introduce the Transformer into surgical workflow analysis to reconsider the complementary effects of spatial and temporal representations. We propose a hybrid embedding aggregation Transformer, named Trans-SVNet, that effectively fuses the designed spatial and temporal embeddings by employing the spatial embedding to query the temporal embedding sequence. The model is jointly optimized with loss objectives from both analysis tasks to leverage their high correlation.
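The aggregation described above can be pictured as standard scaled dot-product attention in which the current frame's spatial embedding acts as the query and the temporal embedding sequence supplies the keys and values. The following is a minimal single-head sketch of that idea; the function names and dimensions are illustrative and do not reproduce the authors' implementation, which uses full Transformer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(spatial_emb, temporal_embs):
    """Fuse embeddings by attention: the spatial embedding (d,)
    of the current frame queries a sequence of temporal
    embeddings (T, d); keys and values are the temporal sequence."""
    d = spatial_emb.shape[-1]
    scores = temporal_embs @ spatial_emb / np.sqrt(d)  # (T,) similarity of query to each key
    weights = softmax(scores)                          # (T,) attention over the sequence
    return weights @ temporal_embs                     # (d,) weighted sum of values

rng = np.random.default_rng(0)
spatial = rng.standard_normal(64)        # one frame's spatial embedding
temporal = rng.standard_normal((10, 64)) # 10-step temporal embedding sequence
fused = aggregate(spatial, temporal)
print(fused.shape)  # (64,)
```

In this sketch the fused vector is a convex combination of the temporal embeddings, so spatial cues steer which temporal context is retained rather than being discarded after temporal modeling.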
Results
We extensively evaluate our method on three large surgical video datasets. On the workflow recognition task, our method consistently outperforms state-of-the-art approaches across all three datasets. When jointly learned with anticipation, recognition results gain a large improvement. Our approach is also effective on the anticipation task, achieving promising performance, and runs at a real-time inference speed of 0.0134 seconds per frame.
Conclusion
Experimental results demonstrate the efficacy of our hybrid embedding aggregation, which rediscovers crucial cues from complementary spatial–temporal embeddings. The improved performance under multi-task learning indicates that the anticipation task brings additional knowledge to the recognition task. The effectiveness and efficiency of our method also show its potential for use in the operating room.
Acknowledgements
This work is supported, in whole or in part, by the CUHK Shun Hing Institute of Advanced Engineering (project MMT-p5-20); the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) [203145/Z/16/Z]; and Horizon 2020 FET (863146). For the purpose of open access, the author has applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
This article does not contain patient data.
About this article
Cite this article
Jin, Y., Long, Y., Gao, X. et al. Trans-SVNet: hybrid embedding aggregation Transformer for surgical workflow analysis. Int J CARS 17, 2193–2202 (2022). https://doi.org/10.1007/s11548-022-02743-8