SF-TMN: SlowFast temporal modeling network for surgical phase recognition

Zhang, Bokai; Sarhan, Mohammad Hasan; Goel, Bharti; Petculescu, Svetlana; Ghanem, Amer

doi:10.1007/s11548-024-03095-1

Bokai Zhang¹ (ORCID: 0000-0003-1906-2116), Mohammad Hasan Sarhan², Bharti Goel³, Svetlana Petculescu¹ & Amer Ghanem¹

Abstract

Purpose

Automatic surgical phase recognition is essential to video-based assessment systems in surgical education. Because phase recognition depends heavily on temporal information, many recent approaches extract frame-level features and then perform temporal modeling over the full video.

Methods

For better temporal modeling, we propose the SlowFast temporal modeling network (SF-TMN) for offline surgical phase recognition, which performs full-video temporal modeling at both the frame level and the segment level. A feature extraction network, pretrained on the target dataset, extracts features from the video frames; these features serve as the training data for SF-TMN. The Slow Path in SF-TMN uses all frame features for frame-level temporal modeling. The Fast Path in SF-TMN uses segment-level features, summarized from the frame features, for segment-level temporal modeling. The proposed paradigm is flexible with respect to the choice of temporal modeling network.
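
The two-path input preparation can be illustrated with a minimal sketch. It assumes the frame features arrive as a (T, D) array and that segment-level features are obtained by average pooling over fixed-length, non-overlapping windows. The abstract does not state how SF-TMN summarizes frame features, so the pooling choice, the window length, and the function names `summarize_segments` and `upsample_to_frames` are hypothetical and only serve the illustration.

```python
import numpy as np


def summarize_segments(frame_feats: np.ndarray, seg_len: int) -> np.ndarray:
    """Average-pool (T, D) frame features into (ceil(T / seg_len), D) segment features.

    Assumption: mean pooling over non-overlapping windows; the last window
    may be shorter than seg_len.
    """
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    num_frames = frame_feats.shape[0]
    starts = range(0, num_frames, seg_len)
    return np.stack([frame_feats[s:s + seg_len].mean(axis=0) for s in starts])


def upsample_to_frames(seg_values: np.ndarray, seg_len: int, num_frames: int) -> np.ndarray:
    """Repeat each segment-level row seg_len times and trim to num_frames rows."""
    return np.repeat(seg_values, seg_len, axis=0)[:num_frames]


# Toy input: 10 frames with 4-dimensional features.
frame_feats = np.arange(40, dtype=np.float64).reshape(10, 4)

slow_input = frame_feats                                 # Slow Path: every frame
fast_input = summarize_segments(frame_feats, seg_len=4)  # Fast Path: one row per segment
fast_on_frames = upsample_to_frames(fast_input, 4, len(frame_feats))

print(slow_input.shape, fast_input.shape, fast_on_frames.shape)
```

In a setup like this, each path would feed its own temporal modeling network, and segment-level outputs would be brought back to frame resolution before the two paths are combined. How SF-TMN actually combines the paths is not described in the abstract.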

Results

We explore MS-TCN and ASFormer as temporal modeling networks and experiment with multiple combination strategies for the Slow and Fast Paths. We evaluate SF-TMN on the Cholec80 and Cataract-101 surgical phase recognition tasks and demonstrate that it achieves state-of-the-art results on all considered metrics. SF-TMN with an ASFormer backbone outperforms the state-of-the-art Swin BiGRU by approximately 1% in accuracy and 1.5% in recall on Cholec80. We also evaluate SF-TMN on action segmentation datasets including 50Salads, GTEA, and Breakfast, and achieve state-of-the-art results.
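
For readers unfamiliar with the reported metrics, the sketch below computes frame-wise accuracy and macro-averaged recall for one video from per-frame phase labels. The abstract does not specify the evaluation protocol (for example, how scores are averaged across videos or how phases absent from a video are handled), so this is one common formulation under those assumptions, and the function names and toy labels are hypothetical.

```python
import numpy as np


def framewise_accuracy(y_true, y_pred) -> float:
    """Fraction of frames whose predicted phase matches the ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def macro_recall(y_true, y_pred) -> float:
    """Mean of per-phase recall, taken over phases that occur in y_true.

    Assumption: phases absent from the ground truth are skipped.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Toy labels for an 8-frame video with three phases.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

acc = framewise_accuracy(y_true, y_pred)  # 6 of 8 frames correct
rec = macro_recall(y_true, y_pred)        # mean of 2/3, 1, and 2/3
print(acc, rec)
```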

Conclusion

The improved results indicate that combining temporal information from the frame level and the segment level, by refining the outputs with temporal refinement stages, benefits the temporal modeling of surgical phases.

Author information

Mohammad Hasan Sarhan and Bharti Goel have contributed equally to this work.

Authors and Affiliations

Johnson & Johnson MedTech, 1100 Olive Way, Suite 1100, Seattle, WA, 98101, USA
Bokai Zhang, Svetlana Petculescu & Amer Ghanem
Johnson & Johnson MedTech, Robert-Koch-Straße 1, 22851, Norderstedt, Schleswig-Holstein, Germany
Mohammad Hasan Sarhan
Johnson & Johnson MedTech, 5490 Great America Pkwy, Santa Clara, CA, 95054, USA
Bharti Goel

Corresponding author

Correspondence to Bokai Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

For this type of study, formal consent is not required.

Informed consent

This article does not contain patient data.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, B., Sarhan, M.H., Goel, B. et al. SF-TMN: SlowFast temporal modeling network for surgical phase recognition. Int J CARS (2024). https://doi.org/10.1007/s11548-024-03095-1

Received: 14 June 2023
Accepted: 29 February 2024
Published: 21 March 2024
DOI: https://doi.org/10.1007/s11548-024-03095-1
