
SF-TMN: SlowFast temporal modeling network for surgical phase recognition

  • Original Article
  • Published in: International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

Automatic surgical phase recognition is crucial for video-based assessment systems in surgical education. Because temporal information is essential for this task, various recent approaches extract frame-level features and perform full-video temporal modeling.

Methods

For better temporal modeling, we propose the SlowFast temporal modeling network (SF-TMN) for offline surgical phase recognition, which achieves not only frame-level but also segment-level full-video temporal modeling. A feature extraction network, pretrained on the target dataset, extracts features from video frames as the training data for SF-TMN. The Slow Path in SF-TMN uses all frame features for frame-level temporal modeling. The Fast Path in SF-TMN uses segment-level features, summarized from frame features, for segment-level temporal modeling. The proposed paradigm is flexible regarding the choice of temporal modeling networks.
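The segment-level features for the Fast Path could be summarized from frame features in several ways; a minimal sketch, assuming simple average pooling over non-overlapping windows (the function name, window scheme, and feature dimensions are illustrative, not the paper's exact implementation):

```python
import numpy as np

def summarize_segments(frame_features, segment_size=10):
    """Pool frame-level features into segment-level features by
    averaging non-overlapping windows of `segment_size` frames.
    frame_features: array of shape (T, D)."""
    split_points = range(segment_size, len(frame_features), segment_size)
    chunks = np.array_split(frame_features, split_points)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

# Example: 95 frames of 2048-D backbone features -> 10 segment features
frames = np.random.rand(95, 2048).astype(np.float32)
print(summarize_segments(frames).shape)  # (10, 2048)
```

The Slow Path would consume the full (T, D) sequence, while the Fast Path operates on the much shorter pooled sequence, which is what makes segment-level full-video modeling cheap.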

Results

We explore MS-TCN and ASFormer as temporal modeling networks and experiment with multiple combination strategies for the Slow and Fast Paths. We evaluate SF-TMN on the Cholec80 and Cataract-101 surgical phase recognition tasks and show that it achieves state-of-the-art results on all considered metrics. SF-TMN with an ASFormer backbone outperforms the state-of-the-art Swin BiGRU by approximately 1% in accuracy and 1.5% in recall on Cholec80. SF-TMN also achieves state-of-the-art results on the action segmentation datasets 50salads, GTEA, and Breakfast.
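Combining the two paths requires bringing segment-level predictions back to the frame rate. One illustrative strategy (a sketch only; the paper evaluates multiple combination strategies, and this averaging fusion is a hypothetical example, not necessarily the one used) is to repeat each segment prediction over its frames and average with the frame-level prediction:

```python
import numpy as np

def fuse_paths(frame_logits, segment_logits, segment_size=10):
    """Hypothetical fusion: upsample segment-level predictions by
    repeating each one over its frames, then average with the
    frame-level predictions. Shapes: (T, C) and (ceil(T/s), C)."""
    t = frame_logits.shape[0]
    upsampled = np.repeat(segment_logits, segment_size, axis=0)[:t]
    return 0.5 * (frame_logits + upsampled)

# Example: 95 frames, 7 surgical phases (as in Cholec80), 10 segments
frame_logits = np.random.rand(95, 7)
segment_logits = np.random.rand(10, 7)
print(fuse_paths(frame_logits, segment_logits).shape)  # (95, 7)
```

The fused frame-rate output could then be passed through further temporal refinement stages, as described in the Conclusion.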

Conclusion

The improved results show that combining temporal information from both the frame level and the segment level, and refining the outputs with temporal refinement stages, benefits the temporal modeling of surgical phases.



Author information


Corresponding author

Correspondence to Bokai Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

For this type of study, formal consent is not required.

Informed consent

This article does not contain patient data.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, B., Sarhan, M.H., Goel, B. et al. SF-TMN: SlowFast temporal modeling network for surgical phase recognition. Int J CARS (2024). https://doi.org/10.1007/s11548-024-03095-1

