A spatio-temporal network for video semantic segmentation in surgical videos

  • Original Article
  • Published in: International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

Semantic segmentation in surgical videos has applications in intra-operative guidance, post-operative analytics and surgical education. Models need to provide accurate and temporally consistent predictions, as inconsistent identification of anatomy can compromise patient safety. We propose a novel architecture for modelling temporal relationships in videos to address these issues.

Methods

We developed a temporal segmentation model that includes a static encoder and a spatio-temporal decoder. The encoder processes individual frames whilst the decoder learns spatio-temporal relationships from frame sequences. The decoder can be used with any suitable encoder to improve temporal consistency.
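
The sketch below illustrates this encoder/decoder split, assuming a PyTorch implementation. The two-layer convolutional backbone, the layer widths, the single 3D convolution used for temporal fusion and the class count are illustrative assumptions of this example, not the architecture reported in the paper.

import torch
import torch.nn as nn


class StaticEncoder(nn.Module):
    """Processes each frame independently and returns per-frame feature maps."""

    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, frames):  # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.view(b * t, c, h, w))  # fold time into the batch
        return feats.view(b, t, *feats.shape[1:])           # (B, T, F, H/4, W/4)


class SpatioTemporalDecoder(nn.Module):
    """Fuses features across the frame sequence before predicting per-pixel classes."""

    def __init__(self, feat_channels=64, num_classes=13):  # 13 classes, as in CholecSeg8k (assumption for this example)
        super().__init__()
        # A 3D convolution mixes information along the temporal axis.
        self.temporal = nn.Conv3d(feat_channels, feat_channels, kernel_size=3, padding=1)
        self.head = nn.Conv2d(feat_channels, num_classes, kernel_size=1)

    def forward(self, feats):                     # feats: (B, T, F, h, w)
        x = feats.permute(0, 2, 1, 3, 4)          # (B, F, T, h, w) for Conv3d
        x = torch.relu(self.temporal(x))
        x = x.permute(0, 2, 1, 3, 4)              # back to (B, T, F, h, w)
        b, t, f, h, w = x.shape
        logits = self.head(x.reshape(b * t, f, h, w))
        return logits.view(b, t, -1, h, w)        # (B, T, num_classes, h, w)


# Usage: segment a short clip of 5 frames.
clip = torch.randn(1, 5, 3, 128, 128)
encoder, decoder = StaticEncoder(), SpatioTemporalDecoder()
logits = decoder(encoder(clip))
print(logits.shape)  # torch.Size([1, 5, 13, 32, 32])

In practice the toy backbone above would be replaced by a state-of-the-art static encoder (e.g. an HRNet- or Swin-style network), since the decoder is intended to pair with any suitable encoder.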

Results

Model performance was evaluated on the CholecSeg8k dataset and a private dataset of robotic partial nephrectomy procedures. When the temporal decoder was applied, mean Intersection over Union improved by 1.30% and 4.27% on the two datasets, respectively. The model also improved temporal consistency by up to 7.23%.
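
For reference, the sketch below shows how the two kinds of quantity reported here can be computed from label maps: per-frame mean Intersection over Union against ground truth, and a temporal-consistency score between consecutive predictions. The consistency measure shown is a simplified proxy (mIoU between successive frames); flow-based formulations such as the one in reference 11 warp one frame's prediction onto the next before comparing.

import numpy as np


def mean_iou(pred, gt, num_classes):
    """Mean IoU over the classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))


def temporal_consistency(pred_prev, pred_curr, num_classes):
    """Agreement between segmentations of consecutive frames, measured as mIoU."""
    return mean_iou(pred_curr, pred_prev, num_classes)


# Usage with random label maps standing in for real predictions.
rng = np.random.default_rng(0)
frame_t0 = rng.integers(0, 13, size=(128, 128))
frame_t1 = rng.integers(0, 13, size=(128, 128))
print(temporal_consistency(frame_t0, frame_t1, num_classes=13))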

Conclusions

This work demonstrates an advance in video segmentation of surgical scenes, with potential applications in surgery aimed at improving patient outcomes. The proposed decoder can extend state-of-the-art static models, and we show that it improves both per-frame segmentation output and video temporal consistency.

References

  1. Hong W-Y, Kao C-L, Kuo Y-H, Wang J-R, Chang W-L, Shih C-S (2021) CholecSeg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on Cholec80. In: 12th international conference on information processing in computer-assisted interventions

  2. Guerrero DT, Asaad M, Rajesh A, Hassan A, Butler CE (2022) Advancing surgical education: the use of artificial intelligence in surgical training. Am Surg 89(1):49–54

  3. Hashimoto DA, Rosman G, Rus D, Meireles OR (2018) Artificial intelligence in surgery: promises and perils. Ann Surg 268(1):70–76

  4. Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X et al (2020) Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell 43(10):3349–3364

  5. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012

  6. Zhou T, Porikli F, Crandall DJ, Gool LV, Wang W (2022) A survey on deep learning technique for video segmentation. IEEE Trans Pattern Anal Mach Intell 1–20

  7. Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R (2022) Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1290–1299

  8. González C, Bravo-Sánchez L, Arbelaez P (2020) ISINet: an instance-based approach for surgical instrument segmentation. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 595–605

  9. Zhao Z, Jin Y, Gao X, Dou Q, Heng P-A (2020) Learning motion flows for semi-supervised instrument segmentation from robotic surgical video. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 679–689

  10. Lea C, Flynn MD, Vidal R, Reiter A, Hager GD (2017) Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

  11. Varghese S, Bayzidi Y, Bar A, Kapoor N, Lahiri S, Schneider JD, Schmidt NM, Schlicht P, Huger F, Fingscheidt T (2020) Unsupervised temporal consistency metric for video segmentation in highly-automated driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 336–337

  12. Puyal JG-B, Bhatia KK, Brandao P, Ahmad OF, Toth D, Kader R, Lovat L, Mountney P, Stoyanov D (2020) Endoscopic polyp segmentation using a hybrid 2D/3D CNN. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 295–305

  13. Wang B, Li L, Nakashima Y, Kawasaki R, Nagahara H, Yagi Y (2021) Noisy-LSTM: improving temporal awareness for video semantic segmentation. IEEE Access 9:46810–46820

  14. Liu Y, Shen C, Yu C, Wang J (2020) Efficient semantic video segmentation with per-frame inference. In: European conference on computer vision, Springer, pp 352–368

  15. Jain S, Wang X, Gonzalez JE (2019) Accel: a corrective fusion network for efficient semantic segmentation on video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8866–8875

  16. Farha YA, Gall J (2019) MS-TCN: multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3575–3584

  17. Hou J, Wang G, Chen X, Xue J-H, Zhu R, Yang H (2018) Spatial-temporal attention RES-TCN for skeleton-based dynamic hand gesture recognition. In: Proceedings of the European conference on computer vision (ECCV) workshops

  18. Teed Z, Deng J (2020) RAFT: recurrent all-pairs field transforms for optical flow. In: European conference on computer vision, Springer, pp 402–419

Author information

Corresponding author

Correspondence to Maria Grammatikopoulou.

Ethics declarations

Conflict of interest

Drs. Grammatikopoulou, Sanchez-Matilla, Bragman, Owen, Culshaw, Kerr, Luengo and Prof. Stoyanov are employees of Medtronic plc. Prof. Stoyanov is a co-founder and shareholder in Odin Vision, Ltd.

Ethical approval

Medtronic plc maintains all necessary rights and consents to process, analyze and display the private data referenced in this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Grammatikopoulou, M., Sanchez-Matilla, R., Bragman, F. et al. A spatio-temporal network for video semantic segmentation in surgical videos. Int J CARS 19, 375–382 (2024). https://doi.org/10.1007/s11548-023-02971-6
