A spatio-temporal network for video semantic segmentation in surgical videos

Grammatikopoulou, Maria; Sanchez-Matilla, Ricardo; Bragman, Felix; Owen, David; Culshaw, Lucy; Kerr, Karen; Stoyanov, Danail; Luengo, Imanol

doi:10.1007/s11548-023-02971-6

Original Article
Published: 22 June 2023

Volume 19, pages 375–382 (2024)

International Journal of Computer Assisted Radiology and Surgery

Maria Grammatikopoulou¹ (ORCID: orcid.org/0009-0002-8345-0850),
Ricardo Sanchez-Matilla¹,
Felix Bragman¹,
David Owen¹,
Lucy Culshaw¹,
Karen Kerr¹,
Danail Stoyanov¹,² &
Imanol Luengo¹

742 Accesses
1 Citation
Abstract

Purpose

Semantic segmentation in surgical videos has applications in intra-operative guidance, post-operative analytics and surgical education. Models need to provide predictions that are both accurate and temporally consistent, since temporally inconsistent identification of anatomy can compromise patient safety. We propose a novel architecture for modelling temporal relationships in videos to address these issues.

Methods

We developed a temporal segmentation model that includes a static encoder and a spatio-temporal decoder. The encoder processes individual frames whilst the decoder learns spatio-temporal relationships from frame sequences. The decoder can be used with any suitable encoder to improve temporal consistency.
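The encoder/decoder split described above can be illustrated with a toy sketch. This is not the authors' architecture: the functions `static_encoder` and `temporal_decoder`, the 1x1 linear projection, the causal averaging window and all shapes are hypothetical stand-ins chosen only to show the data flow, in which frames are encoded independently and the decoder then mixes features across the frame sequence before classifying each pixel.

```python
import numpy as np

def static_encoder(frame, weights):
    """Hypothetical per-frame encoder: (H, W, C_in) -> (H, W, C_feat).

    A 1x1 linear projection followed by ReLU, standing in for any image backbone.
    """
    return np.maximum(frame @ weights, 0.0)

def temporal_decoder(features, temporal_kernel, class_weights):
    """Hypothetical spatio-temporal decoder.

    features: (T, H, W, C_feat) stack of per-frame encoder outputs.
    Mixes each pixel's features over a causal window of K frames, then
    assigns a class label to every pixel of every frame.
    """
    T = features.shape[0]
    K = temporal_kernel.shape[0]
    # Repeat the first frame K-1 times so frame t only sees frames t-K+1 .. t.
    padded = np.concatenate(
        [np.repeat(features[:1], K - 1, axis=0), features], axis=0
    )
    # Weighted sum over the time axis of each window: (K,) x (K, H, W, C) -> (H, W, C)
    mixed = np.stack([
        np.tensordot(temporal_kernel, padded[t:t + K], axes=(0, 0))
        for t in range(T)
    ])
    logits = mixed @ class_weights      # (T, H, W, n_classes)
    return logits.argmax(axis=-1)       # (T, H, W) label maps

# Toy sizes and random weights, chosen for illustration only.
rng = np.random.default_rng(0)
T, H, W, C_in, C_feat, n_classes, K = 5, 4, 6, 3, 8, 3, 3
video = rng.random((T, H, W, C_in))
enc_w = rng.standard_normal((C_in, C_feat))
cls_w = rng.standard_normal((C_feat, n_classes))
kernel = np.full(K, 1.0 / K)            # simple moving average over K frames

features = np.stack([static_encoder(f, enc_w) for f in video])
labels = temporal_decoder(features, kernel, cls_w)
print(labels.shape)  # (5, 4, 6)
```

Because the decoder only consumes a stack of per-frame features, the encoder can be swapped without touching the temporal part, which mirrors the claim that the decoder can be paired with any suitable encoder. With a window of one frame (`K = 1`) the sketch reduces to ordinary per-frame segmentation.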

Results

Model performance was evaluated on the CholecSeg8k dataset and a private dataset of robotic partial nephrectomy procedures. When the temporal decoder was applied, mean Intersection over Union improved by 1.30% on CholecSeg8k and by 4.27% on the partial nephrectomy dataset. Our model also improved temporal consistency by up to 7.23%.
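For readers unfamiliar with the metric, mean Intersection over Union can be computed as in the minimal sketch below. The function name `mean_iou`, the choice to skip classes absent from both masks, and the toy masks are assumptions made for illustration; the paper's exact averaging protocol (per frame, per video or per dataset) is not stated in the abstract.

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean Intersection over Union across classes.

    pred, target: integer label maps of identical shape.
    Classes absent from both maps are skipped (an assumption of this sketch).
    """
    ious = []
    for c in range(n_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class c appears in neither map
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x3 masks with two classes.
target = np.array([[0, 0, 1],
                   [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 1]])

# Class 0: intersection 2, union 3. Class 1: intersection 3, union 4.
print(mean_iou(pred, target, n_classes=2))  # (2/3 + 3/4) / 2 = 17/24 ≈ 0.708
```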

Conclusions

This work advances video segmentation of surgical scenes, with potential applications in surgery aimed at improving patient outcomes. The proposed decoder can extend state-of-the-art static models and is shown to improve both per-frame segmentation output and video temporal consistency.

Author information

Authors and Affiliations

Medtronic plc, London, UK
Maria Grammatikopoulou, Ricardo Sanchez-Matilla, Felix Bragman, David Owen, Lucy Culshaw, Karen Kerr, Danail Stoyanov & Imanol Luengo
Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
Danail Stoyanov

Corresponding author

Correspondence to Maria Grammatikopoulou.

Ethics declarations

Conflict of interest

Drs. Grammatikopoulou, Sanchez-Matilla, Bragman, Owen, Culshaw, Kerr, Luengo and Prof. Stoyanov are employees of Medtronic plc. Prof. Stoyanov is a co-founder and shareholder in Odin Vision, Ltd.

Ethical approval

Medtronic plc maintains all necessary rights and consents to process, analyze and display the private data referenced in this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Grammatikopoulou, M., Sanchez-Matilla, R., Bragman, F. et al. A spatio-temporal network for video semantic segmentation in surgical videos. Int J CARS 19, 375–382 (2024). https://doi.org/10.1007/s11548-023-02971-6

Received: 07 March 2023
Accepted: 19 May 2023
Published: 22 June 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s11548-023-02971-6
