Abstract
Purpose
Semantic segmentation in surgical videos has applications in intra-operative guidance, post-operative analytics and surgical education. Models need to provide accurate and temporally consistent predictions, since temporally inconsistent identification of anatomy can compromise patient safety. We propose a novel architecture for modelling temporal relationships in videos to address these issues.
Methods
We developed a temporal segmentation model that includes a static encoder and a spatio-temporal decoder. The encoder processes individual frames whilst the decoder learns spatio-temporal relationships from frame sequences. The decoder can be used with any suitable encoder to improve temporal consistency.
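The division of labour between the two components can be illustrated with a toy sketch. It is not the authors' network: `static_encoder`, `temporal_decoder`, `segment_video`, the thresholding rule and the fixed averaging weights are all hypothetical stand-ins (a real decoder would learn its spatio-temporal weights), and frames are plain nested lists rather than image tensors. The sketch only shows the data flow described above: the encoder sees one frame at a time, while the decoder sees a window of per-frame features.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]     # toy H x W image with values in [0, 1]
Features = List[List[float]]  # toy H x W feature map
Mask = List[List[int]]        # toy H x W binary segmentation


def static_encoder(frame: Frame) -> Features:
    """Hypothetical per-frame encoder: no access to neighbouring frames."""
    return [[2.0 * px - 1.0 for px in row] for row in frame]


def temporal_decoder(window: Sequence[Features], weights: Sequence[float]) -> Mask:
    """Hypothetical decoder: fuses features across time, then labels pixels.

    The weights are fixed here purely for illustration; in a trained model
    they would be learned.
    """
    h, w = len(window[0]), len(window[0][0])
    fused = [
        [sum(wt * feat[y][x] for wt, feat in zip(weights, window)) for x in range(w)]
        for y in range(h)
    ]
    return [[1 if v > 0.0 else 0 for v in row] for row in fused]


def segment_video(
    frames: Sequence[Frame],
    encoder: Callable[[Frame], Features] = static_encoder,
    window_size: int = 3,
) -> List[Mask]:
    """Encode each frame independently, then decode over a causal window."""
    weights = [1.0 / window_size] * window_size
    feats = [encoder(f) for f in frames]  # any per-frame encoder can be plugged in
    masks = []
    for t in range(len(feats)):
        # Indices t-(window_size-1) .. t, clamped to 0 at the start of the video.
        idx = [max(0, t - k) for k in range(window_size - 1, -1, -1)]
        masks.append(temporal_decoder([feats[i] for i in idx], weights))
    return masks


# A one-pixel "video" whose third frame flickers.
frames = [[[0.9]], [[0.9]], [[0.1]], [[0.9]]]
print(segment_video(frames, window_size=1))  # per-frame only: the flicker survives
print(segment_video(frames, window_size=3))  # temporal window: the flicker is smoothed out
```

Because the encoder is passed in as an argument, swapping it for another per-frame model leaves the temporal stage untouched, which mirrors the claim that the decoder can be paired with any suitable encoder.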
Results
Model performance was evaluated on the CholecSeg8k dataset and a private dataset of robotic partial nephrectomy procedures. When the temporal decoder was applied, mean Intersection over Union (mIoU) improved by 1.30% and 4.27% on the two datasets, respectively. Temporal consistency also improved, by up to 7.23%.
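For readers unfamiliar with the headline metric, the following is a minimal sketch of how mean Intersection over Union is commonly computed for one frame. The function name `mean_iou`, the flat label lists and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the abstract does not state the exact averaging protocol used in the evaluation.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over flattened label maps of equal length."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In this example the mean of 1/2 and 2/3 is 7/12, roughly 0.583.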
Conclusions
This work demonstrates an advance in video segmentation of surgical scenes, with potential applications in surgery aimed at improving patient outcomes. The proposed decoder can extend state-of-the-art static models, and we show that it improves both per-frame segmentation output and video temporal consistency.
Ethics declarations
Conflict of interest
Drs. Grammatikopoulou, Sanchez-Matilla, Bragman, Owen, Culshaw, Kerr, Luengo and Prof. Stoyanov are employees of Medtronic plc. Prof. Stoyanov is a co-founder and shareholder in Odin Vision, Ltd.
Ethical approval
Medtronic plc maintains all necessary rights and consents to process, analyze and display the private data referenced in this study.
About this article
Cite this article
Grammatikopoulou, M., Sanchez-Matilla, R., Bragman, F. et al. A spatio-temporal network for video semantic segmentation in surgical videos. Int J CARS 19, 375–382 (2024). https://doi.org/10.1007/s11548-023-02971-6
DOI: https://doi.org/10.1007/s11548-023-02971-6