Abstract
Purpose
The recent segment anything model (SAM) has demonstrated impressive performance with point, text, or bounding box prompts in various applications. However, in safety-critical surgical tasks, prompting is not feasible because (1) per-frame prompts for supervised learning are lacking, (2) prompting frame-by-frame in a real-time tracking application is unrealistic, and (3) annotating prompts for offline applications is expensive.
Methods
We develop Surgical-DeSAM, which generates automatic bounding box prompts for a decoupled SAM to obtain real-time instrument segmentation in robotic surgery. We adopt a commonly used detection architecture, DETR, and fine-tune it to obtain bounding box prompts for the instruments. We then employ decoupled SAM (DeSAM) by replacing the image encoder with the DETR encoder and fine-tuning the prompt encoder and mask decoder to obtain instance segmentation of the surgical instruments. To improve detection performance, we adopt the Swin-transformer for better feature representation.
Results
The proposed method has been validated on two publicly available datasets from the MICCAI surgical instrument segmentation challenge, EndoVis 2017 and 2018. Our method is also compared with SOTA instrument segmentation methods and demonstrates significant improvements, with dice scores of 89.62 and 90.70 on EndoVis 2017 and 2018, respectively.
Conclusion
Our extensive experiments and validations demonstrate that Surgical-DeSAM enables real-time instrument segmentation without any additional prompting and outperforms other SOTA segmentation methods.
Introduction
Robot-assisted surgery is gaining increasing attention in the field of intelligent robotics. Several existing works apply deep learning techniques to instance segmentation of surgical instruments. While these models have significantly advanced instance segmentation performance on surgical datasets, they have yet to fully harness the capabilities of either the most recent segmentation models or advanced object detectors, which presents an opportunity for further refinement. The well-known segmentation foundation model SAM (segment anything model) [1] and its adaptations to medical image and surgical instrument segmentation [2] have shown great promise in semantic segmentation. However, these models cannot produce object-label segmentation, and they require interactive prompting at deployment, which is unrealistic.
In this work, we (1) propose Surgical-DeSAM to generate automatic bounding box prompting for a decoupled SAM; (2) design Swin-DETR by replacing ResNet with the Swin-transformer as the image feature extractor of DETR [3]; (3) decouple SAM (DeSAM) by replacing SAM's image encoder with the DETR encoder; (4) validate on two publicly available surgical instrument segmentation datasets, EndoVis17 and EndoVis18; and (5) demonstrate robustness compared to the SOTA models.
Methodology
Preliminaries
SAM
SAM [1] is a foundation model for prompt-based image segmentation, trained on the largest segmentation dataset to date with over 1 billion high-quality masks. SAM is a simply designed transformer composed of a heavyweight image encoder, a prompt encoder, and a lightweight mask decoder. The image encoder extracts features directly from input images without the need for a separate backbone, while the lightweight prompt encoder dynamically transforms any given prompt into an embedding vector in real time. These embeddings are then processed by the mask decoder, generating precise segmentation masks. Prompts come in various types, including points, boxes, text, or masks, which limits SAM's ability to be used directly in real-world applications such as surgical instrument segmentation during surgery: it is unrealistic to provide a prompt for each frame of a surgical video.
DETR
DETR (DEtection TRansformer) [3] is a transformer-based object detector. It consists of a CNN backbone, an encoder-decoder transformer, and feed-forward networks (FFNs). The backbone is the commonly used ResNet50 [4], which extracts a feature representation (\(\in \Re ^{d\times H\times W}\)) from the input image (\(\in \Re ^{3\times H_0\times W_0}\)). The backbone output then passes to the transformer encoder with spatial positional encoding, producing the encoder memory. The decoder receives the encoder memory together with learned object queries and predicts the class labels and bounding boxes, with centre coordinates, height, and width, using FFNs.
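The DETR pipeline above can be sketched as a small PyTorch module. This is an illustrative toy, not the paper's configuration: the backbone is a single strided convolution standing in for ResNet50, positional encodings are omitted, and all sizes (`d_model`, `num_queries`, `num_classes`) are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class ToyDETR(nn.Module):
    """Minimal DETR-style detector: backbone -> flatten -> transformer -> FFN heads."""

    def __init__(self, d_model=64, num_queries=10, num_classes=7):
        super().__init__()
        # Stand-in for ResNet50: a single 16x16 strided conv producing a (d, H, W) map.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.transformer = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.query_embed = nn.Embedding(num_queries, d_model)   # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, h, w), normalised

    def forward(self, images):                        # images: (B, 3, H0, W0)
        f = self.backbone(images)                     # (B, d, H, W)
        seq = f.flatten(2).transpose(1, 2)            # collapse spatial dims: (B, HW, d)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(seq, queries)           # (B, num_queries, d)
        return self.class_head(hs), self.box_head(hs).sigmoid()

logits, boxes = ToyDETR()(torch.randn(2, 3, 128, 128))
print(logits.shape, boxes.shape)  # torch.Size([2, 10, 8]) torch.Size([2, 10, 4])
```

Each of the `num_queries` decoder outputs yields one class distribution and one normalised box, matching the set-prediction formulation DETR uses.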
Surgical-DeSAM
As shown in Fig. 1, we propose Surgical-DeSAM to automate bounding box prompting by designing (1) Swin-DETR: replacing DETR's ResNet50 with the Swin-transformer to build an efficient model for surgical instrument detection; and (2) decoupling SAM: replacing SAM's image encoder with the DETR encoder and training end-to-end so that the detections prompt SAM's mask decoder to segment the surgical instruments.
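The decoupled flow can be illustrated with a minimal sketch: the detector's encoder features are reused as the image representation, and a predicted box acts as the prompt for a cross-attending mask decoder. All module names and sizes here are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

d = 64
# Shared DETR-encoder output, reused in place of SAM's image encoder: (B, HW, d).
encoder_features = torch.randn(1, 16 * 16, d)
# Detector-predicted box acting as the automatic prompt: (cx, cy, h, w).
pred_box = torch.tensor([[0.4, 0.5, 0.3, 0.2]])

prompt_encoder = nn.Linear(4, d)                      # embeds the box prompt into a token
mask_decoder = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
mask_head = nn.Linear(d, 16 * 16)                     # per-prompt low-res mask logits

prompt_tokens = prompt_encoder(pred_box).unsqueeze(1)        # (B, 1, d)
decoded = mask_decoder(prompt_tokens, encoder_features)      # cross-attend to features
mask_logits = mask_head(decoded).reshape(1, 16, 16)
print(mask_logits.shape)  # torch.Size([1, 16, 16])
```

The key point the sketch captures is that no separate SAM image encoder runs at inference: one shared encoder feeds both the detection heads and the mask decoder.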
SWIN-DETR
DETR utilises ResNet50 as the backbone CNN to extract the feature representation. However, as vision-transformer-based networks are showing much better performance than CNNs, we replace the backbone network with a recent transformer-based architecture, the Swin-transformer [5], to form our Swin-DETR as presented in Fig. 1. The Swin-transformer introduces a shifted-window-based hierarchical transformer for greater efficiency in the self-attention computation. It is important to note that the output of the Swin-transformer can be fed directly to the DETR encoder, whereas the ResNet50 feature requires an additional step to collapse its spatial dimensions into one dimension, converting it into an input sequence for the transformer. Overall, Swin-DETR consists of a Swin-transformer to extract the image features, which are then passed to the transformer encoder-decoder and FFNs to obtain the final object class predictions and corresponding bounding boxes. More specifically, ResNet50 requires converting the feature map \(f_\textrm{resnet} \in \Re ^{d\times H\times W}\) into \(f \in \Re ^{d\times HW}\) by collapsing the spatial dimensions, whereas the Swin-transformer directly produces an output feature map \(f_\textrm{swin} \in \Re ^{d\times HW}\).
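The shape bookkeeping described above is a one-line operation in PyTorch: a ResNet50-style spatial map must be flattened into a token sequence before entering the transformer encoder, whereas a Swin-style output is already in that layout. Sizes below are illustrative.

```python
import torch

d, H, W = 256, 8, 8
f_resnet = torch.randn(1, d, H, W)            # (B, d, H, W) spatial feature map
f_flat = f_resnet.flatten(2).transpose(1, 2)  # collapse spatial dims -> (B, HW, d)

f_swin = torch.randn(1, H * W, d)             # Swin stages emit token sequences directly
print(f_flat.shape, f_swin.shape)  # torch.Size([1, 64, 256]) torch.Size([1, 64, 256])
```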
Decoupling SAM
As the image encoders of SAM and DETR perform similar feature extraction, we decouple SAM by removing its image encoder and feeding the DETR encoder output directly to the mask decoder. This allows training an end-to-end segmentation model using the DETR-predicted detection prompt and a decoupled SAM consisting of the prompt encoder and mask decoder only. During training, we utilise ground-truth detection bounding boxes and segmentation masks to train both models end-to-end. For the losses, we adopt a box loss \({\mathcal {L}}_\textrm{box}\) combining GIoU [6] and \(l_1\) losses for the detection task, following DETR, and a dice similarity coefficient (DSC) loss \({\mathcal {L}}_\textrm{dsc}\) for the segmentation task. The total loss can therefore be formulated as \(\textrm{Loss}_\textrm{total} = {\mathcal {L}}_\textrm{box} + {\mathcal {L}}_\textrm{dsc}\).
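The combined objective can be sketched as follows. The GIoU and dice formulations are standard; the weighting coefficients `w_giou` and `w_l1` are illustrative assumptions (DETR uses its own loss weights), not values reported by the paper.

```python
import torch

def giou(a, b):
    """Generalized IoU for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    inter_wh = (torch.min(a[:, 2:], b[:, 2:]) - torch.max(a[:, :2], b[:, :2])).clamp(min=0)
    inter = inter_wh[:, 0] * inter_wh[:, 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a + area_b - inter
    hull_wh = (torch.max(a[:, 2:], b[:, 2:]) - torch.min(a[:, :2], b[:, :2])).clamp(min=0)
    hull = hull_wh[:, 0] * hull_wh[:, 1]                   # smallest enclosing box
    return inter / union - (hull - union) / hull

def box_loss(pred, target, w_giou=2.0, w_l1=5.0):
    """L_box: weighted GIoU + l1 terms, as in DETR (weights here are illustrative)."""
    return w_giou * (1 - giou(pred, target)).mean() + w_l1 * (pred - target).abs().mean()

def dice_loss(pred_mask, target_mask, eps=1e-6):
    """L_dsc: 1 - dice similarity coefficient, for soft masks in [0, 1]."""
    inter = (pred_mask * target_mask).sum()
    return 1 - (2 * inter + eps) / (pred_mask.sum() + target_mask.sum() + eps)

pred_b = torch.tensor([[0.1, 0.1, 0.5, 0.5]])
total = box_loss(pred_b, pred_b) + dice_loss(torch.ones(4, 4), torch.ones(4, 4))
print(float(total))  # 0.0 for perfect predictions
```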
Experiment and results
Dataset
We utilise two benchmark robotic instrument segmentation datasets, EndoVis17 and EndoVis18. Each dataset provides instrument segmentation for several video sequences. For EndoVis17, we use the first video sequences, 1 to 8, for training and the remaining sequences, 8 and 9, for testing. For EndoVis18, we use sequences 2, 5, 9, and 15 for testing and the remaining sequences for training, following ISINet [7].
Implementation details
We choose the AdamW optimiser with a learning rate of \(10^{-4}\) and weight decay of 0.1 to update the model parameters. The baseline DETR and SAM code is adopted from the official repositories, which use the PyTorch deep learning framework.
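The stated optimiser configuration translates directly to PyTorch; the linear layer below is only a placeholder for the full network.

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the full Surgical-DeSAM network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
print(optimizer.param_groups[0]["lr"], optimizer.param_groups[0]["weight_decay"])  # 0.0001 0.1
```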
Results
We conduct experiments on both object detection and semantic segmentation on the robotic instrument datasets and obtain the instance segmentation performance of our model. Table 1 compares the performance of our model with other SOTA models for robotic instrument instance segmentation on the EndoVis 17 and EndoVis 18 datasets. Our Surgical-DeSAM outperforms the other SOTA segmentation models on both mIoU and DICE scores. Qualitative visualisations of the predictions are presented in Fig. 2. There are almost no false positives with our model, as it segments the whole instrument based on the bounding box class predicted by Swin-DETR. We also observe high detection performance with Swin-DETR in Table 2, where the predicted bounding boxes are mostly accurate with only slight deviations of the box regions.
Ablation study
To investigate the superiority of the Swin-transformer [5] backbone over ResNet50 [4], we conducted an ablation study on the detection task alone and on the combined detection-prompt and segmentation tasks. In Table 2, the first two rows demonstrate the superior detection performance of DETR-SwinB (DETR with Swin-transformer) compared to DETR-R50 (DETR with ResNet50). The subsequent rows compare the results of Surgical-DeSAM with ResNet50 and Swin-transformer backbones. It is evident that Surgical-DeSAM with a Swin-transformer backbone significantly outperforms Surgical-DeSAM with a ResNet50 backbone, achieving a 2.7% higher mAP in the detection task and a 7.1% higher DICE score in the segmentation task.
Discussion and conclusion
In this paper, we have presented a novel model architecture, Surgical-DeSAM, which decouples SAM to automate bounding box prompting for surgical instrument segmentation. For better feature extraction, we replaced ResNet50 with the Swin-transformer for instrument detection. To automate the bounding box prompting, we decoupled SAM by removing its image encoder and feeding the DETR encoder features and predicted bounding boxes to the SAM prompt encoder and mask decoder to obtain the final segmentation. The experimental results demonstrate the efficiency of our model in comparison with other state-of-the-art surgical instrument segmentation techniques. Future work could focus on the robustness and reliability of Surgical-DeSAM-based detection and segmentation.
Code availability
The source code of this work is available at https://github.com/YuyangSheng/Surgical-DeSAM.
References
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y, et al (2023) Segment anything. arXiv preprint arXiv:2304.02643
Ma J, Wang B (2023) Segment anything in medical images. arXiv preprint arXiv:2304.12306
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision, pp 213–229. Springer
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 658–666
González C, Bravo-Sánchez L, Arbelaez P (2020) Isinet: an instance-based approach for surgical instrument segmentation. In: Conference on medical image computing and computer-assisted intervention, pp 595–605. Springer
Iglovikov V, Shvets A (2018) Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746
Jin Y, Cheng K, Dou Q, Heng P-A (2019) Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In: Medical image computing and computer assisted intervention–MICCAI 2019: 22nd international conference, Shenzhen, China, Proceedings, Part V 22, pp 440–448. Springer
Zhao Z, Jin Y, Gao X, Dou Q, Heng P-A (2020) Learning motion flows for semi-supervised instrument segmentation from robotic surgical video. In: Medical image computing and computer assisted intervention–MICCAI 2020: 23rd International conference, Lima, Peru, Proceedings, Part III 23, pp 679–689. Springer
Meinhardt T, Kirillov A, Leal-Taixe L, Feichtenhofer C (2022) Trackformer: multi-object tracking with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8844–8854
Zhao Z, Jin Y, Heng P-A (2022) Trasetr: track-to-segment transformer with contrastive query for instance-level instrument segmentation in robotic surgery. In: 2022 International conference on robotics and automation (ICRA), pp 11186–11193. IEEE
Baby B, et al (2023) From forks to forceps: a new framework for instance segmentation of surgical instruments. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 6191–6201
Yue W, Zhang J, Hu K, Xia Y, Luo J, Wang Z (2023) Surgicalsam: efficient class promptable surgical instrument segmentation. arXiv preprint arXiv:2308.08746
Wang A, Islam M, Xu M, Zhang Y, Ren H (2023) Sam meets robotic surgery: an empirical study on generalization, robustness and adaptation. Medical image computing and computer assisted intervention— MICCAI 2023 workshops: ISIC 2023. Care-AI 2023, MedAGI 2023, DeCaF 2023, held in conjunction with MICCAI 2023, Vancouver, BC, Canada, Proceedings. Springer, Berlin, Heidelberg, pp 234–244
Acknowledgements
This work was carried during the dissertation project of Yuyang Sheng MSc in Robotics and Computation, Department of Computer Science, University College London. This work was supported in whole, or in part, by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) [203145/Z/16/Z] and the Engineering and Physical Sciences Research Council (EPSRC) [EP/W00805X/1, EP/Y01958X/1].
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
This article does not contain patient data.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sheng, Y., Bano, S., Clarkson, M.J. et al. Surgical-DeSAM: decoupling SAM for instrument segmentation in robotic surgery. Int J CARS (2024). https://doi.org/10.1007/s11548-024-03163-6