Introduction

Robot-assisted surgery is gaining increasing attention in the field of intelligent robotics. Several existing works apply deep learning techniques to instance segmentation of surgical instruments. While these models have significantly advanced instance segmentation performance on surgical datasets, they have yet to fully harness the capabilities of either the most recent segmentation models or advanced object detectors, which presents an opportunity for further refinement and enhancement. The well-known segmentation foundation model SAM (segment anything model) [1], together with adaptations of SAM for medical image segmentation and surgical instrument segmentation [2], has shown great promise in semantic segmentation. However, these models cannot produce instance-level (object label) segmentation, and they require interactive prompting at deployment time, which is impractical during surgery.

Fig. 1 Surgical-DeSAM: Swin-DETR detector and decoupling SAM for instrument segmentation

In this work, we (1) propose Surgical-DeSAM to generate automatic bounding box prompts for a decoupled SAM; (2) design Swin-DETR by replacing ResNet with the Swin-transformer as the image feature extractor of DETR [3]; (3) decouple SAM (DeSAM) by replacing SAM's image encoder with DETR's encoder; (4) validate on two publicly available surgical instrument segmentation datasets, EndoVis17 and EndoVis18; and (5) demonstrate improved robustness compared to SOTA models.

Methodology

Preliminaries

SAM

SAM [1] is a foundation model for prompt-based image segmentation, trained on the largest segmentation dataset to date with over 1 billion high-quality masks. SAM has a simple transformer-based design composed of a heavyweight image encoder, a prompt encoder, and a lightweight mask decoder. The image encoder extracts image features directly from the input image without the need for a separate backbone, while the lightweight prompt encoder transforms any given prompt into an embedding vector in real time. These embeddings are then processed by the mask decoder, which generates precise segmentation masks. Prompts can take various forms, including points, boxes, text, or masks, which limits SAM's ability to be used directly in real-world applications such as surgical instrument segmentation during surgery: it is unrealistic to provide a prompt for each frame of a surgical video.
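To illustrate the manual prompting workflow, a minimal sketch of box-prompted inference with the publicly released segment-anything package is shown below; the checkpoint path, image file, and box coordinates are placeholders, and every frame would need such a box supplied by hand.

```python
# Minimal sketch of SAM's prompt-based inference (segment-anything package).
# Checkpoint path, image file, and box coordinates are illustrative placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                       # heavyweight image encoder runs here

box = np.array([100, 150, 400, 480])             # manual box prompt: x1, y1, x2, y2
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
```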

Table 1 Performance comparison of the proposed Surgical-DeSAM model and the SOTA models on EndoVis 2017 and 2018
Fig. 2 Comparison of the instance segmentation results with other models

DETR

DETR (DEtection TRansformer) [3] is a transformer-based object detector. It consists of a CNN backbone, an encoder-decoder transformer, and feed-forward networks (FFNs). The CNN backbone is the commonly used ResNet50 [4], which extracts a feature representation (\(\in \Re ^{d\times H\times W}\)) from the input image (\(\in \Re ^{3\times H_0\times W_0}\)). The backbone output is passed to the transformer encoder together with spatial positional encodings to produce the encoder memory. The decoder attends to this memory with a set of learned object queries and predicts, via FFNs, the class labels and bounding boxes parameterised by centre coordinates, height, and width.
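As a concrete reference for the output format, a short sketch of loading the official DETR-R50 release via torch.hub (assuming hub access is available) and inspecting its query-wise predictions:

```python
# Sketch: load the reference DETR-R50 from the official hub release and inspect
# its outputs (per-query class logits and normalised centre/size boxes).
import torch

detr = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
detr.eval()

x = torch.randn(1, 3, 800, 1066)                 # dummy image batch
with torch.no_grad():
    out = detr(x)

print(out["pred_logits"].shape)                  # (1, 100, 92): 100 object queries
print(out["pred_boxes"].shape)                   # (1, 100, 4): cx, cy, h, w
```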

Surgical-DeSAM

As shown in Fig. 1, we propose Surgical-DeSAM to automate bounding box prompting by designing (1) Swin-DETR: replacing the ResNet50 of DETR with the Swin-transformer to build an efficient model for surgical instrument detection; and (2) decoupling SAM: replacing SAM's image encoder with the DETR encoder and training end-to-end so that the predicted detections prompt SAM's mask decoder to segment the surgical instruments.
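The overall data flow can be summarised with the following sketch; the module interfaces are placeholders used to mirror the description above and in Fig. 1, not the released implementation or the exact SAM API.

```python
# Illustrative forward pass of Surgical-DeSAM. Module interfaces are placeholders
# that mirror the data flow described in the text.
import torch

def surgical_desam_forward(image, swin_detr, prompt_encoder, mask_decoder):
    # 1) Swin-DETR: Swin backbone + transformer encoder-decoder + FFN heads.
    memory = swin_detr.encode(image)             # DETR encoder memory (image features)
    logits, boxes = swin_detr.decode(memory)     # instrument classes + bounding boxes

    # 2) Decoupled SAM: the heavyweight SAM image encoder is removed; the DETR
    #    encoder memory serves as the image embedding and the predicted boxes
    #    serve as automatic prompts.
    prompt_embeddings = prompt_encoder(boxes)
    masks = mask_decoder(memory, prompt_embeddings)
    return logits, boxes, masks
```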

Swin-DETR

DETR utilises ResNet50 as the backbone CNN to extract the feature representation. However, as vision-transformer-based networks show much better performance than CNNs, we replace the backbone with the recent Swin-transformer [5] to form our Swin-DETR, as presented in Fig. 1. The Swin-transformer introduces a shifted-window-based hierarchical transformer that makes the self-attention computation more efficient. It is important to note that the output of the Swin-transformer can be fed directly to the DETR encoder, whereas the ResNet50 feature map requires an additional step to collapse its spatial dimensions into one dimension to form a sequence of transformer inputs. Overall, Swin-DETR consists of a Swin-transformer that extracts the image features, which are then passed to the transformer encoder-decoder and FFNs to obtain the final object class predictions and corresponding bounding boxes. More specifically, ResNet50 requires converting the feature map \(f_\textrm{resnet} \in \Re ^{d\times H\times W}\) into \(f \in \Re ^{d\times HW}\) by collapsing the spatial dimensions, whereas the Swin-transformer directly produces an output feature map \(f_\textrm{swin} \in \Re ^{d\times HW}\).
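The dimension bookkeeping can be illustrated with a few lines of PyTorch; the channel and spatial sizes are chosen arbitrarily for the sketch.

```python
# A ResNet50 feature map must be flattened before entering the DETR encoder,
# whereas a Swin-transformer stage already yields a token sequence.
import torch

d, H, W = 256, 25, 34
f_resnet = torch.randn(1, d, H, W)      # CNN feature map: (batch, d, H, W)
f = f_resnet.flatten(2)                 # collapse spatial dims -> (batch, d, H*W)
print(f.shape)                          # torch.Size([1, 256, 850])

f_swin = torch.randn(1, H * W, d)       # Swin tokens: (batch, H*W, d)
print(f_swin.transpose(1, 2).shape)     # (batch, d, H*W), matching f above
```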

Decoupling SAM

As the image encoders of SAM and DETR perform similar feature extraction, we decouple SAM by removing its image encoder and feeding the DETR encoder output directly to the mask decoder. This allows us to train an end-to-end segmentation model using the DETR-predicted detections as prompts and a decoupled SAM consisting only of the prompt encoder and mask decoder. During training, we use both the ground-truth detection bounding boxes and the segmentation masks to train both models end-to-end. For the losses, we adopt a box loss \({\mathcal {L}}_\textrm{box}\) combining GIoU [6] and \(l_1\) losses for the detection task, following DETR, and a dice similarity coefficient (DSC) loss \({\mathcal {L}}_\textrm{dsc}\) for the segmentation task. The total loss \(\textrm{Loss}_\textrm{total}\) can therefore be formulated as:

$$\begin{aligned} \textrm{Loss}_\textrm{total} = {\mathcal {L}}_\textrm{box} + {\mathcal {L}}_\textrm{dsc} \end{aligned}$$
(1)
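A sketch of Eq. (1) is given below, assuming torchvision's generalized_box_iou_loss (v0.13+) is available, that predicted boxes have already been matched to their ground truths (DETR's Hungarian matching is omitted), and that boxes are given in x1, y1, x2, y2 format; loss weights are likewise omitted.

```python
# Sketch of the training objective in Eq. (1): GIoU + l1 box losses for detection
# and a Dice loss for segmentation.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def dice_loss(mask_logits, gt_masks, eps=1e-6):
    pred = mask_logits.sigmoid().flatten(1)
    gt = gt_masks.flatten(1)
    inter = (pred * gt).sum(-1)
    dice = (2 * inter + eps) / (pred.sum(-1) + gt.sum(-1) + eps)
    return 1 - dice.mean()

def total_loss(pred_boxes, gt_boxes, mask_logits, gt_masks):
    loss_box = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean") \
               + F.l1_loss(pred_boxes, gt_boxes)
    loss_dsc = dice_loss(mask_logits, gt_masks)
    return loss_box + loss_dsc
```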

Experiment and results

Dataset

We utilise two benchmark robotic instrument segmentation datasets, EndoVis17Footnote 1 and EndoVis18.Footnote 2 Each dataset provides instrument segmentation annotations for several video sequences. For EndoVis17, we use video sequences 1 to 8 for training and sequences 8 and 9 for testing. For EndoVis18, following ISINet [7], we use sequences 2, 5, 9, and 15 for testing and the remaining sequences for training.

Table 2 Comparison of DETR and our model with different backbone networks for the detection and segmentation tasks

Implementation details

We use the AdamW optimiser with a learning rate of \(10^{-4}\) and a weight decay of 0.1 to update the model parameters. The baseline DETR and SAM code is adapted from the official repositories, which use the PyTorch deep learning framework.
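For reference, the optimiser configuration corresponds to the following one-liner; `model` is a placeholder standing in for the full Surgical-DeSAM network.

```python
# Optimiser setup as described above; `model` is a placeholder module.
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
```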

Results

We conduct experiments on both object detection and semantic segmentation on the robotic instrument datasets and derive the instance segmentation performance of our model. Table 1 compares the performance of our model with other SOTA models for robotic instrument instance segmentation on the EndoVis17 and EndoVis18 datasets. Our Surgical-DeSAM outperforms the other SOTA segmentation models in both mIoU and DICE scores. The qualitative visualisation of the predictions is presented in Fig. 2. Our model produces almost no false positives, as it segments the whole instrument based on the bounding box class predicted by Swin-DETR. We also observe high detection performance for Swin-DETR in Table 2, where the predicted bounding boxes are mostly accurate with only slight deviations in the box regions.
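For clarity on the reported metrics, a simplified sketch of IoU and Dice on binary masks is shown below; per-class averaging and instance matching are omitted.

```python
# Sketch of the evaluation metrics: intersection-over-union (IoU) and Dice score
# on binary masks.
import numpy as np

def iou_score(pred, gt, eps=1e-6):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def dice_score(pred, gt, eps=1e-6):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
```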

Ablation study

To investigate the superiority of the Swin-transformer [5] backbone over ResNet50 [4], we conducted an ablation study on the detection task alone and on the combined detection-prompting and segmentation tasks. In Table 2, the first two rows demonstrate the superior detection performance of DETR-SwinB (DETR with Swin-transformer) compared to DETR-R50 (DETR with ResNet50). The subsequent rows compare Surgical-DeSAM with ResNet50 and Swin-transformer backbones. It is evident that Surgical-DeSAM with a Swin-transformer backbone significantly outperforms Surgical-DeSAM with a ResNet50 backbone, achieving a 2.7% higher mAP in the detection task and a 7.1% higher DICE score in the segmentation task.

Discussion and conclusion

In this paper, we have presented a novel model architecture, Surgical-DeSAM, which decouples SAM to automate bounding box prompting for surgical instrument segmentation. To obtain better feature extraction, we replaced ResNet50 with the Swin-transformer for instrument detection. To automate the bounding box prompting, we decoupled SAM by removing its image encoder and feeding the DETR encoder features and predicted bounding boxes to the SAM prompt encoder and mask decoder to obtain the final segmentation. The experimental results demonstrate the effectiveness of our model in comparison with other state-of-the-art techniques for surgical instrument segmentation. Future work could focus on the robustness and reliability of Surgical-DeSAM-based detection and segmentation.