Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter

Purpose
In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance when trained on limited datasets, while also enhancing model robustness across various surgical scenarios.

Methods
We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and the data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multi-scale features derived from the CNN encoder with the feature embeddings from the ViT, ensuring that the ViT's global insights are integrated with the CNN's local information.

Results
Extensive experiments demonstrate that our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multi-class segmentation on the EndoVis 2017 and EndoVis 2018 datasets. It also shows remarkable robustness in cross-dataset validation across these three datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies on these datasets confirm the efficacy of our novel adapter module.

Conclusion
In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of the ViT with the local, multi-scale spatial capabilities of the CNN. This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at: https://github.com/weimengmeng1999/AdapterSIS.git.

Supplementary Information
The online version contains supplementary material available at 10.1007/s11548-024-03140-z.


Training Memory
We report the training memory for all blocks and the number of parameters of our model when employing ViT-S and ViT-B as the feature encoder for the ViT branch, as detailed below.
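As a pointer for reproducing such measurements, the following is a minimal PyTorch sketch for counting parameters and recording peak GPU memory over one training step; it is not the authors' measurement script, and the model, loss, and tensor shapes are placeholder assumptions.

```python
# Minimal sketch (not the authors' measurement script): parameter counting and
# peak GPU memory for one training step of a hypothetical segmentation model.
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

def peak_training_memory_mb(model: nn.Module, batch: torch.Tensor, target: torch.Tensor) -> float:
    """Run one forward/backward pass and report peak allocated CUDA memory in MB (requires a GPU)."""
    device = torch.device("cuda")
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()          # placeholder loss for a binary mask
    torch.cuda.reset_peak_memory_stats(device)
    out = model(batch.to(device))
    loss = criterion(out, target.to(device))
    loss.backward()
    return torch.cuda.max_memory_allocated(device) / 1024 ** 2
```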

Statistical Significance
We performed a t-test using the data presented in Table 1 (Comparison on the Robust-MIS 2019 dataset between state-of-the-art models) in our manuscript to statistically validate the significance of our model's improvements. Sample groups were chosen from the Robust-MIS 2019 dataset, and we calculated the p-values to compare our model's performance against the current state-of-the-art methods. The p-value of our model against the next best-performing model indicates statistical significance at a standard 0.05 level.
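For illustration, such a p-value can be computed with SciPy as sketched below; the scores shown are placeholders rather than values from the manuscript, and the paired variant (ttest_rel) may be more appropriate when scores come from the same test cases.

```python
# Illustrative only: two-sample t-test with SciPy, assuming per-case Dice scores
# for two models are available as arrays. The numbers below are placeholders.
import numpy as np
from scipy import stats

ours = np.array([0.92, 0.89, 0.94, 0.91, 0.90])        # hypothetical per-case Dice, our model
next_best = np.array([0.88, 0.86, 0.91, 0.87, 0.85])   # hypothetical per-case Dice, next best model

t_stat, p_value = stats.ttest_ind(ours, next_best)     # use stats.ttest_rel for paired samples
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")          # significant at alpha = 0.05 if p < 0.05
```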

Comprehensive Related Work

Surgical Instrument Segmentation
The majority of surgical instrument segmentation works are CNN-based methods. For example, ISINet [1] proposes an instance-based surgical instrument segmentation CNN that includes a temporal consistency module to leverage temporal information. OR-UNet [2] is an optimized and robust 2D UNet [3] with residual blocks and multi-scale segmentation maps for instrument segmentation in endoscopic images. Although the majority of approaches rely on CNNs, recent works increasingly explore vision transformer-based methods. For instance, MATIS [4] is a fully transformer-based method that utilizes pixel-wise attention mechanisms and a masked attention module for surgical instrument segmentation while enhancing temporal consistency with video transformers. TraSeTR [5] introduces a Track-to-Segment transformer that leverages tracking cues, prior temporal knowledge, and contrastive query learning to enhance surgical instrument segmentation.

Pre-trained Vision Transformers
Driven by extensive pretraining on large datasets, ViT [6] employs masked patch prediction for self-supervised vision tasks. He et al. [7] present masked autoencoders (MAE) for efficient self-supervised vision learning, using an asymmetric encoder-decoder and heavy input masking. Similarly, BEiT [8] advances the notion of predicting discrete tokens within vision models. Furthermore, DINO [9] investigates whether self-supervised learning imparts unique advantages to ViT architectures, particularly in terms of enhancing semantic segmentation capabilities. DINOv2 [10] further advances the training of large-scale ViT models with 1B parameters and distills them into a series of smaller models that surpass the best available all-purpose features. Pre-trained vision transformers have been successfully applied to downstream tasks such as image classification [11,12], object detection [10], semantic segmentation [10,11], and video action classification [12]. Research on fine-tuning cross-attention modules with pre-trained embeddings, such as [13], aligns with our approach of harnessing pre-trained knowledge from large-scale ViT models. Yet no existing work adapts pre-trained ViT features through a CNN adapter, which is crucial given limited data availability [6].
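As a concrete illustration of using such a frozen pre-trained encoder, the sketch below loads a DINOv2 ViT-S backbone via torch.hub and extracts patch embeddings; the hub entry point and output keys follow the public DINOv2 repository and should be verified against the installed version, and this is not our exact pipeline.

```python
# Sketch: extract frozen patch embeddings from a pre-trained DINOv2 ViT-S via torch.hub.
import torch

vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
vit.eval()
for p in vit.parameters():            # freeze the ViT feature encoder
    p.requires_grad = False

image = torch.randn(1, 3, 224, 224)   # H and W must be multiples of the 14-pixel patch size
with torch.no_grad():
    feats = vit.forward_features(image)
patch_tokens = feats["x_norm_patchtokens"]   # (B, N_patches, C) embeddings fed to a downstream adapter
```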

Hybrid CNN and ViT Models
ViTs and CNNs inherently complement each other. The Swin Transformer [12] adapts the concept of an expanding receptive field from CNNs for use in ViTs. Numerous studies also advocate fusing these two architectures to address the limitations of each model. For instance, TransUNet [14] is a hybrid in which a ViT processes CNN-derived patches for global context, and the decoder combines these with high-resolution CNN feature maps for diverse medical applications. By employing a parallel architecture that combines ViT and CNN, TransFuse [15] efficiently captures global dependencies and low-level spatial details, featuring the novel BiFusion module for effective multi-level feature fusion. Recently, CTCNet [16] was designed for medical image segmentation, combining Swin Transformers [17] and residual CNNs through a cross-domain fusion block. There are also works that simulate the characteristics of CNNs in their ViT models [13] or directly adopt the cross-attention mechanism to augment a CNN structure [18], but none of the existing work integrates cross-attention into a CNN model to serve as a lightweight adapter for a pre-trained ViT model.
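To make the idea concrete, the following is a minimal sketch of a cross-attention fusion module in which flattened CNN feature maps serve as queries and frozen ViT patch tokens serve as keys and values; the layer choices, dimensions, and class name are illustrative assumptions, not the paper's exact adapter.

```python
# Minimal sketch (not the paper's exact adapter): cross-attention fusing CNN features with ViT tokens.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, cnn_dim: int, vit_dim: int, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(cnn_dim, embed_dim)    # project CNN features to queries
        self.kv_proj = nn.Linear(vit_dim, embed_dim)   # project ViT tokens to keys/values
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, cnn_feat: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_feat: (B, C, H, W) from one CNN stage; vit_tokens: (B, N, vit_dim) frozen ViT embeddings
        B, C, H, W = cnn_feat.shape
        q = self.q_proj(cnn_feat.flatten(2).transpose(1, 2))   # (B, H*W, embed_dim)
        kv = self.kv_proj(vit_tokens)                           # (B, N, embed_dim)
        fused, _ = self.attn(query=q, key=kv, value=kv)         # CNN queries attend to ViT tokens
        fused = self.norm(fused + q)                            # residual connection
        return fused.transpose(1, 2).reshape(B, -1, H, W)       # back to a spatial feature map
```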

Detailed Datasets Introduction
We conducted our main evaluation experiments on the Robust-MIS 2019 dataset, comprising 10,040 annotated endoscopic images from 30 surgeries [19]. The training data involved 5,983 video clips from proctocolectomy and rectal resection procedures, with each clip's final frame annotated. Testing was structured in three stages: Stages 1 and 2 included 1,177 images, and Stage 3 introduced 2,880 images from sigmoid resection, a procedure not present in the training set. We also performed cross-dataset validation between the Robust-MIS 2019 dataset and 4 other surgical image datasets: 1) EndoVis 2017 [20], which features robotic instrument images from robotic-assisted surgeries [16]; its training set uses 225 frames from 8 sequences annotated for tool details, and testing involves the last 75 frames from these 8 sequences plus 2 full-length sequences (300 frames each) distinct from training. 2) EndoVis 2018 [21], which includes 15 video sequences, divided into 11 for training and 4 for testing, encompassing 7 specified instrument types. 3) CholecSeg8k [22], which is derived from cholecystectomy surgery videos of the Cholec80 dataset and provides 80 annotated frames in each of its 101 directories, summing to 8,080 frames. 4) AutoLaparo [23], which is derived from full-length hysterectomy videos, yielding a segmentation sub-dataset of 1,800 frames. Each dataset was split into training and validation subsets at an 8:2 ratio with no patient overlap across splits.
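As an illustration of such a patient-wise 8:2 split, the sketch below uses scikit-learn's GroupShuffleSplit with the surgery/patient identifier as the group label; the file paths and IDs are placeholders, not part of the released datasets.

```python
# Illustrative sketch: 8:2 training/validation split with no patient overlap.
from sklearn.model_selection import GroupShuffleSplit

frames = ["surg01/f001.png", "surg01/f002.png", "surg02/f001.png", "surg03/f001.png"]
patients = ["surg01", "surg01", "surg02", "surg03"]   # group label = surgery/patient ID

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(frames, groups=patients))
train_frames = [frames[i] for i in train_idx]   # all frames of a patient stay in one split
val_frames = [frames[i] for i in val_idx]
```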

Figure 1
Figure 1 displays representative samples of binary segmentation on the Robust-MIS 2019 dataset across various performance scenarios, showing that our model produces well-delineated edges, handles overlapping instruments, and preserves fine details. In the cases with the lowest Mean Dice scores, boundary ambiguity arises in high-reflection scenarios.

Table 1
Training Memory and Parameter Count for Our Model