A faithful segmentation of surgical instruments in endoscopic videos is a crucial component of surgical scene understanding and realization of automation in computer- or robot-assisted intervention systems. A majority of recent approaches address the problem of surgical instrument segmentation by training deep neural networks (DNNs) in a fully-supervised scheme. However, the applicability of such supervised approaches is restricted by the availability of a sufficiently large amount of real videos with clean annotations. The annotation process (especially pixel-wise) can be prohibitively expensive (see Fig. 1) because it takes valuable time of medical experts.

Fig. 1 Fully supervised deep learning is unrealistic for instrument segmentation due to a significantly high annotation effort

An alternative direction to mitigate the dependency on annotated video sequences is to utilize synthetic data for the training of DNNs. Recent advances in graphics and simulation infrastructures have paved the way to automatically create a large number of photo-realistic simulated images with accurate pixel-level labels [14, 23]. However, DNNs trained purely on simulated images do not generalize well to real endoscopic videos due to the domain shift/bias issue [30, 32]. We hypothesize that a DNN’s bias towards recognizing textures rather than shapes [4] results in a significant drop in performance when the DNNs are trained on simulation (rendered) data and applied to real environments. This is mainly because the heterogeneity of information within a real surgical scene is heavily influenced by factors such as lighting conditions, motion blur, blood, smoke, specular reflection, noise etc., whereas simulation data only mimic the shapes of instruments and patient-specific organs [10].

The research problem of learning a task-specific representation from the annotated data in a source domain (e.g. simulation) that generalizes to a different but related target domain (e.g. real) is commonly referred to as visual domain adaptation [33]. Unsupervised domain adaptation (UDA) is a specific scenario of domain adaptation where the annotations for the target domain are not available during learning [34]. Here, the primary goal is to learn domain-invariant feature representations for addressing the domain shift/bias [14, 35]. For instance, Pfeiffer et al. [23] utilized an image-to-image translation approach, where the simulated images are translated into realistic looking ones by mapping image styles (texture, lighting) of the real data using a Cycle-GAN. In contrast, we argued for a shape-focused joint learning from simulated and real data in an end-to-end fashion and introduced a consistency-learning-based approach, Endo-Sim2Real [27], to align the DNNs on both domains. We showed that similar performance can be obtained by employing a non-adversarial approach while improving the computational efficiency (with respect to training time). However, similar to perturbation-based consistency learning approaches for image classification [11, 17], Endo-Sim2Real, being a consistency learning approach at its core, suffers from so-called confirmation bias [29]. This is caused by noise accumulation or erroneous learning during the training stages, which may result in a degenerate solution [6].

In this work, we introduce the teacher–student learning paradigm to the task of surgical instrument segmentation in endoscopic videos. Our proposed approach tackles the erroneous learning by improving the pseudo-label generation procedure for the unlabeled data and facilitates stable training of DNNs while maintaining computational efficiency. Through quantitative and qualitative analysis, we show that our proposed approach outperforms the previous Endo-Sim2Real approach across three datasets. Moreover, the proposed approach leads to stable training without losing computational efficiency.

The contributions of our work are as follows:

  1. We formalise the consistency-based unsupervised domain adaptation framework to identify the confirmation bias problem of Endo-Sim2Real and propose a teacher–student learning paradigm to address this problem.

  2. We evaluate our work on three different datasets with varying degrees of the domain gap to show consistent improvement in the generalization capability of the DNN across the datasets and in the presence of unseen instruments or multiple instrument combinations.

  3. We provide a thorough quantitative and qualitative analysis to show the strengths and limitations of our approach. In particular, we identify the failure modes with respect to specific cases and scenarios in order to provide valuable insights into addressing the remaining performance gap.

Fig. 2 Proposed teacher–student learning approach comprising supervised learning from source (simulation) data as well as consistency learning from unlabeled target (real) data

Related work

Research on instrument segmentation for endoscopic procedures is dominated by supervision-based approaches ranging from full supervision [5], semi/self-supervision [25], and weak supervision [12] up to multi-task [16] and multi-modal learning [15]. Some recent works also explored unsupervised approaches [7, 18]. For the sake of brevity, however, we only focus on approaches that employ learning from simulation data for unsupervised domain adaptation.

Within the context of domain adaptation in surgical domains, Mahmood et al. [20] proposed an adversarial-based transformer network to translate a real image to a synthetic image such that a depth estimation model trained on synthetic images can be applied to the real image. On the other hand, Rau et al. [24] proposed a conditional Generative Adversarial Network (GAN)-based approach to estimate depth directly from real images. Other works have argued for translating synthetic images to photo-realistic images by using domain mapping via style transfer [19, 21], for instance by using Cycle-GAN-based unpaired image-to-image translation [9, 22], and then utilizing annotations from the synthetic environment for deep learning tasks.

Pfeiffer et al. [23] proposed an unpaired image-to-image translation approach I2I that focuses on reducing the distribution difference between the source and the target domain by employing a Cycle-GAN-based style transfer. Afterwards, a DNN is trained on the translated images and its corresponding labels. On the other hand, Endo-Sim2Real [27] utilizes similarity-based joint learning from both simulation and real data under the assumption that the shape of an instrument remains consistent across domains as well as under semantic preserving perturbations (like adding pixel-level noise or transformations).

This work is in line with Endo-Sim2Real and focuses on end-to-end learning for unsupervised domain adaptation. However, we formalise the consistency-based UDA to identify the confirmation bias problem and unstable training of Endo-Sim2Real approach and address it by employing a teacher–student paradigm. This facilitates stable training of the DNN and enhances its performance generalization capability.


Our proposed teacher–student domain-adaptation approach (see Fig. 2) aims to bridge the domain gap between source (simulated) and target (real) data by aligning a DNN model to both domains. Given:

  • a source domain \(D_s = (X_s,Y_s)\) associated with a feature space \(|X_s|\) and a label space \(|Y_s|\) and containing \(n_s\) labeled samples \(\{ (x_i^s, y_i^s) \}_{i=1}^{n_s}\), where \(x_i^s \in X_s\) and \(y_i^s \in Y_s\) denote the i-th pair of image and label data, respectively

  • a target domain \(D_t = (X_t)\) associated with a feature space \(|X_t|\) and a label space \(|Y_t|\) and containing \(n_t\) unlabeled samples \(\{ x_i^t \}_{i=1}^{n_t}\), where \(x_i^t \in X_t\) denotes the i-th image of the unlabeled data

the goal of unsupervised domain adaptation is to learn a DNN model that generalizes on the target domain \(D_t\). It is important to note that although the simulation and real endoscopic scene may appear similar, the label space between source- and target-domains generally differ (i.e. \(Y_s \ne Y_t\)), representing for example different organs or different instrument types. Since we are focusing on binary instrument segmentation, the label categories are twofold (i.e. \(Y_s = Y_t = \{ ``\hbox {instrument''}, ``\hbox {background''} \}\)). For the sake of simplicity, we refer to the source domain \(D_s\) as labeled simulated domain \(D_L^{Sim}\) and to the target domain \(D_t\) as unlabeled real domain \(D_{UL}^{Real}\).

Our proposed (and previous) approach learns by jointly minimizing the supervised loss \(L_{sl}\) for the labeled simulated data-pair as well as the consistency loss \(L_{cl}\) for the unlabeled real data. A core component of the joint learning approach is unsupervised consistency learning, where a supervisory signal is generated by enforcing the DNN \(f_{\theta }\) (parameterized with network weights \(\theta \)) to produce a consistent output for an unlabeled input x and its perturbed form \(\mathcal {P}(x)\).

$$\begin{aligned} \min _{\theta } \ \ \mathcal {L}_{sl} + \mathcal {L}_{cl} \ \ \big \{ \underbrace{f_{\theta } \big ( x \big )}_{\tilde{\mathbf{y }}}, f_{\theta } \big ( \mathcal {P}(x) \big ) \big \} \end{aligned}$$

Here in Eq. 1, the DNN prediction \(\tilde{\mathbf{y }}\) for unperturbed data x acts as a pseudo-label for perturbed data \(\mathcal {P}(x)\) to guide the learning process. Therefore, the Endo-Sim2Real scenario can be interpreted as a student-as-teacher approach where the DNN acts as both a teacher that produces pseudo-labels and a student that learns from these labels. Since the DNN predictions may be incorrect or noisy during training [17], this student-as-teacher approach leads to the so-called confirmation bias [29]: the student overfits to the incorrect pseudo-labels generated by the teacher, which prevents it from learning new information. This issue is especially prominent during early stages of the training, when the DNN still lacks the correct interpretation of the labels. If the unsupervised consistency loss (\(L_{cl}\)) outweighs the supervised loss (\(L_{sl}\)), the learning process is not effective and leads to sub-optimal performance. Therefore, the consistency loss is typically employed with a temporal weighting function w(t) such that the DNN learns predominantly from the supervised loss during the initial stages of the learning and gradually shifts towards unsupervised consistency learning in the later stages.

Although the temporal ramp-up weighting function in Endo-Sim2Real helps to reduce the effect of the confirmation bias during joint learning, the DNN still learns directly from the incorrect pseudo-labels generated by the teacher.

$$\begin{aligned}&\min _{\theta } \left\{ \underbrace{\sum _{(x_i,y_i) \in D_L^{\mathrm{Sim}}} \mathcal {L}_{sl} \left( f_{\theta } \left( x_i \right) ,y_i \right) }_{\mathrm{supervised\,simulation}}\right. \nonumber \\&\left. \quad + w(t) *\underbrace{ \sum _{x_i \in D_{\mathrm{UL}}^{\mathrm{Real}}} \mathcal {L}_{cl} \left( f_{\theta ^{'}} \left( x_i \right) , f_{\theta } \left( \mathcal {P}(x_i)\right) \right) }_{\mathrm{unsupervised\,real}} \right\} \nonumber \\&\theta _{t}^{'} = ( \alpha \cdot \theta _{t-1}^{'} + (1 - \alpha ) \cdot \theta _{t} ) \end{aligned}$$

In this work, we address this major drawback of the Endo-Sim2Real approach by improving the pseudo-label generation procedure of the unlabeled consistency learning. To this end, the teacher network is de-coupled from the student network and redefined (\(f_{\theta } \longrightarrow f_{\theta ^{'}}\)) to generate reliable targets to enable the student to gradually learn meaningful information about the instrument shape. In order to avoid separate training of the teacher model, the same architecture is used for the teacher and its parameters are updated as a temporal average [29] of the student network’s weights.

At each training step t, the student \(f_{\theta }\) is updated using gradient descent while the teacher \(f_{\theta ^{'}}\) is updated using the student network weights, where the smoothing factor \(\alpha \) controls the update rate of the teacher. Pseudocode of our proposed teacher–student learning approach is provided in Algorithm 1.
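The teacher update \(\theta _{t}^{'} = \alpha \cdot \theta _{t-1}^{'} + (1 - \alpha ) \cdot \theta _{t}\) can be illustrated with a minimal sketch. It is not the authors' implementation: weights are stored as plain Python lists keyed by a hypothetical layer name, the helper `ema_update` is introduced only for illustration, and the student is kept fixed so that the effect of the averaging is easy to follow.

```python
def ema_update(teacher, student, alpha=0.95):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in student
    }


# Toy weights keyed by a hypothetical layer name (assumption for illustration).
student = {"conv1": [1.0, -2.0]}
teacher = {"conv1": [0.0, 0.0]}

history = []
for step in range(3):
    # In a real training loop the student would be updated by gradient
    # descent here, before the teacher is refreshed.
    teacher = ema_update(teacher, student, alpha=0.95)
    history.append(teacher["conv1"][0])
```

With \(\alpha = 0.95\) the teacher moves 5% of the remaining distance towards the student at every step, so noisy changes in the student weights are smoothed before they influence the pseudo-labels.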

Algorithm 1 Pseudocode of the proposed teacher–student learning approach
Table 1 List of source (simulation) and target (real) datasets used during evaluation, where [videos (#) | empty frames (%)] reflects the number of videos and percentage of frames with no instrument, respectively

Experimental setup


Simulation [23] data contain 20K rendered images acquired via 3-D laparoscopic simulations from the CT scans of 10 patients. The images describe a rendered view of a laparoscopic scene with each tissue having a distinct texture and a presence of two conventional surgical instruments (grasper and hook) under a random placement of the camera (coupled with a light source).

Cholec [27] data contain around 7K endoscopic video frames acquired from 15 videos of the Cholec80 dataset [31]. The images describe the laparoscopic cholecystectomy scene with seven conventional surgical instruments (grasper, hook, scissors, clipper, bipolar, irrigator and specimen bag). The data provide segmentations for each instrument type. However, the specimen bag is considered a counterexample and is treated as background during evaluation, following the definition of an instrument in the RobustMIS challenge [26].

EndoVis [1] data consist of 300 images from six different in-vivo 2D recordings of complete laparoscopic colorectal surgeries. The data provide binary segmentations of instruments for validation where images describe an endoscopic scene containing seven conventional instruments (including hook, traumatic grasper, ligasure, stapler, scissors and scalpel) [5].

RobustMIS [26] data consist of around 10K images acquired from 30 surgical procedures of three different types of colorectal surgery (10 rectal resection procedures, 10 proctocolectomy procedures and 10 sigmoid resection procedures). An instrument is defined as an elongated rigid object that is manipulated directly from outside the patient. Therefore, graspers, scalpels, clip applicators, hooks, stapling devices, suction devices and even trocars are considered instruments, while non-rigid tubes, bandages, compresses, needles, coagulation sponges, metal clips etc. are considered counterexamples as they are indirectly manipulated from outside [26]. The data provide instance-level segmentations for validation, which is performed in three different stages with an increasing domain gap between the training and the test data. Stage 1 contains video frames from 16 cases of the training data, stage 2 has video frames from two proctocolectomy and two rectal surgeries, and stage 3 has video frames from 10 sigmoid resection surgeries.

It is important to note that the domain gap increases not only across the three stages of testing in the RobustMIS dataset, but also from the Simulation towards the Real datasets (EndoVis < Cholec < RobustMIS), as the definition of an instrument (and/or counterexample) changes along with other factors (Table 1).

Table 2 Quantitative comparison using DSC [mean (std)], where [empty frames (%)] reflects the percentage of empty frames
Fig. 3 Qualitative analysis on EndoVis dataset. The green color in the images represents the network predictions while the yellow color represents under-segmentation


We have redesigned the implementation of the Endo-Sim2Real framework in view of a teacher–student approach. To ensure a direct and fair comparison, we employ the same TernausNet-11 [28] as the backbone segmentation model. We also adopt the best performing perturbation scheme of Endo-Sim2Real (i.e. applying one of the pixel-intensity perturbations followed by one of the pixel-corruption perturbations) and its loss function (i.e. cross-entropy and Jaccard) for evaluation. All simulated input images and labels are first pre-processed with a stochastically-varying circular outer mask to give them the appearance of real endoscopic images.
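The circular outer mask can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify how the mask varies, so the radius range (`min_scale`, `max_scale`), the centre jitter (`max_shift`) and the helper names are hypothetical. The sketch only shows that the same random mask is applied to an image and its label so that both stay aligned.

```python
import random


def circular_mask(height, width, rng, min_scale=0.85, max_scale=1.0, max_shift=0.05):
    """Build a binary mask (1 inside the circle, 0 outside) with random radius and centre."""
    radius = 0.5 * min(height, width) * rng.uniform(min_scale, max_scale)
    cy = (height - 1) / 2.0 + rng.uniform(-max_shift, max_shift) * height
    cx = (width - 1) / 2.0 + rng.uniform(-max_shift, max_shift) * width
    return [
        [1 if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2 else 0
         for x in range(width)]
        for y in range(height)
    ]


def apply_mask(array, mask):
    """Zero out every pixel (or label) that lies outside the circular mask."""
    return [[value * m for value, m in zip(row, mask_row)]
            for row, mask_row in zip(array, mask)]


rng = random.Random(0)                    # seeded for reproducibility
image = [[5] * 32 for _ in range(32)]     # toy single-channel image
label = [[1] * 32 for _ in range(32)]     # toy binary instrument label
mask = circular_mask(32, 32, rng)
masked_image = apply_mask(image, mask)    # same mask for image ...
masked_label = apply_mask(label, mask)    # ... and label
```

Pixels outside the circle become black in the image and background in the label, which mimics the dark border of a real endoscopic view.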

We train with a batch size of 8 for 50 epochs and apply weight decay (\(10^{-6}\)) as standard regularization. During consistency training, we use a time-dependent weighting function, where the weight of the unlabeled loss term is linearly increased over the training. The teacher model is updated with \(\alpha = 0.95\) at each training step.
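A linear ramp-up of the consistency weight w(t) could look like the sketch below. The final weight `max_weight` and the step counts are assumptions, as the text only states that the weight increases linearly over the training; `joint_loss` simply combines the two scalar loss values as in the joint objective.

```python
def linear_rampup(step, total_steps, max_weight=1.0):
    """Weight w(t) that grows linearly from 0 to max_weight over total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return max_weight * fraction


def joint_loss(supervised_loss, consistency_loss, step, total_steps):
    """Supervised loss plus the ramp-up weighted consistency loss."""
    return supervised_loss + linear_rampup(step, total_steps) * consistency_loss


# Early in training the supervised term dominates; later both terms contribute.
early = joint_loss(1.0, 2.0, step=0, total_steps=100)
halfway = joint_loss(1.0, 2.0, step=50, total_steps=100)
late = joint_loss(1.0, 2.0, step=100, total_steps=100)
```

At step 0 the consistency term is switched off entirely, which limits the influence of unreliable pseudo-labels while the network still lacks a correct interpretation of the labels.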

During evaluation, we compute an image-based dice score and average it over all images to obtain a global dice metric for the dataset. Images for which both the prediction and the ground truth are empty are excluded from this average, since the dice score is undefined for them. Images with an empty ground truth but a non-empty prediction (false positives) are included with a dice score of zero. All reported results are averaged over three runs.
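The evaluation protocol described above can be sketched in a few lines. Masks are represented here as flat lists of 0/1 values, and the function names are chosen for illustration only.

```python
def image_dice(pred, gt):
    """Dice score for two binary masks, or None when both masks are empty."""
    pred_sum, gt_sum = sum(pred), sum(gt)
    if pred_sum == 0 and gt_sum == 0:
        return None  # undefined: excluded from the dataset average
    intersection = sum(p * g for p, g in zip(pred, gt))
    # An empty ground truth with false positives yields intersection 0, i.e. dice 0.
    return 2.0 * intersection / (pred_sum + gt_sum)


def dataset_dice(pairs):
    """Mean image-based dice over all (pred, gt) pairs with a defined score."""
    scores = [s for s in (image_dice(p, g) for p, g in pairs) if s is not None]
    return sum(scores) / len(scores) if scores else None


pairs = [
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # both empty: excluded
    ([1, 0, 0, 0], [0, 0, 0, 0]),  # false positive on an empty frame: dice 0
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # partial overlap: dice 2/3
]
mean_dice = dataset_dice(pairs)
```

In this toy example the mean is taken over two images only, so a single false positive on an empty frame pulls the dataset score down noticeably.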

Results and discussion

This section provides a quantitative comparison with respect to the state-of-the-art approaches to demonstrate the effectiveness of our approach. Moreover, we perform quantitative and qualitative analyses on three different datasets with varying degrees of the domain gap to highlight the challenges in simulation-to-real unsupervised domain adaptation. Particularly, we identify the failure modes with respect to specific cases and scenarios in order to provide valuable insights into addressing the remaining performance gap.

Fig. 4 Visualization of the relation between tool co-occurrence and segmentation quality for the Cholec dataset. Please note that the dice score is zero for no tool cases and specimen bag as it is treated as background

Comparison with baseline and state-of-the-art

In these experiments, we first highlight the performance of the two baselines: the lower baseline (supervised learning purely on simulated data) and the upper baseline (supervised learning purely on annotated real data) in Table 2. The substantial performance gap between the baselines indicates the domain gap between simulated and real data. Secondly, we compare our proposed teacher–student approach with other unsupervised domain adaptation approaches, i.e. the domain style transfer approach (I2I) and the plain consistency-based joint learning approach (Endo-Sim2Real) on the Cholec dataset. The empirical results show that Endo-Sim2Real performs similarly to I2I, while our proposed approach outperforms both of these approaches. We then evaluate our approach on two additional datasets and show that it consistently outperforms Endo-Sim2Real. These experiments demonstrate that the generalization performance of the DNN can be enhanced by employing unsupervised consistency learning on unlabeled data. Finally, the performance gap with the upper baseline calls for identification of the issues that need to be resolved to bridge the remaining domain gap.

Analysis on EndoVis

Among the three datasets, our proposed approach performs best for EndoVis as shown in Table 2. A visual analysis of the low performing cases in Fig. 3 highlights factors such as false detection on specular reflection, under-segmentation for small instruments, tool-tissue interaction and partially occluded instruments. These factors can in part be addressed by utilizing the temporal information of video frames [13].

Analysis on Cholec

We performed an extensive performance analysis of our proposed approach on the Cholec dataset as instrument-specific labels are available for it (in contrast to EndoVis). To understand the distinctive performance aspects for the Cholec dataset, we compare the segmentation performance across different instrument co-occurrences in Fig. 4. The similar range of dice scores indicates that the performance of our approach is only weakly affected by the presence of multiple tool combinations in an endoscopic image.

Table 3 Quantitative results for multi-instrument presence in RobustMIS dataset using DSC [mean (std)]

However, it also clearly shows that the segmentation performance of our approach drops when the specimen bag and its related co-occurrences are present (as seen in the respective box plots in Fig. 4). A visual analysis highlights false detection on the reflective surface of the specimen bag.

Fig. 5 Qualitative analysis on the Cholec dataset. The green color in the images represents the network predictions while the yellow color represents under-segmentation

Apart from the previously analyzed performance degrading factors in the EndoVis dataset, other major factors affecting the performance are as follows:


  • Out-of-distribution cases such as a non-conventional tool-shape-like instrument: specimen bag (see box-plots for labelsets with specimen bag in Fig. 4).

  • False detection for scenarios such as an endoscopic view within the trocar, instrument(s) near the image border or under-segmentation for small instruments.

  • Artefact cases such as specular reflection. The impact of other artefacts such as blood, smoke or motion blur is lower.

Although our proposed approach struggles to tackle these artefacts and out-of-distribution cases, addressing these performance degrading factors is itself an open research problem [2].

Fig. 6 Qualitative analysis on the RobustMIS dataset. The green color in the images represents the network predictions while the yellow color represents under-segmentation

Analysis on RobustMIS

We analyzed the performance of our approach on images with a different number of instruments in the RobustMIS dataset. We found that the performance is not significantly affected by the presence of multiple tools (see Table 3). The low performance for a single visible instrument is attributed to small, stand-alone instruments at the image boundary. Apart from the factors in the Cholec dataset, other real-world performance degrading factors in RobustMIS include: presence of other out-of-distribution cases such as non-rigid tubes, bandages, needles etc.; presence of corner cases such as trocar-views and specular reflections producing an instrument-shape-like appearance. These failure cases highlight a drawback of our approach, which works under the assumption that the shape of the instrument remains consistent between the domains. Therefore, our approach may not be able to produce faithful predictions when the real domain contains instruments whose shapes differ from those in the simulation, or counterexamples with an instrument-like appearance.

Impact of empty ground-truth frames

The performance of our teacher–student approach is negatively affected by the video frames that do not contain instruments. This is because the dice score is set to zero when the network predicts false positives (as seen in Figs. 5 and 6) in instrument-free video frames. This effect can be seen directly in Table 2, where the dice score across the datasets decreases as the percentage of empty frames increases from EndoVis to RobustMIS. It suggests that adding techniques for suppressing false detections to the current framework can help in enhancing the generalization capabilities.


We introduce teacher–student learning to address the confirmation bias issue of the Endo-Sim2Real consistency learning. This enables us to tackle the challenging problem of the domain shift between synthetic and real images for surgical tool segmentation in endoscopic videos. Our proposed approach enforces the teacher model to generate reliable targets to facilitate stable student learning. Since the teacher is a moving average model of the student, the extension does not add computational complexity to the current approach.

We show that the proposed teacher–student learning approach generalizes across three different datasets for the instrument segmentation task and consistently outperforms the previous state-of-the-art. For a majority of images (see high peak in Figs. 3, 5 and 6), the segmentation predictions are usually correct with small variations across the instrument boundary. Moreover, a thorough analysis of the results highlights interpretable failure modes of simulation-to-real deep learning as the domain gap widens progressively.

Considering the strengths and limitations of our teacher–student enabled simulation-to-real unsupervised domain adaptation approach, the framework admits multiple straightforward extensions to bridge the remaining domain gap:


  • Implementing techniques to suppress false detection for empty frames, instruments near the image border and specular reflections, for instance by utilizing temporal information [13] of video frames.

  • Improving physical properties of simulation to capture instrument-tissue interaction, considering the variations in predictions across instrument boundaries.

  • Extension towards semi-supervised domain adaptation or real-to-real unsupervised domain adaptation by utilizing labels from target (real) data for the endoscopic instrument segmentation task.

  • Employing this approach in conjunction with other self-supervised or adversarial domain mapping approaches such as I2I [23].

Being flexible, end-to-end and unsupervised with respect to the target domain, our approach can be adapted to other imaging modalities or learning tasks which utilize joint learning from labeled and unlabeled data. For instance, it can be extended towards other domain-adaptation tasks, such as depth estimation [20, 24] or instrument pose estimation [3, 8] by exploiting depth maps from the simulated virtual environments.

The heavy reliance of current approaches on manual annotation, together with the limited time that surgeons can spare for the annotation process, makes simulation-to-real domain adaptation an obvious problem to address in surgical data science. The proposed approach is a step towards annotation-efficient surgical data science for the operating room of the future.