Simulation-to-Real domain adaptation with teacher-student learning for endoscopic instrument segmentation

Purpose: Segmentation of surgical instruments in endoscopic videos is essential for automated surgical scene understanding and process modeling. However, relying on fully supervised deep learning for this task is challenging because manual annotation consumes valuable time of clinical experts. Methods: We introduce a teacher-student learning approach that learns jointly from annotated simulation data and unlabeled real data to tackle the erroneous learning problem of the current consistency-based unsupervised domain adaptation framework. Results: Empirical results on three datasets highlight the effectiveness of the proposed framework over current approaches for the endoscopic instrument segmentation task. Additionally, we provide an analysis of the major factors affecting the performance on all datasets to highlight the strengths and failure modes of our approach. Conclusion: We show that our proposed approach can successfully exploit the unlabeled real endoscopic video frames and improve generalization performance over pure simulation-based training and the previous state-of-the-art. This takes us one step closer to effective segmentation of surgical tools in the annotation-scarce setting.


Introduction
A faithful segmentation of surgical instruments in endoscopic videos is a crucial component of surgical scene understanding and of the realization of automation in computer- or robot-assisted intervention systems. 1 A majority of recent approaches address the problem of surgical instrument segmentation by training deep neural networks (DNNs) in a fully-supervised scheme. However, the applicability of such supervised approaches is restricted by the availability of a sufficiently large amount of real videos with clean annotations. The annotation process (especially pixel-wise) can be prohibitively expensive (see Figure 1) because it takes valuable time of medical experts.
An alternative direction to mitigate the dependency on annotated video sequences is to utilize synthetic data for the training of DNNs. Recent advances in graphics and simulation infrastructures have paved the way to automatically create a large number of photo-realistic simulated images with accurate pixel-level labels. 14;23 However, DNNs trained purely on simulated images do not generalize well to real endoscopic videos due to the domain shift issue. 30;32 We hypothesize that a DNN's bias towards recognizing textures rather than shapes 4 results in a significant drop of performance when the DNNs are trained on simulation (rendered) data and applied to real environments. This is mainly because the heterogeneity of information within a real surgical scene is heavily influenced by factors such as lighting conditions, motion blur, blood, smoke, specular reflection, noise etc., whereas simulation data only mimic the shapes of instruments and patient-specific organs. 10 To alleviate this performance generalization issue, domain adaptation approaches 33 are utilized to address domain shift/mismatch.
When the annotations for the target domain are not available for learning, domain adaptation is applied in an unsupervised manner, 34 where the primary goal of the learning algorithm is to enable the DNN to learn domain-invariant features. For instance, Pfeiffer et al. 23 utilized an image-to-image translation approach, where the simulated images are translated into realistic-looking ones by mapping image styles (texture, lighting) of the real data using a Cycle-GAN. In contrast, we argued for a shape-focused joint learning from simulated and real data in an end-to-end fashion and introduced a consistency-learning based approach, Endo-Sim2Real, 27 to align the DNNs on both domains. We showed that similar performance can be obtained by employing a non-adversarial approach while improving the computational efficiency (with respect to training time). However, similar to perturbation-based consistency learning approaches, 17;11 Endo-Sim2Real suffers from so-called confirmation bias. 29 This is caused by noise accumulation or erroneous learning during the training stages, which may result in a degenerate solution. 6
In this work, we formalise the consistency-based unsupervised domain adaptation framework to identify the confirmation bias problem of Endo-Sim2Real and propose a teacher-student learning paradigm to address this problem. Our proposed approach tackles the erroneous learning by improving the pseudo-label generation procedure for the unlabeled data and facilitates stable training of DNNs while maintaining computational efficiency. Through quantitative and qualitative analysis, we show that our proposed approach outperforms the previous Endo-Sim2Real approach across three datasets. Moreover, the proposed approach leads to stable training without losing computational efficiency. Our contribution in this article is two-fold:
1. We introduce the teacher-student learning paradigm to the task of surgical instrument segmentation in endoscopic videos.
2. We provide a thorough quantitative and qualitative analysis to show the failure modes of our unsupervised domain adaptation approach. Our analysis leads to a better understanding of the strengths and challenges of consistency-based unsupervised learning with simulation-based supervision.

Related Work
Research on instrument segmentation for endoscopic procedures is dominated by supervision-based approaches ranging from full supervision 5 , semi/self supervision 25 , and weak supervision 12 up to multi-task 16 and multi-modal learning 15 . Some recent works also explored unsupervised approaches; 7;18 however, for the sake of brevity, we will only focus on approaches that employ learning from simulation data for unsupervised domain adaptation. Within the context of domain adaptation in surgical domains, Mahmood et al. 20 proposed an adversarial-based transformer network to translate a real image to a synthetic image such that a depth estimation model trained on synthetic images can be applied to the real image. On the other hand, Rau et al. 24 proposed a conditional Generative Adversarial Network (GAN) based approach to estimate depth directly from real images. Other works have argued for translating synthetic images to photo-realistic images by using domain mapping via style transfer 19;21 , for instance by using Cycle-GAN based unpaired image-to-image translation 9;22 , and for utilizing annotations from the synthetic environment for deep learning tasks.
Pfeiffer et al. 23 proposed an unpaired image-to-image translation approach, I2I, that focuses on reducing the distribution difference between the source and the target domain by employing a Cycle-GAN based style transfer. Afterwards, a DNN is trained on the translated images and their corresponding labels. On the other hand, Endo-Sim2Real 27 utilizes similarity-based joint learning from both simulation and real data under the assumption that the shape of an instrument remains consistent across domains as well as under semantic-preserving perturbations (like adding pixel-level noise or transformations). This work is in line with Endo-Sim2Real and focuses on end-to-end learning for unsupervised domain adaptation (UDA). However, we formalise the consistency-based UDA to identify the confirmation bias problem and the unstable training of the Endo-Sim2Real approach and address them by employing a teacher-student paradigm. This facilitates stable training of the DNN and enhances its performance generalization capability.

Method
Our proposed teacher-student domain-adaptation approach (see Figure 2) aims to bridge the domain gap between source (simulated) and target (real) data by aligning a DNN model to both domains. Given:
• a source domain $D_s = (X_s, Y_s)$ associated with a feature space $X_s$ and a label space $Y_s$ and containing $n_s$ labeled samples, where $x_i \in X_s$ and $y_i \in Y_s$ denote the $i$-th pair of image and label data, respectively
• a target domain $D_t = (X_t)$ associated with a feature space $X_t$ and a label space $Y_t$ and containing $n_t$ unlabeled samples, where $x_i \in X_t$ denotes the $i$-th image of the unlabeled data
the goal of unsupervised domain adaptation is to learn a DNN model that generalizes on the target domain $D_t$. It is important to note that although the simulated and real endoscopic scenes may appear similar, the label spaces of the source and target domains generally differ (i.e. $Y_s \neq Y_t$), representing for example different organs or different instrument types. Since we are focusing on binary instrument segmentation, the label categories are twofold (i.e. $Y_s = Y_t = \{$"instrument", "background"$\}$). For the sake of simplicity, we refer to the source domain $D_s$ as the labeled simulated domain $D^{Sim}_{L}$ and to the target domain $D_t$ as the unlabeled real domain $D^{Real}_{UL}$. Our proposed (and previous) approach learns by jointly minimizing the supervised loss $L_{sl}$ for the labeled simulated data pairs as well as the consistency loss $L_{cl}$ for the unlabeled real data. A core component of the joint learning approach is unsupervised consistency learning, where a supervisory signal is generated by enforcing the DNN $f_\theta$ (parameterized with network weights $\theta$) to produce a consistent output for an unlabeled input $x$ and its perturbed form $P(x)$.
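The display equation for this consistency objective is not reproduced in this text. A form consistent with the definitions above (the prediction on the unperturbed input serving as the target for the perturbed input) would be the following; the exact arrangement of terms and the equation number are assumptions rather than the authors' typeset formula.

```latex
\tilde{y} = f_{\theta}(x), \qquad
\min_{\theta} \; L_{cl}\big( f_{\theta}(P(x)),\; \tilde{y} \big),
\qquad x \in D^{Real}_{UL}
\tag{1}
```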
Here in Equation 1, the DNN prediction $\tilde{y}$ for unperturbed data $x$ acts as a pseudo-label for perturbed data $P(x)$ to guide the learning process. Therefore, the Endo-Sim2Real scenario can be interpreted as a student-as-teacher approach where the DNN acts as both a teacher that produces pseudo-labels and a student that learns from these labels. Since the DNN predictions may be incorrect or noisy during training, 17 this student-as-teacher approach leads to the so-called confirmation bias, 29 which drives the student to overfit to the incorrect pseudo-labels generated by the teacher and prevents it from learning new information. This issue is especially prominent during the early stages of training, when the DNN still lacks the correct interpretation of the labels. If the unsupervised consistency loss ($L_{cl}$) outweighs the supervised loss ($L_{sl}$), the learning process is not effective and leads to sub-optimal performance. Therefore, the consistency loss is typically employed with a temporal weighting function $w(t)$ such that the DNN learns predominantly from the supervised loss during the initial stages of learning and gradually shifts towards unsupervised consistency learning in the later stages.
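The weighted joint objective described in this paragraph can be written compactly as below. This is a sketch under assumed notation: the superscripts $s$ and $r$ (for simulated and real samples) are introduced here for illustration and are not taken from the original typeset equations.

```latex
\min_{\theta} \;
L_{sl}\big( f_{\theta}(x^{s}),\; y^{s} \big)
\;+\; w(t)\,
L_{cl}\big( f_{\theta}(P(x^{r})),\; f_{\theta}(x^{r}) \big),
\qquad (x^{s}, y^{s}) \in D^{Sim}_{L}, \;\; x^{r} \in D^{Real}_{UL}
```

With $w(t)$ close to zero early in training, the supervised term dominates; as $w(t)$ grows, the consistency term contributes more to the gradient.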
Although the temporal ramp-up weighting function in Endo-Sim2Real helps to reduce the effect of the confirmation bias during joint learning, the DNN still learns directly from the incorrect pseudo-labels generated by the teacher.
In this work, we address this major drawback of the Endo-Sim2Real approach by improving the pseudo-label generation procedure of the unlabeled consistency learning. To this end, the teacher network is de-coupled from the student network and redefined ($f_\theta \rightarrow f_{\theta'}$) to generate reliable targets that enable the student to gradually learn meaningful information about the instrument shape. To avoid separate training of the teacher model, the same architecture is used for the teacher and its parameters $\theta'$ are updated as a temporal average 29 of the student network's weights.
At each training step $t$, the student $f_\theta$ is updated using gradient descent while the teacher $f_{\theta'}$ is updated using the student network weights, where the smoothing factor $\alpha$ controls the update rate of the teacher. Pseudocode of our proposed teacher-student learning approach is provided in Algorithm 1.
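The corresponding display equations are not reproduced in this text. Assuming the usual exponential-moving-average form of a temporally averaged teacher, and writing the teacher weights as $\theta'$ (the prime and the superscripts $s$, $r$ are notation assumed here), the decoupled objective and the teacher update would read:

```latex
\min_{\theta} \;
L_{sl}\big( f_{\theta}(x^{s}),\; y^{s} \big)
\;+\; w(t)\,
L_{cl}\big( f_{\theta}(P(x^{r})),\; f_{\theta'}(x^{r}) \big)

\theta'_{t} \;=\; \alpha\, \theta'_{t-1} \;+\; (1 - \alpha)\, \theta_{t}
```

The only change relative to the student-as-teacher setting is that the pseudo-label for the real image now comes from $f_{\theta'}$ rather than from $f_{\theta}$ itself.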
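The teacher update can be illustrated with a minimal, framework-free sketch. It assumes the standard exponential-moving-average rule and represents network weights as plain dictionaries of float lists; the function name `ema_update` and this weight representation are hypothetical and chosen only for illustration, and the default smoothing factor of 0.95 is the value reported in the Implementation section.

```python
def ema_update(teacher, student, alpha=0.95):
    """Update teacher weights in place as an exponential moving average.

    teacher, student: dicts mapping a parameter name to a list of floats
    (an illustrative stand-in for real network tensors).
    alpha: smoothing factor; larger values make the teacher change more slowly.
    """
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for i, s in enumerate(student_values):
            teacher_values[i] = alpha * teacher_values[i] + (1.0 - alpha) * s
    return teacher


# The teacher starts as a copy of the student, so no separate training is needed.
student = {"w": [1.0, 2.0], "b": [0.5]}
teacher = {name: list(values) for name, values in student.items()}

# Pretend one gradient-descent step changed the student weights.
student["w"] = [2.0, 2.0]

# The teacher then moves a small step (1 - alpha) towards the student.
ema_update(teacher, student, alpha=0.95)
print(teacher)
```

Because the teacher only averages existing weights, the update involves no additional backward pass, which is in line with the claim that the extension keeps the computational cost of training essentially unchanged.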

Data
Simulation 23 data contain 20K rendered images acquired via 3-D laparoscopic simulations from the CT scans of 10 patients. The images describe a rendered view of a laparoscopic scene, with each tissue having a distinct texture and with two conventional surgical instruments (grasper and hook) present, under a random placement of the camera (coupled with a light source).
Cholec 27 data contain around 7K endoscopic video frames acquired from 15 videos of the Cholec80 dataset. 31 The images describe the laparoscopic cholecystectomy scene with seven conventional surgical instruments (grasper, hook, scissors, clipper, bipolar, irrigator and specimen bag). The data provide segmentations for each instrument type; however, the specimen bag is considered a counterexample that is treated as background during evaluation, following the definition of an instrument in the RobustMIS challenge. 26
EndoVis 1 data consist of 300 images from six different in-vivo 2D recordings of complete laparoscopic colorectal surgeries. The data provide binary segmentations of instruments for validation, where the images describe an endoscopic scene containing seven conventional instruments (including hook, traumatic grasper, ligasure, stapler, scissors and scalpel). 5
RobustMIS 26 data consist of around 10K images acquired from 30 surgical procedures of three different types of colorectal surgery (10 rectal resection procedures, 10 proctocolectomy procedures and 10 sigmoid resection procedures). An instrument is defined as an elongated rigid object that is manipulated directly from outside the patient. Therefore, grasper, scalpel, clip applicator, hooks, stapling device, suction and even trocar are considered instruments, while non-rigid tubes, bandages, compresses, needles, coagulation sponges, metal clips etc. are considered counterexamples as they are indirectly manipulated from outside. 26 The data provide instance-level segmentations for validation, which is performed in three different stages with an increasing domain gap between the training and the test data. Stage 1 contains video frames from 16 cases of the training data, stage 2 has video frames of two proctocolectomy and two rectal resection surgeries, and stage 3 has video frames from 10 sigmoid resection surgeries.
It is important to note that the domain gap increases not only across the three stages of testing in the RobustMIS dataset, but also from the Simulation towards the Real datasets (EndoVis < Cholec < RobustMIS), as the definition of instrument (and/or counterexample) changes along with other factors.

Implementation
We have redesigned the implementation of the Endo-Sim2Real framework in view of a teacher-student approach. To ensure a direct and fair comparison, we employ the same TerNaus11 28 as the backbone segmentation model. Also, we utilize the best performing perturbation scheme (i.e. applying one of the pixel-intensity perturbations 2 followed by one of the pixel-corruption perturbations 3 ) and the loss function (i.e. cross-entropy and Jaccard) of Endo-Sim2Real for evaluation. All simulated input images and labels are first pre-processed with a stochastically-varying circular outer mask to give them the appearance of real endoscopic images.
2 pixel-intensity: random brightness and contrast shift, posterisation, solarisation, random gamma shift, random HSV color space shift, histogram equalization and contrast limited adaptive histogram equalization
3 pixel-corruption: Gaussian noise, motion blurring, image compression, dropout, random fog simulation and image embossing
We train for 50 epochs with a batch size of 8 and apply weight decay ($10^{-6}$) as standard regularization. During consistency training, we use a time-dependent weighting function, where the weight of the unlabeled loss term is linearly increased over the course of training. The teacher model is updated with a smoothing factor $\alpha = 0.95$ at each training step.
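The linearly increasing weight of the unlabeled loss term can be sketched as follows. This is only an illustration: the function names `linear_rampup` and `joint_loss`, the `max_weight` parameter and its default of 1.0 are assumptions, since the text states that the increase is linear but not the final weight or the ramp-up length.

```python
def linear_rampup(step, total_steps, max_weight=1.0):
    """Weight of the consistency loss at a given training step.

    Grows linearly from 0 at step 0 to max_weight at total_steps and is
    clamped outside that range. max_weight is an assumed hyperparameter.
    """
    if total_steps <= 0:
        return max_weight
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * progress


def joint_loss(supervised_loss, consistency_loss, step, total_steps, max_weight=1.0):
    """Supervised loss plus the time-weighted consistency loss."""
    weight = linear_rampup(step, total_steps, max_weight)
    return supervised_loss + weight * consistency_loss


# Early in training the supervised term dominates; later both terms count.
for step in (0, 250, 500, 1000):
    print(step, linear_rampup(step, total_steps=1000))
```

Keeping the weight near zero at the start limits how much the student learns from pseudo-labels while the predictions on real images are still unreliable.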
During evaluation of a dataset, we compute an image-based dice score and average it over all images to obtain a global dice metric for the dataset. For the computation of the dice score, we exclude the cases where both the prediction and the ground truth are empty. However, we include empty ground-truth images for which the prediction contains false positives and set their dice score to zero. The dice score for an empty ground-truth image (without any instrument) is therefore either zero and included in the average (in case of any false positives) or undefined and excluded from the average (in case of a correct empty prediction). Also, we report all results as the average performance of three runs throughout our experiments.
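The evaluation rule for empty frames can be made concrete with a small sketch. It assumes binary masks given as flat sequences of 0/1 values; the function names `image_dice` and `dataset_dice` are hypothetical and the code is an illustration of the stated convention, not the authors' evaluation script.

```python
def image_dice(pred, truth):
    """Dice score for one image, or None when the score is undefined.

    pred, truth: flat sequences of 0/1 pixel labels of equal length.
    Returns None when both masks are empty (image excluded from the average)
    and 0.0 when the ground truth is empty but the prediction is not.
    """
    pred_sum = sum(pred)
    truth_sum = sum(truth)
    if pred_sum == 0 and truth_sum == 0:
        return None
    intersection = sum(p * t for p, t in zip(pred, truth))
    return 2.0 * intersection / (pred_sum + truth_sum)


def dataset_dice(pairs):
    """Mean of the defined per-image dice scores over (pred, truth) pairs."""
    scores = [s for s in (image_dice(p, t) for p, t in pairs) if s is not None]
    return sum(scores) / len(scores) if scores else None


pairs = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect prediction
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # both empty: excluded from the average
    ([0, 1, 0, 0], [0, 0, 0, 0]),  # false positive on an empty frame: 0.0
    ([1, 0, 0, 0], [1, 1, 0, 0]),  # partial overlap
]
print(dataset_dice(pairs))
```

Under this convention an instrument-free frame can only lower the dataset score or leave it unchanged, which is why datasets with many empty frames are penalized more strongly by false positives.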

Results and Discussion
This section provides a quantitative comparison with respect to the state-of-the-art approaches to demonstrate the effectiveness of our approach. Moreover, we perform a quantitative and qualitative analysis on three different datasets with varying degrees of domain gap. This shows the strengths and weaknesses of our approach in order to better understand the challenges and provide valuable insights into addressing the remaining performance gap.

Comparison with baseline and state-of-the-art
In these experiments, we first highlight the performance of the two baselines in Table 2: the lower baseline (supervised learning purely on simulated data) and the upper baseline (supervised learning purely on annotated real data). The substantial performance gap between the baselines indicates the domain gap between simulated and real data. Secondly, we compare our proposed teacher-student approach with other unsupervised domain adaptation approaches, i.e. the domain style transfer approach (I2I) and the plain consistency-based joint learning approach (Endo-Sim2Real), on the Cholec dataset. The empirical results show that Endo-Sim2Real performs similarly to I2I, while our proposed approach outperforms both of these approaches. We then evaluate our approach on two additional datasets and show that it consistently outperforms Endo-Sim2Real. These experiments demonstrate that the generalization performance of the DNN can be enhanced by employing unsupervised consistency learning on unlabeled data. Finally, the performance gap with respect to the upper baseline calls for identification of the issues that need to be addressed to bridge the remaining domain gap.

Analysis on EndoVis
Among the three datasets, our proposed approach performs best for EndoVis as shown in Table 2. A visual analysis of the low performing cases in Figure 3 highlights factors such as false detection on specular reflection, under-segmentation for small instruments, tool-tissue interaction and partially occluded instruments. These factors can in part be addressed by utilizing the temporal information of video frames.
Figure 4: Visualization of the relation between tool co-occurrence and segmentation quality for the Cholec dataset. Please note that the dice score is zero for no tool cases and specimen bag as it is treated as background.

Analysis on Cholec
We performed an extensive performance analysis of our proposed approach on the Cholec dataset, as instrument-specific labels are available for it (in contrast to EndoVis). To understand the distinctive performance aspects for the Cholec dataset, we compare the segmentation performance across different instrument co-occurrences in Figure 4. A similar range of dice scores highlights that the performance of our approach is less impacted by the presence of multiple tool combinations in an endoscopic image. However, it also clearly shows that the segmentation performance of our approach drops when the specimen bag and its related co-occurrences are present (as seen in the respective box plots in Figure 4). A visual analysis highlights false detection on the reflective surface of the specimen bag. Apart from the previously analyzed performance degrading factors in the EndoVis dataset, other major factors affecting the performance are as follows:
* Out-of-distribution cases such as a non-conventional tool-shape-like instrument: the specimen bag (see box plots for label sets with specimen bag in Figure 4).
* False detection for scenarios such as an endoscopic view within the trocar, instrument(s) near the image border or under-segmentation for small instruments.
* Artefact cases such as specular reflection. The impact of other artefacts such as blood, smoke or motion blur is lower.
Although our proposed approach struggles to tackle these artefacts and out of distribution cases, addressing these performance degrading factors is itself an open research problem.

Analysis on RobustMIS
We analyzed the performance of our approach on images with a different number of instruments in the RobustMIS dataset. We found that the performance is not significantly affected by the presence of multiple tools (see Table 3). The low performance for a single visible instrument is attributed to small, stand-alone instruments across the image boundary. Apart from the factors in the Cholec dataset, other real-world performance degrading factors in RobustMIS include: the presence of other out-of-distribution cases such as non-rigid tubes, bandages, needles etc.; and the presence of corner cases such as trocar views and specular reflections producing an instrument-shape-like appearance. These failure cases highlight a drawback of our approach, which works under the assumption that the shape of the instrument remains consistent between the domains. Therefore, our approach may not be able to produce faithful predictions when the real domain contains instruments with shapes that differ from those in the simulation, or counterexamples with an instrument-like appearance.

Impact of empty ground-truth frames
The performance of our teacher-student approach is negatively affected by video frames that do not contain instruments. This is because the dice score is set to zero when the network predicts false positives in instrument-free video frames (as seen in Figures 5 and 6). A direct relation of this effect can be seen in Table 2, where the dice score across the datasets decreases as the percentage of empty frames increases from EndoVis to RobustMIS. This suggests that adding techniques for suppressing false detections to the current framework can help in enhancing its generalization capabilities.

Conclusion
We introduce teacher-student learning to address the confirmation bias issue of the Endo-Sim2Real consistency learning. This enables us to tackle the challenging problem of the domain shift between synthetic and real images for surgical tool segmentation in endoscopic videos. Our proposed approach enforces the teacher model to generate reliable targets to facilitate stable student learning. Since the teacher is a moving average model of the student, the extension does not add computational complexity to the current approach. We show that the proposed teacher-student learning approach generalizes across three different datasets for the instrument segmentation task and consistently outperforms the previous state-of-the-art. For a majority of images (see the high peak in Figures 3, 5 and 6), the segmentation predictions are usually correct with small variations across the instrument boundary. Moreover, a thorough analysis of the results highlights interpretable failure modes of simulation-to-real deep learning as the domain gap widens progressively.
Considering the strengths and weaknesses of our teacher-student enabled consistency-based unsupervised domain adaptation approach, the framework admits multiple straightforward extensions:
* Implementing techniques to suppress false detection for empty frames, instruments near the image border and specular reflections, for instance by utilizing temporal information 13 of video frames.
* Improving physical properties of the simulation to capture instrument-tissue interaction, considering the variations in predictions across instrument boundaries.
* Extension towards other domain adaptation tasks, for instance depth estimation 20;24 or instrument pose estimation 8;3 by exploiting depth maps from the simulated virtual environments.
* Extension towards semi-supervised domain adaptation or real-to-real unsupervised domain adaptation by joint learning from labeled and unlabeled data.
* Utilizing this approach on top of self-supervised or adversarial domain mapping approaches such as I2I.
The heavy reliance of current approaches on manual annotation and the harsh reality of surgeons having to spare time for the annotation process make simulation-to-real domain adaptation an obvious problem to address in surgical data science. The proposed approach is a step towards annotation-efficient surgical data science for the operating room of the future.

Author Statement
Conflict of interest: The authors state no conflict of interest.
Informed consent: This study contains patient data from a publicly available dataset.
Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.