Self-Supervised Domain Adaptation for Patient-Specific, Real-Time Tissue Tracking

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12263)


Estimating tissue motion is crucial to provide automatic motion stabilization and guidance during surgery. However, endoscopic images often lack distinctive features and fine tissue deformation can only be captured with dense tracking methods like optical flow. To achieve high accuracy at high processing rates, we propose fine-tuning of a fast optical flow model to an unlabeled patient-specific image domain. We adopt multiple strategies to achieve unsupervised fine-tuning. First, we utilize a teacher-student approach to transfer knowledge from a slow but accurate teacher model to a fast student model. Secondly, we develop self-supervised tasks where the model is encouraged to learn from different but related examples. Comparisons with out-of-the-box models show that our method achieves significantly better results. Our experiments uncover the effects of different task combinations. We demonstrate that unsupervised fine-tuning can improve the performance of CNN-based tissue tracking and opens up a promising future direction.


Keywords: Patient-specific models · Motion estimation · Endoscopic surgery

1 Introduction

In (robot-assisted) minimally invasive surgery, instruments are inserted through small incisions and observed by video endoscopy. Remote control of instruments is a complex task and requires a trained operator. Computer assistance and guidance during surgery, in the form of automatic interpretation of endoscopic images using tracking and pose estimation, not only enables robotic surgery but can also help surgeons to operate more safely and precisely. Despite the maturity of many computer vision methods, visual motion estimation of moving tissue is still a challenging problem. Accurate motion estimation at fast feedback rates is critical for intra-operative guidance and patient safety.
Fig. 1.

Tracking liver tissue with very sparse texture and respiratory motion for 280 consecutive frames. Additional “regularization” in teacher-student fine-tuning improves smoothness of tracking grid, i.e. less drift on sparse tissue surface.

Visual motion can be estimated with sparse tracking, e.g. based on feature matching, or with dense tracking algorithms like optical flow estimation. Because tissue deformation can only be fully captured with dense tracking, we focus this work on motion estimation with dense optical flow (OF). Traditionally, OF algorithms are based on conventional image processing with engineered features (an overview is provided in [19]). We further focus on end-to-end deep learning models [5, 13, 15, 21, 22, 24], as these outperform the conventional methods on public OF benchmarks [3, 7, 16]. Unfortunately, these models exhibit the common speed vs. accuracy trade-off. High accuracy is achieved through high complexity, which leads to slow processing rates. Faster models, on the other hand, generalize less well and provide lower accuracy. The goal is flow estimation that is simultaneously fast and accurate.

In previous work [12], we proposed patient-specific fine-tuning of a fast OF model based on an unsupervised teacher-student approach. A high-accuracy but slow teacher model (FlowNet2 @12 fps [13]) is used to compute pseudo-labels for endoscopic scenes, which can then be used to fine-tune a fast but imprecise student model (FlowNet2S @80 fps [13]) to a patient-specific domain\(^1\). Supervised fine-tuning with pseudo-labels drastically improved the accuracy of the student model on the patient-specific target domain at a speed-up of factor 6; results are shown in Fig. 1. However, there are two drawbacks to this method. First, at best, the student can only become as good as the teacher. Second, the assumption that the teacher model provides good labels for the patient-specific endoscopic scene might be overly optimistic.

In this work, we propose a significantly extended method for fast patient-specific optical flow estimation. Our work draws inspiration from two research fields. First, in line with our previous work we utilize a teacher-student approach. Teacher-student approaches are found in model compression and knowledge distillation [2, 10] and have attracted attention in other areas such as domain adaptation [6]. Second, inspired by self-supervised learning [4] we design additional optical flow tasks for the model to solve. This enables us to train on unlabeled data in a supervised manner. Joining both ideas, we propose supervised fine-tuning on pseudo-labels computed from real image pairs, synthetic image pairs created from pseudo motion fields and real image pairs with real (simplified) motion fields. We further apply augmentation to the input images during fine-tuning to pose a higher challenge for the student model. Increasing difficulty for the student is commonly used in classification to boost performance.

Our contributions are as follows: A completely unsupervised fine-tuning strategy for endoscopic sequences without labels. An improved teacher-student approach (student can outperform teacher) and novel self-supervised optical flow tasks for use during unsupervised fine-tuning.

In the following section we will explain our method. Then we describe the experimental setup and implementation details that are followed by our results. The paper concludes with a discussion and an outlook.

2 Related Work

Tissue tracking is a challenging task due to often sparse texture and poor lighting conditions, especially with regard to stable, long-term tracking or real-time processing [23]. The issue of sparse texture was recently tackled for ocular endoscopy by supervised fine-tuning of FlowNet2S on retinal images with virtual, affine movement [9]. This was the first time the authors obtained an estimation accuracy sufficient for successfully mosaicking small scans to retinal panoramas. In our pre-experiments, a FlowNet2S fine-tuned with virtual, affine motion was not able to track tissue deformations, which are more complex than affine motion. Unsupervised training based on UnFlow [15], which does not have the restriction of a simplified motion model, did not converge on sparsely textured image pairs in our experiments. In 2018, Armin et al. proposed an unsupervised method (EndoReg) to learn interframe correspondences for endoscopic images [1]. However, we have shown that both FlowNet2 and FlowNet2S outperform EndoReg, achieving higher structural similarity indices (SSI) [12].

3 Self-Supervised Teacher-Student Domain Adaptation

Our goal is fast and accurate OF estimation of a patient-specific, endoscopic target domain \(T = \big (X, Y\big )\) with sample \(\big ((x_1, x_2), y\big ) \in T\) - the tuple \((x_1, x_2) \in X\) being the input image pair sample and \(y \in Y\) the corresponding motion field. Our fast student model has been trained for OF estimation on a source domain S which is disjoint from T. Unfortunately, as we don’t have labels, true samples from the target domain are unknown, so we cannot directly deploy supervised domain adaptation. Our aim is therefore to obtain good approximations of target samples. First, we will revisit optical flow estimation and then introduce our proposed three training schemes: pseudo-labels, teacher-warp and zero-flow.
Fig. 2.

Overview of method and training tasks to fine-tune a student model. a) A teacher model is used to produce pseudo labels \(\tilde{y}\). b) The student can train on these pseudo labels. c) Teacher Warp: \(x_1\) is warped using \(\tilde{y}\) to create a pseudo image pair with real label. d) Zero-flow: Given the same image twice, the real flow is 0. Image augmentation (au) can be applied to increase the difficulty for the student model.

3.1 Optical Flow Estimation

Optical flow (OF) estimation is a regression problem to determine visible motion between two images. The displacement of corresponding pixels is interpreted as a vector field describing the movement of each image component [11].

The primary assumption is brightness consistency. If we deploy the motion field y as a mapping function \(y:x_1 \rightarrow x_2\) to map the pixel values of \(x_1\), we obtain a reconstruction \(y(x_1)\) that is identical to \(x_2\) (neglecting occlusion effects between the images). In practice, an estimated flow field \(\hat{y}\approx y\) is generally not a perfect mapping. The difference in pixel values between \(x_2\) and its reconstruction \(\hat{x}_2 = \hat{y}(x_1)\) can be used to assess the quality of \(\hat{y}\).
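
The reconstruction check described above can be illustrated with a minimal sketch. It assumes grey-value images stored as nested lists, integer-valued flow vectors and nearest-pixel forward splatting. The function names `forward_warp` and `reconstruction_error` are hypothetical and not part of the original implementation.

```python
def forward_warp(image, flow, fill=None):
    """Move every pixel of `image` along its flow vector (nearest-pixel splatting).

    image: H x W list of lists of grey values.
    flow:  H x W list of lists of (dx, dy) displacements in pixels.
    Target pixels that receive no source pixel keep the value `fill`.
    """
    h, w = len(image), len(image[0])
    warped = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            dx, dy = flow[r][c]
            tr, tc = r + round(dy), c + round(dx)
            if 0 <= tr < h and 0 <= tc < w:
                warped[tr][tc] = image[r][c]
    return warped


def reconstruction_error(x2, x2_hat):
    """Mean absolute grey-value difference over the pixels that were reconstructed."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(x2, x2_hat)
             for a, b in zip(row_a, row_b)
             if b is not None]
    return sum(diffs) / len(diffs) if diffs else float("nan")


# Toy example (assumption): x2 is x1 shifted one pixel to the right.
x1 = [[10, 20, 30, 40], [50, 60, 70, 80]]
x2 = [[0, 10, 20, 30], [0, 50, 60, 70]]
true_flow = [[(1, 0)] * 4 for _ in range(2)]
zero_flow = [[(0, 0)] * 4 for _ in range(2)]

error_true = reconstruction_error(x2, forward_warp(x1, true_flow))
error_zero = reconstruction_error(x2, forward_warp(x1, zero_flow))
```

In this toy case the correct flow reconstructs \(x_2\) without error on all covered pixels, while the all-zero flow leaves a clearly non-zero residual. Real pipelines typically use sub-pixel interpolation instead of nearest-pixel splatting.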

Another assumption for OF is cycle consistency, in other words the concatenation of a forward mapping \(y:x_1 \rightarrow x_2\) and its reverse mapping \(y^{-1}:x_2 \rightarrow x_1\) cancel each other out and result in zero motion: \(x_1 = y^{-1}\big (y(x_1) \big )\).

3.2 Teacher-Student Domain Adaptation

Our teacher-student knowledge transfer approach requires a fast student model \(g_\phi :(x_1,x_2)\rightarrow \hat{y}\) parameterized with \(\phi \) and an accurate teacher model h with fixed parameters. We optimize the student model with the following supervised objective function
$$\begin{aligned} \arg \min _\phi \mathcal {L} \big (g_\phi (x_1, x_2), y\big ) \end{aligned}$$
Because y is not known, we propose three different approximations for self-supervised domain adaptation. An overview is provided in Fig. 2.
Pseudo Labels. The first approach is an approximation \(\tilde{y}\) of the motion field \(y:x_1 \rightarrow x_2\) with the teacher model \(h(x_1, x_2) = \tilde{y} \approx y\). This approach is the baseline introduced in our previous work. It combines real image pairs with approximated motion labels, resulting in the following loss function (approximation is underlined):
$$\begin{aligned} \mathcal {L} \big (g_\phi (x_1, x_2), \underline{h(x_1, x_2)} \big ) = \mathcal {L}\big (g_\phi (x_1, x_2), \tilde{y} \big ) \approx \mathcal {L}\big (g_\phi (x_1, x_2), y\big ). \end{aligned}$$
Teacher Warp. The second approach is the generation of synthetic (pseudo) image pairs from known motion fields. We tried random motion fields; however, this led the student to learn erroneous motion manifolds. We therefore use the teacher’s motion fields to generate image pairs \(\tilde{y}: x_1 \rightarrow \tilde{x}_2\). Note that \(\tilde{y}\) is not an approximation but a perfect label for the pair \((x_1,\tilde{x}_2)\), leading to
$$\begin{aligned} \mathcal {L} \big (g_\phi (\underline{x_1, \tilde{x}_2}), h(x_1, x_2) \big ) = \mathcal {L}\big (g_\phi (x_1, \tilde{x}_2), \tilde{y} \big ) = \mathcal {L}\big (g_\phi (x_1, \tilde{x}_2), y\big ). \end{aligned}$$
Zero-Flow Regularization. Our third and final approach is the combination of real images with real motion. This can be achieved with Zero-Flow, in other words \(x_1 = x_2\) and \(y=\mathbf {0}\), leading to the following loss
$$\begin{aligned} \mathcal {L}\big (g_\phi (x_1, x_1), \mathbf {0}\big ). \end{aligned}$$
Due to the high simplification of these samples, this loss is not suitable to be used as a sole training loss but can be used in combination with the other losses as a form of regularization (or minimum requirement of OF estimation).
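
The three training schemes can be summarized in a short sketch that builds one (image pair, label) sample per task and sums the losses with equal weights. This is a simplified illustration under stated assumptions: images and flow fields are nested lists, a single-scale mean L1 loss stands in for the actual training loss, and `toy_teacher`, `toy_warp` and `toy_student` are hypothetical stand-ins for the real networks and warping operator.

```python
def l1_flow_loss(pred, target):
    """Mean L1 distance between two flow fields (H x W lists of (dx, dy))."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for (px, py), (tx, ty) in zip(row_p, row_t):
            total += abs(px - tx) + abs(py - ty)
            count += 1
    return total / count


def build_training_samples(x1, x2, teacher, warp):
    """Return one (input pair, label) sample for each of the three tasks."""
    y_tilde = teacher(x1, x2)        # pseudo-label from the fixed teacher
    x2_tilde = warp(x1, y_tilde)     # synthetic second image; y_tilde is its exact label
    zero = [[(0.0, 0.0) for _ in row] for row in x1]
    return {
        "pseudo_labels": ((x1, x2), y_tilde),
        "teacher_warp": ((x1, x2_tilde), y_tilde),
        "zero_flow": ((x1, x1), zero),
    }


def combined_loss(student, samples):
    """Sum the per-task losses with equal weights."""
    return sum(l1_flow_loss(student(*pair), label)
               for pair, label in samples.values())


# --- hypothetical stand-ins, for illustration only ---
def toy_teacher(x1, x2):
    return [[(1.0, 0.0) for _ in row] for row in x1]   # "everything moves 1 px right"


def toy_warp(image, flow):
    warped = [row[:] for row in image]                 # uncovered pixels keep their old value
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            tc = c + int(flow[r][c][0])
            if 0 <= tc < len(row):
                warped[r][tc] = value
    return warped


def toy_student(x1, x2):
    return [[(0.0, 0.0) for _ in row] for row in x1]   # untrained: always predicts no motion


x1 = [[10, 20, 30, 40]]
x2 = [[0, 10, 20, 30]]
samples = build_training_samples(x1, x2, toy_teacher, toy_warp)
loss = combined_loss(toy_student, samples)
```

With the stand-ins above, the zero-flow task contributes no loss for a student that always predicts zero motion, which illustrates why it cannot serve as the sole training signal.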

Chromatic Augmentation. To increase the learning challenge for the student, we globally alter contrast \(\gamma \) and brightness \(\beta \) via \(x_v' = \gamma \cdot x_v + \beta \) and add Gaussian noise n to each pixel individually (applied to the value channel \(x_v\) in HSV colour space).
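
A minimal sketch of this augmentation is given below, using the sampling parameters reported in the experimental setup. Two points are assumptions and not stated in the text: the second parameter of each normal distribution is treated as a standard deviation, and the result is clipped to the value range [0, 255]. The conversion to and from HSV is omitted, and the function name `chromatic_augment` is hypothetical.

```python
import random


def chromatic_augment(value_channel, rng,
                      beta_sigma=3.0, gamma_sigma=0.15, noise_sigma=2.0):
    """Apply x' = gamma * x + beta + n to the value channel of one image.

    value_channel: H x W list of lists with values in [0, 255].
    rng:           random.Random instance (passed in for reproducibility).
    """
    gamma = rng.gauss(1.0, gamma_sigma)   # global contrast, one draw per image
    beta = rng.gauss(0.0, beta_sigma)     # global brightness, one draw per image
    augmented = []
    for row in value_channel:
        new_row = []
        for v in row:
            noisy = gamma * v + beta + rng.gauss(0.0, noise_sigma)  # per-pixel noise
            new_row.append(min(255.0, max(0.0, noisy)))             # assumed clipping
        augmented.append(new_row)
    return augmented


value_channel = [[0.0, 64.0, 128.0], [192.0, 255.0, 32.0]]
augmented = chromatic_augment(value_channel, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which helps when comparing fine-tuning runs.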

4 Experiments

In our experiments we compare the proposed self-supervised domain adaptation techniques introduced in Sect. 3: teacher labels, teacher warp, zero-flow. We test the approaches individually, as well as combinations. For combinations, loss functions are summed with equal weights. All combinations were tested with and without image augmentation. A list of combinations can be seen in Table 1. We do not focus on occlusion, therefore no occlusion handling is implemented.
Table 1.

Average cycle consistency error (CCE), average relative end point error EPE*. Best EPE* is achieved with a combined training scheme of teacher labels, teacher warp and zero-flow. CCE for training schemes including zero-flow is not conclusive and is therefore grayed out. (\(\dagger \) not adapted to medical image domain; +F: after fine tuning)

4.1 Setup

For our experiments we follow a similar setup to our previous work [12]. The main difference is the extension of the loss function and corresponding variations of the training samples that were added to the datasets.

Datasets. For our experiments we use four endoscopic video sequences from the Hamlyn datasets: scale and rotation (both in vivo porcine abdomen [17]), sparse texture with respiratory motion (in vivo porcine liver [17]), and strong deformation from tool interaction with low texture (in vivo lung lobectomy procedure [8]). All sequences show specular highlights. The sparse-texture sequence shows very challenging lighting conditions. Each dataset represents an individual patient. To simulate intra-operative sampling from a patient-specific target domain, we split the sequences into disjoint sets. The training and validation sets represent the sampling phase (preparation stage), while the application phase during intervention is represented by the test set. The splits were chosen manually, so that training and test data both cover dataset-specific motion. The numbers of samples in the training/validation/test sets are: scale - 600/240/397; rotation - 329/110/161; sparse - 399/100/307; deformation - 600/200/100. The left camera was used for all datasets.

Model. For all our experiments we used the accurate, high-complexity FlowNet2 framework [13] running at approx. 12 fps as our teacher model \(h\) and its fast FlowNet2S component running at approx. 80 fps as our low-complexity, fast student model \(g\). FlowNet2 achieves very high accuracy on the Sintel benchmark [3], which includes elastic deformation, specular highlights and motion blur, which are common effects in endoscopy. FlowNet2S has been shown to handle sparse textures well [9]. We utilized the implementation and pretrained models from [18].
Fig. 3.

Deformation dataset: Tracking results after 80 frames. The tissue is undergoing strong deformation due to tool interaction. Almost all fine-tuned models show very comparable results to the teacher FlowNet2. Unfortunately, the teacher shows drift in the bottom right and the centre of the tracked grid. Our model fine-tuned with added image augmentation produces more robust tracking results in this area. Interestingly, adding image augmentation to any other fine-tuning scheme did not improve the tracking results (also for different augmentation parameters we tried).

Training and Augmentation Parameters. For training we used the Adam optimizer with a constant learning rate of 1e−4 and a batch size of 8. Training was performed until convergence of the validation loss, with a maximum of 120 epochs. Extending and augmenting the training data set increases training times. This was not the focus of this work but should be addressed in future work. We follow Sun et al.’s recommendation to fine-tune FlowNet2S with the multi-scale L1-loss [13, 20]. For illumination augmentation, we sampled brightness \(\beta \sim \mathcal {N}(0,3)\) and contrast \(\gamma \sim \mathcal {N}(1,0.15)\) for each image individually. Noise n was sampled from \(\mathcal {N}(0,2)\). The value range of the input images was between 0 and 255. No normalization was performed on the input images.

4.2 Results

We provide the relative endpoint error (\(\text {EPE}^* = ||(g(x_1,x_2)- \tilde{y})||_2\)) as well as the cycle consistency error (\(\text {CCE} = ||\mathbf {p} - \hat{y}^{-1}\big (\hat{y}(\mathbf {p}) \big )||_2\), \(\mathbf {p}\): image coordinates) for all our experiments in Table 1. We compare our fine-tuned models with the teacher and the student model, as well as our baseline (only teacher labels). All results were obtained using the test sets described in Sect. 4.1.
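
Both metrics can be sketched in a few lines. The sketch assumes flow fields stored as nested lists of (dx, dy) vectors and averages the errors over the image. For the CCE it further assumes that the inverse mapping is approximated by a backward flow estimated from \((x_2, x_1)\), sampled at the nearest pixel, and that points leaving the image are skipped. The function names are hypothetical.

```python
import math


def relative_epe(student_flow, teacher_flow):
    """Mean Euclidean distance between student and teacher flow vectors (EPE*)."""
    dists = [math.hypot(sx - tx, sy - ty)
             for row_s, row_t in zip(student_flow, teacher_flow)
             for (sx, sy), (tx, ty) in zip(row_s, row_t)]
    return sum(dists) / len(dists)


def cycle_consistency_error(forward_flow, backward_flow):
    """Mean distance between p and the point reached by going forward, then backward."""
    h, w = len(forward_flow), len(forward_flow[0])
    errors = []
    for r in range(h):
        for c in range(w):
            fx, fy = forward_flow[r][c]
            qr, qc = round(r + fy), round(c + fx)   # nearest pixel hit in x2
            if not (0 <= qr < h and 0 <= qc < w):
                continue                            # point left the image
            bx, by = backward_flow[qr][qc]
            errors.append(math.hypot(fx + bx, fy + by))
    return sum(errors) / len(errors) if errors else float("nan")


def constant_flow(dx, dy, h=2, w=3):
    return [[(dx, dy)] * w for _ in range(h)]


epe_star = relative_epe(constant_flow(1.0, 0.0), constant_flow(4.0, 4.0))
cce_consistent = cycle_consistency_error(constant_flow(1.0, 0.0), constant_flow(-1.0, 0.0))
cce_inconsistent = cycle_consistency_error(constant_flow(1.0, 0.0), constant_flow(0.0, 0.0))
cce_zero_model = cycle_consistency_error(constant_flow(0.0, 0.0), constant_flow(0.0, 0.0))
```

Note that a model predicting zero flow everywhere trivially obtains a CCE of zero in this sketch, so the metric has to be read together with EPE* and the tracking results.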

The best EPE* is achieved with a combined training scheme of teacher labels, teacher warp and zero-flow. The best EPE* does not necessarily indicate the best model, though: a low EPE* only indicates high similarity to the teacher’s estimation. Almost all fine-tuned models achieve a better cycle consistency than the original student model, which indicates an improved model on the target domain. A low CCE, however, does not guarantee a good model: the best CCE is achieved by estimating zero flow everywhere (which is not necessarily a correct estimation).

Due to the lack of annotated ground truth, we also evaluate accuracy with a tracking algorithm over consecutive frames. Small errors between two frames are not visible. However, during tracking, small errors add up over consecutive frames, resulting in drift and making small errors visible over time. We see that in most cases the advanced domain adaptation improved tracking quality compared to the baseline; see Figs. 1, 3, 4, and 5. In some cases it even outperforms the teacher model FlowNet2. Detailed analysis is provided in the captions. For videos, see the supplementary material.
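
How small per-frame errors accumulate into visible drift can be illustrated with a simple point tracker that chains frame-to-frame flow fields. This is only a sketch: the nearest-pixel flow lookup, the function name `track_points` and the constant bias of 0.05 px per frame are assumptions chosen for illustration, not the tracking algorithm or error level of the experiments.

```python
def track_points(points, flows):
    """Propagate (x, y) points through a sequence of frame-to-frame flow fields.

    points: list of (x, y) positions in the first frame.
    flows:  list of H x W flow fields, each a list of lists of (dx, dy).
    Returns the list of point sets, one per frame (including the first).
    """
    trajectory = [list(points)]
    for flow in flows:
        h, w = len(flow), len(flow[0])
        moved = []
        for x, y in trajectory[-1]:
            r = min(max(round(y), 0), h - 1)   # nearest pixel, clamped to the image
            c = min(max(round(x), 0), w - 1)
            dx, dy = flow[r][c]
            moved.append((x + dx, y + dy))
        trajectory.append(moved)
    return trajectory


# Static scene, but every estimated flow is off by 0.05 px in x (hypothetical bias).
biased_flow = [[(0.05, 0.0)] * 4 for _ in range(4)]
trajectory = track_points([(1.0, 1.0)], [biased_flow] * 280)
drift = trajectory[-1][0][0] - trajectory[0][0][0]   # about 14 px after 280 frames
```

An error far below one pixel per frame pair thus grows to a clearly visible offset over a few hundred frames, which is why tracking results are a useful complement to the frame-pair metrics.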
Fig. 4.

Rotation dataset: Tracking results after 160 frames. The orientation of the tissue changes due to camera rotation. In contrast to Fig. 3, the extended fine-tuning schemes bring the tracking results closer to the teacher model, while the image augmentation breaks the tracking performance. The jump of the average CCE for this combination is comparably high, which might be a way to detect such erroneous models (subject to future work). Change of orientation generally seems to be a more difficult task than scale change, presumably due to the lack of rotational invariance of convolutions.

Fig. 5.

Scale dataset: Tracking results after 396 frames. The scale of the tissue changes as the camera moves in and out of the situs. The tissue becomes lighter with higher proximity to the camera. This sequence contains sections with a temporarily static camera pose. Interestingly, drift predominantly occurs during low-motion sections. The zero-flow fine-tuning scheme improves the smoothness of the tracking grid. Image augmentation reduces drift in the bottom of the mesh. However, for the upper part of the mesh, drift is lower for all other models. Please note the extended length of this sequence compared to the other sequences.

Overall, adding teacher-warp to the training samples always improved the EPE*; however, it should not be used on its own. As expected, training with added zero-flow almost always improved cycle consistency. Adding image augmentation seems to enforce conservative estimation. In Figs. 3 and 4 the model trained with augmentation predicted less motion than the models from all other training schemes. This may be beneficial in some cases, e.g. for very noisy data; however, the hyper-parameters for chromatic augmentation need to be chosen carefully to match the target domain.

5 Conclusion

We proposed advanced, self-supervised domain adaptation methods for tissue tracking based on a teacher-student approach. This tackles the problem of lacking annotations for real tissue motion to train fast OF networks. The advanced methods improve on the baseline in many cases and in some cases even outperform the complex teacher model.

Further studies are required to determine a best overall strategy. We plan to extend the training scheme with further domain knowledge. Pseudo image pairs can be created using affine transformation, which is equivalent to synthetic camera movement. Cycle consistency can be, and has been, used for training and for estimating occlusion maps [15]. DDFlow learns occlusion maps using an unsupervised teacher-student approach [14]. This seems like a very promising extension for our approach. Occlusion maps are an essential next step for safe tissue tracking.


  1. Training samples should at best be identical to application samples. We therefore also propose to obtain training samples directly prior to the surgical intervention in the operation room. Intra-operative training time was on average 15 min.



This work has received funding from the European Union as being part of the EFRE OPhonLas project.


  1. Armin, M.A., Barnes, N., Khan, S., Liu, M., Grimpen, F., Salvado, O.: Unsupervised learning of endoscopy video frames’ correspondences from global and local transformation. In: Stoyanov, D., et al. (eds.) CARE/CLIP/OR 2.0/ISIC -2018. LNCS, vol. 11041, pp. 108–117. Springer, Cham (2018)
  2. Buciluă, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: ACM SIGKDD, pp. 535–541 (2006)
  3. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: IEEE ECCV, pp. 611–625 (2012)
  4. Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning. In: IEEE ICCV, pp. 2051–2060 (2017)
  5. Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: IEEE ICCV (2015)
  6. French, G., Mackiewicz, M., Fisher, M.: Self-ensembling for visual domain adaptation. arXiv:1706.05208 (2017)
  7. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The Kitti vision benchmark suite, pp. 3354–3361, May 2012
  8. Giannarou, S., Visentini-Scarzanella, M., Yang, G.Z.: Probabilistic tracking of affine-invariant anisotropic regions. IEEE TPAMI 35(1), 130–143 (2013)
  9. Guerre, A., Lamard, M., Conze, P.H., Cochener, B., Quellec, G.: Optical flow estimation in ocular endoscopy videos using flownet on simulated endoscopy data. In: IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1463–1466 (2018)
  10. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv:1503.02531 (2015)
  11. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. Intell. 17, 185–203 (1981)
  12. Ihler, S., Laves, M.H., Ortmaier, T.: Patient-specific domain adaptation for fast optical flow based on teacher-student knowledge transfer. arXiv:2007.04928 (2020)
  13. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks. In: IEEE CVPR, July 2017
  14. Liu, P., King, I., Lyu, M.R., Xu, J.: DDFlow: learning optical flow with unlabeled data distillation. In: AAAI, vol. 33, pp. 8770–8777 (2019)
  15. Meister, S., Hur, J., Roth, S.: UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In: AAAI, New Orleans, Louisiana, pp. 7251–7259, February 2018. arXiv:1711.07837
  16. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: IEEE CVPR (2015)
  17. Mountney, P., Stoyanov, D., Yang, G.: Three-dimensional tissue deformation recovery and tracking. IEEE Signal Process. Mag. 27(4), 14–24 (2010)
  18. Reda, F., Pottorff, R., Barker, J., Catanzaro, B.: flownet2-pytorch: pytorch implementation of flownet 2.0: evolution of optical flow estimation with deep networks (2017)
  19. Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: IEEE CVPR, pp. 2432–2439, June 2010
  20. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Models matter, so does training: an empirical study of CNNs for optical flow estimation. arXiv:1809.05571 (2018)
  21. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: IEEE CVPR, pp. 8934–8943 (2018)
  22. Wulff, J., Black, M.J.: Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In: IEEE CVPR, pp. 120–130 (2015)
  23. Yip, M.C., Lowe, D.G., Salcudean, S.E., Rohling, R.N., Nguan, C.Y.: Tissue tracking and registration for image-guided surgery. IEEE Trans. Med. Imaging 31(11), 2169–2182 (2012)
  24. Yu, J.J., Harley, A.W., Derpanis, K.G.: Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 3–10. Springer, Cham (2016)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Institut für Mechatronische Systeme, Leibniz Universität Hannover, Hanover, Germany
  2. Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hanover, Germany
