# Self-Supervised Domain Adaptation for Patient-Specific, Real-Time Tissue Tracking


## Abstract

Estimating tissue motion is crucial to provide automatic motion stabilization and guidance during surgery. However, endoscopic images often lack distinctive features and fine tissue deformation can only be captured with dense tracking methods like optical flow. To achieve high accuracy at high processing rates, we propose fine-tuning of a fast optical flow model to an unlabeled patient-specific image domain. We adopt multiple strategies to achieve unsupervised fine-tuning. First, we utilize a teacher-student approach to transfer knowledge from a slow but accurate teacher model to a fast student model. Secondly, we develop self-supervised tasks where the model is encouraged to learn from different but related examples. Comparisons with out-of-the-box models show that our method achieves significantly better results. Our experiments uncover the effects of different task combinations. We demonstrate that unsupervised fine-tuning can improve the performance of CNN-based tissue tracking and opens up a promising future direction.

## Keywords

Patient-specific models · Motion estimation · Endoscopic surgery

## 1 Introduction

Visual motion can be estimated with sparse tracking, e.g. based on feature matching, or dense tracking algorithms like optical flow estimation. Because tissue deformation can only be fully captured with dense tracking, we focus this work on motion estimation with dense optical flow (OF). Traditionally, OF algorithms are based on conventional image processing with engineered features (an overview is provided in [19]). We will further focus on end-to-end deep learning models [5, 13, 15, 21, 22, 24], as these outperform the conventional methods on public OF benchmarks [3, 7, 16]. Unfortunately, these models show the common speed vs. accuracy trade-off. High accuracy is achieved by high complexity, which leads to slow processing rates. On the other hand, faster models lack the capability to generalize well and provide lower accuracy. The goal is simultaneous fast and accurate flow estimation.

In a previous work [12], we proposed patient-specific fine-tuning of a fast OF model based on an unsupervised teacher-student approach. A high-accuracy but slow teacher model (FlowNet2 @12 fps [13]) is used to compute pseudo-labels for endoscopic scenes, which are then used to fine-tune a fast but imprecise student model (FlowNet2S @80 fps [13]) to a patient-specific domain^{1}. Supervised fine-tuning with pseudo-labels drastically improved the accuracy of the student model on the patient-specific target domain at a speed-up of factor 6; results are shown in Fig. 1. However, this method has two drawbacks. First, at best, the student can only become as good as the teacher. Second, the assumption that the teacher model provides good labels for the patient-specific endoscopic scene might be overly optimistic.

In this work, we propose a significantly extended method for fast patient-specific optical flow estimation. Our work draws inspiration from two research fields. First, in line with our previous work, we utilize a **teacher-student** approach. Teacher-student approaches are found in model compression and knowledge distillation [2, 10] and have attracted attention in other areas such as domain adaptation [6]. Second, inspired by **self-supervised learning** [4], we design additional optical flow tasks for the model to solve. This enables us to train on unlabeled data in a supervised manner. Joining both ideas, we propose supervised fine-tuning on three kinds of samples: real image pairs with pseudo-labels, synthetic image pairs created from pseudo motion fields, and real image pairs with real (simplified) motion fields. We further apply augmentation to the input images during fine-tuning to pose a higher challenge for the student model. Increasing the difficulty for the student is commonly used in classification to boost performance.

Our contributions are as follows: a completely unsupervised fine-tuning strategy for endoscopic sequences without labels, an improved teacher-student approach in which the student can outperform the teacher, and novel self-supervised optical flow tasks for use during unsupervised fine-tuning.

In the following section we will explain our method. Then we describe the experimental setup and implementation details that are followed by our results. The paper concludes with a discussion and an outlook.

## 2 Related Work

Tissue tracking is a challenging task due to oftentimes sparse texture and poor lighting conditions, especially regarding stable, long-term tracking or real-time processing [23]. The issue of sparse texture was recently tackled for ocular endoscopy by supervised fine-tuning of FlowNet2S to retinal images with virtual, affine movement [9]. This was the first time the authors obtained an estimation accuracy sufficient for successfully mosaicking small scans into retinal panoramas. In our pre-experiments, a FlowNet2S fine-tuned with virtual, affine motion was not able to track tissue deformations, which are more complex than affine motion. Unsupervised training based on UnFlow [15], which does not have the restriction of a simplified motion model, did not converge on sparsely textured image pairs in our experiments. In 2018, Armin et al. proposed an unsupervised method (EndoReg) to learn interframe correspondences for endoscopic images [1]. However, we have shown that both FlowNet2 and FlowNet2S outperform EndoReg, achieving higher structural similarity indices (SSI) [12].

## 3 Self-Supervised Teacher-Student Domain Adaptation

Our goal is to adapt the fast student model to a patient-specific target domain *T*, given that it was trained on a source domain *S* which is disjoint from *T*. Unfortunately, as we don't have labels, true samples from the target domain are unknown, so we cannot directly deploy supervised domain adaptation. Our aim is therefore to obtain good approximations of target samples. First, we will revisit optical flow estimation and then introduce our three proposed training schemes: pseudo-labels, teacher-warp and zero-flow.

### 3.1 Optical Flow Estimation

Optical flow (OF) estimation is a regression problem to determine visible motion between two images. The displacement of corresponding pixels is interpreted as a vector field describing the movement of each image component [11].

The primary assumption is *brightness consistency*. If we deploy the motion field *y* as a mapping function \(y:x_1 \rightarrow x_2\) to map the pixel values of \(x_1\), we obtain a reconstruction \(y(x_1)\) that is identical to \(x_2\) (neglecting occlusion effects between the images). In practice, an estimated flow field \(\hat{y}\approx y\) is generally not a perfect mapping. The difference in pixel values between \(x_2\) and its reconstruction \(\hat{x}_2 = \hat{y}(x_1)\) can be used to assess the quality of \(\hat{y}\).
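As a concrete illustration, the reconstruction and its photometric error can be sketched in a few lines of NumPy. This is a minimal nearest-neighbour warp under our own naming, not the paper's implementation; real OF pipelines use bilinear sampling (e.g. `torch.nn.functional.grid_sample` in PyTorch):

```python
import numpy as np

def warp(img, flow):
    """Backward-warp `img` with a dense flow field (nearest neighbour).

    flow[..., 0] holds the horizontal (x) and flow[..., 1] the vertical (y)
    displacement, so out[y, x] = img[y - flow_y, x - flow_x].
    Out-of-bounds source coordinates are clipped to the image border.
    """
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]

def photometric_error(x1, x2, flow_hat):
    """Mean absolute brightness-consistency error |x2 - flow_hat(x1)|."""
    recon = warp(x1, flow_hat)
    return float(np.mean(np.abs(x2.astype(float) - recon.astype(float))))
```

A perfect flow yields a zero photometric error, while a wrong flow leaves a residual, which is exactly how \(\hat{y}\) can be assessed without labels.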

Another assumption for OF is *cycle consistency*, in other words the concatenation of a forward mapping \(y:x_1 \rightarrow x_2\) and its reverse mapping \(y^{-1}:x_2 \rightarrow x_1\) cancel each other out and result in zero motion: \(x_1 = y^{-1}\big (y(x_1) \big )\).
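Given a forward and a backward flow field, this round trip is straightforward to check. The NumPy sketch below (names and nearest-neighbour sampling are our simplifications) measures how far the forward-backward composition deviates from zero motion per pixel:

```python
import numpy as np

def cycle_consistency_error(fwd, bwd):
    """Per-pixel cycle error ||p - y^{-1}(y(p))|| for dense flows.

    `fwd` maps x1 -> x2 and `bwd` maps x2 -> x1 (both HxWx2, (dx, dy)).
    A point p lands at p + fwd(p); following the backward flow sampled at
    that location should return it to p, so the residual is
    fwd(p) + bwd(p + fwd(p)).
    """
    h, w = fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup of the backward flow at the forward target.
    tx = np.clip(np.round(xs + fwd[..., 0]).astype(int), 0, w - 1)
    ty = np.clip(np.round(ys + fwd[..., 1]).astype(int), 0, h - 1)
    resid = fwd + bwd[ty, tx]
    return np.sqrt((resid ** 2).sum(axis=-1))
```

For flows that are exact inverses the residual vanishes everywhere; any deviation flags estimation errors (or occlusions, where the assumption legitimately breaks).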

### 3.2 Teacher-Student Domain Adaptation

We denote the fast student model as \(g\) with trainable parameters and the accurate teacher model as \(h\) with fixed parameters. We optimize the student model with the supervised objective function \(\mathcal{L} = ||g(x_1, x_2) - y||\). As the true motion field \(y\) is not known, we propose three different approximations for self-supervised domain adaptation. An overview is provided in Fig. 2.

**Pseudo Labels.** The first approach is an approximation \(\tilde{y}\) of the motion field \(y:x_1 \rightarrow x_2\) with the teacher model, \(h(x_1, x_2) = \tilde{y} \approx y\). This approach is the baseline introduced in our previous work. It combines real image pairs with approximated motion labels, resulting in the following loss function (the approximation is underlined): \(\mathcal{L}_{\text{pseudo}} = ||g(x_1, x_2) - \underline{\tilde{y}}||\).

**Teacher Warp.** The second approach is the generation of synthetic (pseudo) image pairs from known motion fields. We tried random motion fields; however, this leads the student to learn erroneous motion manifolds. We therefore use the teacher's motion fields to generate image pairs, \(\tilde{y}: x_1 \rightarrow \tilde{x}_2\). Note that \(\tilde{y}\) is not an approximation but a perfect label for the pair \((x_1,\tilde{x}_2)\), leading to \(\mathcal{L}_{\text{warp}} = ||g(x_1, \underline{\tilde{x}_2}) - \tilde{y}||\).

**Zero-Flow Regularization.** Our third and final approach is the combination of real images with real motion. This can be achieved with zero-flow, in other words \(x_1 = x_2\) and \(y=\mathbf {0}\), leading to the loss \(\mathcal{L}_{\text{zero}} = ||g(x_1, x_1) - \mathbf{0}||\).
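The three schemes above can be combined into a single training objective. The sketch below is ours, not the authors' code: the per-term weights `w` and the plain L1 distance are assumptions (the paper fine-tunes with a multi-scale L1 loss), and `student`, `teacher` and `warp` stand in for the two models and a warping routine:

```python
import numpy as np

def l1(pred, target):
    """Mean absolute (L1) distance between two flow fields."""
    return float(np.mean(np.abs(pred - target)))

def combined_loss(student, teacher, warp, x1, x2, w=(1.0, 1.0, 1.0)):
    """Sum of the three self-supervised terms (weights `w` are hypothetical):
      pseudo-label: student(x1, x2)       vs. teacher flow y_tilde
      teacher-warp: student(x1, x2_tilde) vs. y_tilde, with
                    x2_tilde = warp(x1, y_tilde)
      zero-flow:    student(x1, x1)       vs. the zero field
    """
    y_t = teacher(x1, x2)                 # pseudo-label from the frozen teacher
    l_pseudo = l1(student(x1, x2), y_t)   # real pair, approximated label
    x2_t = warp(x1, y_t)                  # synthetic second frame
    l_warp = l1(student(x1, x2_t), y_t)   # y_t is exact for (x1, x2_t)
    l_zero = l1(student(x1, x1), np.zeros_like(y_t))  # identical pair, zero flow
    return w[0] * l_pseudo + w[1] * l_warp + w[2] * l_zero
```

In the actual fine-tuning, `student` would be the trainable FlowNet2S and `teacher` the frozen FlowNet2, with gradients flowing only through the student.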

**Chromatic Augmentation.** To increase the learning challenge for the student, we globally alter the contrast \(\gamma \) and brightness \(\beta \) of the value channel \(x_v\) in HSV colour space, \(x_v' = \gamma \cdot x_v + \beta \), and additionally add Gaussian noise *n* to each pixel individually.

## 4 Experiments

**Table 1.** Average cycle consistency error (CCE) and average relative endpoint error (EPE*). The best EPE* is achieved with a combined training scheme of teacher labels, teacher warp and zero-flow. The CCE for training schemes including zero-flow is not conclusive and is therefore grayed out. (\(\dagger \): not adapted to the medical image domain; +F: after fine-tuning.)

### 4.1 Setup

For our experiments we follow a similar setup to our previous work [12]. The main difference is the extension of the loss function and corresponding variations of the training samples that were added to the datasets.

**Datasets.** For our experiments we use four endoscopic video sequences from the Hamlyn datasets: **scale** and **rotation**: both in vivo porcine abdomen [17]; **sparse** texture (and respiratory motion): in vivo porcine liver [17]; strong **deformation** from tool interaction (and low texture): in vivo lung lobectomy procedure [8]. All sequences show specular highlights. The sparse-texture sequence shows very challenging lighting conditions. Each dataset represents an individual patient. To simulate intra-operative sampling from a patient-specific target domain we split the sequences into disjoint sets. Training and validation set represent the sampling phase (preparation stage), while the application phase during intervention is represented by the test set. The splits of the subsets were chosen manually, so that training and test data both cover dataset-specific motion. The numbers of samples for training/validation/test set are: scale - 600/240/397; rotation - 329/110/161; sparse - 399/100/307; deformation - 600/200/100. The left camera was used for all datasets.

**Model.** For all our experiments we used the accurate, high-complexity FlowNet2 framework [13] running at approx. 12 fps as our teacher model \(h\), and its fast FlowNet2S component running at approx. 80 fps as our low-complexity student model \(g\). FlowNet2 achieves very high accuracy on the Sintel benchmark [3], which includes elastic deformation, specular highlights and motion blur, all of which are common effects in endoscopy. FlowNet2S has been shown to handle sparse textures well [9]. We utilized the implementation and pretrained models from [18].

**Training and Augmentation Parameters.** For training we used the Adam optimizer with a constant learning rate of 1e−4 and a batch size of 8. Training was performed until convergence of the validation loss, for a maximum of 120 epochs. Extension and augmentation of the training data set increases training times; this was not the focus of this work but should be addressed in future work. We follow Sun et al.'s recommendation to fine-tune FlowNet2S with the multi-scale L1-loss [13, 20]. For illumination augmentation, we sampled brightness \(\beta \sim \mathcal {N}(0,3)\) and contrast \(\gamma \sim \mathcal {N}(1,0.15)\) for each image individually. Noise *n* was sampled from \(\mathcal {N}(0,2)\) for each pixel. The value range of the input images was between 0 and 255; no normalization was performed on the input images.
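With the parameters above, the chromatic augmentation can be sketched as follows. This is our reading, not the authors' code: the second argument of \(\mathcal{N}\) is taken as a standard deviation, and clipping back to the 0-255 value range is our assumption:

```python
import numpy as np

def chromatic_augment(value_channel, rng):
    """Chromatic augmentation of the HSV value channel (range 0..255).

    Global contrast/brightness jitter plus per-pixel Gaussian noise:
    beta ~ N(0, 3), gamma ~ N(1, 0.15) drawn once per image,
    n ~ N(0, 2) drawn per pixel, applied as gamma * x_v + beta + n.
    """
    beta = rng.normal(0.0, 3.0)           # brightness offset, one per image
    gamma = rng.normal(1.0, 0.15)         # contrast factor, one per image
    n = rng.normal(0.0, 2.0, size=value_channel.shape)  # per-pixel noise
    out = gamma * value_channel + beta + n
    return np.clip(out, 0.0, 255.0)       # stay in the stated input range
```

The augmented channel would then be converted back from HSV before being fed to the student, so only the input is perturbed while the label stays untouched.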

### 4.2 Results

We provide the relative endpoint error (\(\text {EPE}^* = ||g(x_1,x_2)- \tilde{y}||_2\)) as well as the cycle consistency error (\(\text {CCE} = ||\mathbf {p} - \hat{y}^{-1}\big (\hat{y}(\mathbf {p}) \big )||_2\), \(\mathbf {p}\): image coordinates) for all our experiments in Table 1. We compare our fine-tuned models with the teacher and the student model, as well as with our baseline (teacher labels only). All results were obtained on the test sets described in Sect. 4.1.
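As defined above, EPE* reduces to a mean Euclidean distance between the student's flow and the teacher's pseudo-label; a small NumPy sketch (function name is ours):

```python
import numpy as np

def epe_star(student_flow, teacher_flow):
    """Average relative endpoint error ||g(x1, x2) - y_tilde||_2.

    Both inputs are HxWx2 flow fields; the per-pixel Euclidean length of
    their difference is averaged over the image.
    """
    diff = student_flow - teacher_flow
    return float(np.mean(np.sqrt((diff ** 2).sum(axis=-1))))
```

Note that, because the reference is the teacher's pseudo-label rather than ground truth, a low EPE* measures agreement with the teacher, not accuracy per se, which is exactly the caveat discussed below.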

The best EPE* is achieved with a combined training scheme of teacher labels, teacher warp and zero-flow. The best EPE* does not necessarily indicate the best model, though: a low EPE* indicates high similarity to the teacher's estimation. Almost all fine-tuned models achieve a better cycle consistency than the original student model, which indicates an improved model on the target domain. A low CCE, however, is no guarantee of a good model; the best CCE is achieved by estimating zero-flow (which is not necessarily a correct estimation).

Overall, adding teacher-warp to the training samples always improved the EPE*; however, it should not be used on its own. As expected, training with added zero-flow almost always improved cycle consistency. Adding image augmentation seems to enforce conservative estimation: in Figs. 3 and 4 the model predicted less motion than with all other training schemes. This may be beneficial in some cases, e.g. for very noisy data; however, the hyper-parameters for chromatic augmentation need to be chosen carefully to match the target domain.

## 5 Conclusion

We proposed advanced, self-supervised domain adaptation methods for tissue tracking based on a teacher-student approach. This tackles the problem of lacking annotations for real tissue motion to train fast OF networks. The advanced methods improve on the baseline in many cases, and in some cases even outperform the complex teacher model.

Further studies are required to determine the best overall strategy. We plan to extend the training scheme with further domain knowledge. Pseudo image pairs can be created using affine transformations, which is equivalent to synthetic camera movement. Cycle consistency can be, and has been, used for training and for estimating occlusion maps [15]. DDFlow learns occlusion maps using an unsupervised teacher-student approach [14], which seems like a very promising extension of our approach. Occlusion maps are an essential next step for safe tissue tracking.

## Footnotes

- 1. Training samples should ideally be identical to application samples. We therefore also propose to obtain training samples directly prior to the surgical intervention in the operating room. Intra-operative training time was on average 15 min.

## Notes

### Acknowledgements

This work has received funding from the European Union as being part of the EFRE OPhonLas project.

## References

- 1.Armin, M.A., Barnes, N., Khan, S., Liu, M., Grimpen, F., Salvado, O.: Unsupervised learning of endoscopy video frames’ correspondences from global and local transformation. In: Stoyanov, D., et al. (eds.) CARE/CLIP/OR 2.0/ISIC -2018. LNCS, vol. 11041, pp. 108–117. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01201-4_13
- 2.Buciluă, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: ACM SIGKDD, pp. 535–541 (2006)
- 3.Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: IEEE ECCV, pp. 611–625 (2012). https://doi.org/10.1007/978-3-642-33783-3_44
- 4.Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning. In: IEEE ICCV, pp. 2051–2060 (2017)
- 5.Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: IEEE ICCV (2015). https://doi.org/10.1109/ICCV.2015.316
- 6.French, G., Mackiewicz, M., Fisher, M.: Self-ensembling for visual domain adaptation. arXiv:1706.05208 (2017)
- 7.Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The Kitti vision benchmark suite, pp. 3354–3361, May 2012. https://doi.org/10.1109/CVPR.2012.6248074
- 8.Giannarou, S., Visentini-Scarzanella, M., Yang, G.Z.: Probabilistic tracking of affine-invariant anisotropic regions. IEEE TPAMI **35**(1), 130–143 (2013). https://doi.org/10.1109/TPAMI.2012.81
- 9.Guerre, A., Lamard, M., Conze, P.H., Cochener, B., Quellec, G.: Optical flow estimation in ocular endoscopy videos using FlowNet on simulated endoscopy data. In: IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1463–1466 (2018). https://doi.org/10.1109/ISBI.2018.8363848
- 10.Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv:1503.02531 (2015)
- 11.Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. Intell. **17**, 185–203 (1981). https://doi.org/10.1016/0004-3702(81)90024-2
- 12.Ihler, S., Laves, M.H., Ortmaier, T.: Patient-specific domain adaptation for fast optical flow based on teacher-student knowledge transfer. arXiv:2007.04928 (2020)
- 13.Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: evolution of optical flow estimation with deep networks. In: IEEE CVPR, July 2017. https://doi.org/10.1109/CVPR.2017.179
- 14.Liu, P., King, I., Lyu, M.R., Xu, J.: DDFlow: learning optical flow with unlabeled data distillation. In: AAAI, vol. 33, pp. 8770–8777 (2019). https://doi.org/10.1609/aaai.v33i01.33018770
- 15.Meister, S., Hur, J., Roth, S.: UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In: AAAI, New Orleans, Louisiana, pp. 7251–7259, February 2018. arXiv:1711.07837
- 16.Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: IEEE CVPR (2015). https://doi.org/10.1109/CVPR.2015.7298925
- 17.Mountney, P., Stoyanov, D., Yang, G.: Three-dimensional tissue deformation recovery and tracking. IEEE Signal Process. Mag. **27**(4), 14–24 (2010). https://doi.org/10.1109/MSP.2010.936728
- 18.Reda, F., Pottorff, R., Barker, J., Catanzaro, B.: flownet2-pytorch: PyTorch implementation of FlowNet 2.0: evolution of optical flow estimation with deep networks (2017). https://github.com/NVIDIA/flownet2-pytorch
- 19.Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: IEEE CVPR, pp. 2432–2439, June 2010. https://doi.org/10.1109/CVPR.2010.5539939
- 20.Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Models matter, so does training: an empirical study of CNNs for optical flow estimation. arXiv:1809.05571 (2018)
- 21.Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: IEEE CVPR, pp. 8934–8943 (2018). https://doi.org/10.1109/CVPR.2018.00931
- 22.Wulff, J., Black, M.J.: Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In: IEEE CVPR, pp. 120–130 (2015). https://doi.org/10.1109/CVPR.2015.7298607
- 23.Yip, M.C., Lowe, D.G., Salcudean, S.E., Rohling, R.N., Nguan, C.Y.: Tissue tracking and registration for image-guided surgery. IEEE Trans. Med. Imaging **31**(11), 2169–2182 (2012). https://doi.org/10.1109/TMI.2012.2212718
- 24.Yu, J.J., Harley, A.W., Derpanis, K.G.: Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 3–10. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_1