Introduction

3D scene reconstruction in endoscopic surgery has a significant impact on the development of automated surgery and promotes various downstream applications such as surgical navigation, depth perception and augmented reality [1,2,3]. However, many challenges remain unresolved in dense depth estimation within endoscopic scenes. The variability of soft tissues and the occlusion caused by surgical tools pose high demands on a model's ability to reconstruct dynamic depth maps [4]. Recent methods have focused on utilizing binocular information to obtain disparity maps and reconstruct depth [1, 4]. However, apart from the da Vinci surgical robot system, most endoscopic surgical robot systems are equipped with only a monocular camera, which is a more cost-effective and easily implementable hardware solution. Therefore, precise depth estimation from monocular endoscopy remains an area that requires further exploration.

Recently, foundation models have become one of the most popular topics in deep learning [5, 6]. Thanks to their large number of parameters, foundation models can build long-term memory of massive training data, achieving state-of-the-art performance on various downstream tasks involving vision, text and multimodal inputs. However, when encountering domain-specific scenarios such as surgical scenes, the predictive ability of foundation models tends to decline significantly [7]. Due to the limited availability of annotated data in medical scenes and insufficient computational resources, training a medical-specific foundation model from scratch poses various challenges. Therefore, there has been extensive discussion on adapting existing foundation models to different sub-domains, maximizing the utilization of their pre-trained parameters while fine-tuning them for target application scenarios under limited computational resources [7,8,9]. Chen et al. [8] constructed an adapter from two MLP layers and an activation function, without any prompt input, to fine-tune the Segment Anything Model (SAM). Wu et al. [9], on the other hand, used a simple pixel classifier as a self-prompt to achieve zero-shot segmentation based on SAM. However, adapter layers slow down inference, and prompts cannot be directly optimized through training. Therefore, we design our adaptation solution based on Low-Rank Adaptation (LoRA) [10]. LoRA adds a bypass alongside the original model weights that performs a dimensionality reduction followed by a re-projection, modeling the intrinsic rank of the weight update. When deployed in production, LoRA introduces no inference delay, since the LoRA parameters only need to be merged with the pre-trained model parameters. LoRA can therefore serve as an efficient adaptation tool in real-world applications of foundation models.

Additionally, current works on fine-tuning vision foundation models for the medical domain have focused on common tasks such as segmentation and detection, with limited exploration of pixel-wise regression tasks like depth estimation. Supervised training paradigms for vision foundation models are typically tailored to common semantic understanding tasks and may therefore not suit our needs. We thus choose DINOv2 [6] as the starting point for our study. DINOv2 is a self-supervised foundation model for multiple vision tasks. Its self-supervised training paradigm enables it to learn unified visual features effectively, so only customized decoders are required to adapt DINOv2 to various downstream visual tasks, including depth estimation. We therefore aim to explore fine-tuning the DINOv2 encoder to fully utilize its extensive pre-trained parameters and benefit downstream depth estimation in the surgical domain. Specifically, our key contributions and findings are:

  • We are the first to extend the computer vision foundation model DINOv2 to explore its capability on medical image depth estimation problems.

  • We present an adaptation and fine-tuning strategy for DINOv2 toward the surgical image domain based on the Low-Rank Adaptation technique, with low additional training costs.

  • Our method, Surgical-DINO, is validated on two publicly available datasets and obtains superior performance over other state-of-the-art depth estimation methods for surgical images. We also find that the zero-shot foundation model is not yet ready for use in surgical applications and that LoRA adaptation is crucial, outperforming naive fine-tuning.

Fig. 1

The proposed Surgical-DINO framework. The input image is transformed into tokens by extracting scaled-down patches followed by a linear projection. A positional embedding and a patch-independent class token (red) are used to augment the embedding subsequently. We freeze the image encoder and add trainable LoRA layers to fine-tune the model. We extract tokens from different layers, then up-sample and concatenate them to form the embedding features. Another trainable decode head is used on top of the frozen model to estimate the final depth

Methodology

Preliminaries

DINOv2

Learning task-agnostic pre-trained representations has proven extremely effective in Natural Language Processing (NLP) [11]: their features can be used on downstream tasks without fine-tuning and still significantly outperform task-specific models. Oquab et al. [6] developed a similar "foundation" model for computer vision, named DINOv2, whose image-level and pixel-level visual features can be used without any task limitation. They proposed an automatic pipeline to build a large, curated and dedicated image dataset and an unsupervised learning method to learn robust vision features. A ViT model [12] with 1B parameters was trained in a discriminative self-supervised manner and distilled into a series of smaller models that surpass the best available all-purpose features on most image-level and pixel-level benchmarks. Depth estimation, a classical dense prediction task, was also evaluated by training a simple depth decoder head on top of DINOv2 and achieved excellent performance on natural images. The huge domain gap between medical and natural images may impede the utilization of such a foundation model; thus, we attempt to develop a simple but effective adaptation method to exploit DINOv2 for the surgical domain.

LoRA

Low-Rank Adaptation (LoRA) was first proposed in [10] to fine-tune large-scale foundation models in NLP for downstream tasks. It was inspired by the observation that pre-trained large models have a low "intrinsic dimension": random projection onto a smaller subspace does not impair their ability to learn effectively. By injecting trainable rank decomposition matrices into each layer of the Transformer architecture and freezing the pre-trained model weights, LoRA significantly reduces the number of trainable parameters for downstream tasks. Specifically, for a pre-trained weight matrix \(W_{0}\in {\mathbb {R}}^{d \times k}\), LoRA restricts its update with a low-rank decomposition \( W_{0} + \Delta W = W_{0} + BA\), where \(B\in {\mathbb {R}}^{d \times r}, A\in {\mathbb {R}}^{r \times k}\) and the rank \(r\ll \min (d,k)\). During training, \(W_{0}\) receives no gradient updates; only A and B contain trainable parameters. The modified forward pass is then described as:

$$\begin{aligned} h = W_{0}x + \Delta Wx = W_{0}x + BAx. \end{aligned}$$
(1)

This implementation significantly reduces the memory and storage usage during training and is thus very suitable for fine-tuning large-scale foundation models to downstream tasks.
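For concreteness, a minimal PyTorch sketch of the update in Eq. (1), wrapping an existing linear layer, is given below; the class and variable names are illustrative, and the initialization (A random, B zero) ensures the model starts from the frozen pre-trained behavior.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal sketch of Eq. (1): h = W0 x + B A x, with W0 frozen."""

    def __init__(self, base_linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():           # freeze the pre-trained weight W0 (and bias)
            p.requires_grad_(False)
        d, k = base_linear.out_features, base_linear.in_features
        self.lora_A = nn.Parameter(torch.zeros(rank, k))   # A in R^{r x k}
        self.lora_B = nn.Parameter(torch.zeros(d, rank))   # B in R^{d x r}
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # B stays zero, so BA = 0 at the start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path W0 x plus the trainable low-rank update B(Ax)
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T
```

Wrapping a pre-trained `nn.Linear` in such a module leaves its output unchanged at initialization, while the two factor matrices contribute only \(r(d+k)\) trainable parameters.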

Surgical-DINO

As illustrated in Fig. 1, the architecture of our proposed Surgical-DINO depth estimation framework inherits from DINOv2. Given a surgical image \(x\in {\mathbb {R}}^{H\times W\times C}\) with spatial resolution \(H\times W\) and C channels, we aim to predict a depth map \({\hat{D}}\in {\mathbb {R}}^{H\times W}\) as close to the ground-truth depth as possible. DINOv2 serves as the image encoder: images are first split into patches of size \(p^2\) and then flattened with a linear projection. A positional embedding is added to the tokens, together with a learnable class token that aggregates the global image information for subsequent processing. The image embeddings then pass through a series of Transformer blocks to generate new token representations. All parameters of the DINOv2 image encoder are frozen during training, and we add trainable LoRA layers to each Transformer block to capture the learnable information. These side LoRA layers, as described in the previous section, compress the Transformer vision features into a low-rank space and then re-project them to match the output feature channels of the frozen Transformer blocks. LoRA layers in different Transformer blocks work independently and do not share weights. Several intermediate token representations and the final output are reshaped, bi-linearly up-sampled by a factor of 4 and concatenated to form the overall feature representation. A simple trainable depth decoder head is applied at the end to predict the depth map.

LoRA layers

Different from fine-tuning the whole model, freezing the model and adding trainable LoRA layers largely reduces the memory and computation required for training and also makes the model convenient to deploy. The LoRA design in Surgical-DINO is presented in Fig. 2. We follow [13], where the low-rank approximation is applied only to the q and v projection layers to avoid excessive influence on the attention scores. With the aforementioned formulation of LoRA, for an encoded token embedding x, the q, k and v projection layers within a multi-head self-attention block become:

$$\begin{aligned} \begin{aligned} Q&={\hat{W}}_q x=W_q x+B_q A_q x, \\ K&=W_k x, \\ V&={\hat{W}}_v x=W_v x+B_v A_v x, \\ \end{aligned} \end{aligned}$$
(2)

where \(W_q, W_k\) and \(W_v\) are the frozen projection layers for q, k and v, and \(A_q, B_q, A_v\) and \(B_v\) are the trainable LoRA layers. The self-attention mechanism itself remains unchanged and is described by:

$$\begin{aligned} {\text {Att}}(Q, K, V)={\text {Softmax}}\left( \frac{Q K^T}{\sqrt{C_{\text {out }}}}+B\right) V \end{aligned}$$
(3)

where \(C_{\text {out }}\) denotes the output channel dimension of the projected tokens.
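A hypothetical sketch of this design is given below, assuming the common ViT implementation detail that q, k and v are produced by a single fused linear layer; only the q and v outputs receive the low-rank updates of Eq. (2), while k is left untouched.

```python
import torch
import torch.nn as nn


class LoRAFusedQKV(nn.Module):
    """Hypothetical sketch of Eq. (2): LoRA applied to q and v only, assuming the
    q, k, v projections are implemented as one fused nn.Linear (out_features = 3 * dim)."""

    def __init__(self, qkv: nn.Linear, rank: int = 4):
        super().__init__()
        self.qkv = qkv                                  # frozen W_q, W_k, W_v stacked together
        for p in self.qkv.parameters():
            p.requires_grad_(False)
        dim = qkv.in_features
        self.A_q = nn.Linear(dim, rank, bias=False)
        self.B_q = nn.Linear(rank, dim, bias=False)
        self.A_v = nn.Linear(dim, rank, bias=False)
        self.B_v = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.B_q.weight)                 # start from Delta W = 0
        nn.init.zeros_(self.B_v.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)          # frozen projections
        q = q + self.B_q(self.A_q(x))                   # Q = W_q x + B_q A_q x
        v = v + self.B_v(self.A_v(x))                   # V = W_v x + B_v A_v x
        return torch.cat([q, k, v], dim=-1)             # K is left unchanged
```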

Fig. 2

The LoRA design in Surgical-DINO. We apply LoRA only to the q and v projection layers in each Transformer block. \(W_{q}, W_{k}, W_{v}\) and \(W_{o}\) denote the projection layers of q, k, v and o, respectively

Network architecture

Image Encoder. The image is first divided into non-overlapping patches and then projected to image embeddings via a patch embedding step. The image embeddings form a set of tokens \(t^0=\left\{ t_0^0, \ldots , t_{N_p}^0\right\} \), \(t_n^0 \in {\mathbb {R}}^D\), where p is the patch size, \(N_p=\frac{HW}{p^{2}}\), \(t_0^0\) is the class token and D is the feature dimension of each token. L Transformer blocks then transform the image tokens into feature representations \(t^l\), where \(t^l\) denotes the output of the l-th Transformer block. We use the pre-trained ViT-Base model from DINOv2 as our image encoder, with 12 Transformer blocks and a feature dimension of 768.
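As a concrete example of the token bookkeeping, assuming DINOv2's patch size of 14 and the \(224 \times 224\) inputs used in our experiments:

```python
# Illustrative token bookkeeping, assuming DINOv2's patch size p = 14 and the
# 224 x 224 inputs used in our experiments.
H = W = 224
p = 14
D = 768                       # ViT-Base feature dimension
N_p = (H // p) * (W // p)     # 16 * 16 = 256 patch tokens
print(N_p + 1, D)             # 257 tokens (patches plus the class token), each in R^D
```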

Depth Decoder. We extract the token representations from layers \(l \in {\left\{ 3,6,9,12\right\} }\), unflatten them to match the patch resolution and up-sample the tokens by a factor of 4 to increase the resolution. We treat depth prediction as a classification problem, dividing the depth range into 256 uniformly distributed bins and using a linear layer to predict the depth. The predicted map is finally rescaled to the input resolution.
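One plausible realization of such a bin-based readout is sketched below; the soft-weighted combination of bin centers and the 150 mm depth range are illustrative assumptions rather than the exact head configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinDepthHead(nn.Module):
    """Sketch of a per-pixel linear head that classifies depth into 256 uniform bins
    and converts the probabilities into a continuous depth; the soft-weighted readout
    and the 150 mm range are illustrative assumptions."""

    def __init__(self, in_channels: int, n_bins: int = 256, max_depth: float = 150.0):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, n_bins, kernel_size=1)         # per-pixel linear layer
        centers = (torch.arange(n_bins).float() + 0.5) * (max_depth / n_bins)   # uniform bin centers
        self.register_buffer("bin_centers", centers)

    def forward(self, feats: torch.Tensor, out_size) -> torch.Tensor:
        logits = self.classifier(feats)                                          # (B, n_bins, h, w)
        probs = F.softmax(logits, dim=1)
        depth = (probs * self.bin_centers.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
        # Rescale the prediction to the input resolution
        return F.interpolate(depth, size=out_size, mode="bilinear", align_corners=False)
```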

Loss functions

Surgical-DINO utilizes the scale-invariant depth loss [14] and the gradient loss [15] as supervision constraints for the fine-tuning process. They are described by:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{\text {pixel }}&=\lambda _{1}\sqrt{\frac{1}{n} \sum _i (g_i)^2-\frac{\lambda _{2}}{n^2}\left( \sum _i g_i\right) ^2} \\ {\mathcal {L}}_{\text {grad }}&=\lambda _{3}\frac{1}{n} \sum _k \sum _i\left( \vert \nabla _x g_i^k\vert + \vert \nabla _y g_i^k \vert \right) \\ \end{aligned} \end{aligned}$$
(4)

where n denotes the number of valid pixels and \( g_{i}^{k} = \log {\tilde{d}}_{i}^{k} - \log d_{i}^{k}\) is the value of the log-depth difference map at position i and scale k. \({\mathcal {L}}_{\text {pixel }}\) guides the network to predict accurate depth, while \({\mathcal {L}}_{\text {grad }}\) encourages the network to predict smooth gradient changes. The final loss is then described as:

$$\begin{aligned} {\mathcal {L}} = {\mathcal {L}}_{\text {pixel }} + {\mathcal {L}}_{\text {grad }}. \end{aligned}$$
(5)
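A single-scale sketch of these losses is given below; the multi-scale sum over k in \({\mathcal {L}}_{\text {grad }}\) is omitted for brevity, and the handling of invalid pixels is illustrative.

```python
import torch


def surgical_dino_loss(pred, gt, lambdas=(1.0, 0.85, 0.5), eps=1e-6):
    """Single-scale sketch of Eqs. (4)-(5); pred and gt are (B, 1, H, W) depth maps,
    and pixels with gt <= 0 are treated as invalid."""
    l1, l2, l3 = lambdas
    valid = gt > 0
    g = torch.log(pred.clamp(min=eps)) - torch.log(gt.clamp(min=eps))
    g = torch.where(valid, g, torch.zeros_like(g))        # zero-out invalid pixels
    n = valid.sum().clamp(min=1).float()                  # number of valid pixels

    # Scale-invariant pixel loss
    loss_pixel = l1 * torch.sqrt((g ** 2).sum() / n - l2 * (g.sum() / n) ** 2 + eps)

    # Gradient loss on the log-depth difference map (one scale k shown)
    grad_x = (g[..., :, 1:] - g[..., :, :-1]).abs()
    grad_y = (g[..., 1:, :] - g[..., :-1, :]).abs()
    loss_grad = l3 * (grad_x.sum() + grad_y.sum()) / n

    return loss_pixel + loss_grad                         # Eq. (5)
```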
Table 1 Quantitative depth comparison on the SCARED dataset of SOTA depth estimation methods
Table 2 Quantitative depth comparison on Hamlyn dataset

Experiment

Dataset

The SCARED dataset is collected with a da Vinci Xi endoscope from fresh porcine cadaver abdominal anatomy and contains 35 endoscopic videos with 22,950 frames. A projector is used to obtain high-quality depth maps of the scene. Each video has ground-truth depth and ego-motion, but we only use depth to evaluate our method. We follow the split scheme in [16], where the SCARED dataset is split into 15,351, 1,705 and 551 frames for the training, validation and test sets, respectively.

The Hamlyn dataset consists of laparoscopic and endoscopic videos taken from various surgical procedures with challenging in vivo scenes. We follow the selection in [17], with 21 videos for validation.

Implementation details

The framework is implemented with PyTorch on an NVIDIA RTX 3090 GPU. We adopt the AdamW [18] optimizer with an initial learning rate of \(1 \times 10^{-5}\) and a weight decay of \(1 \times 10^{-4}\). The batch size is set to 8, with 50 epochs in total. We achieve our evaluation results with the following loss weights: \( \lambda _{1} = 1.0, \lambda _{2} = 0.85, \lambda _{3} = 0.5\). The images are resized to \(224 \times 224\) pixels. We also train our proposed model in a Self-Supervised Learning (SSL) manner based on AF-SfMLearner [16], replacing its encoder with Surgical-DINO and resizing the images to \(224 \times 224\) pixels to fit the patch size of DINOv2.
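For reference, a minimal sketch of the optimizer setup is shown below; the placeholder module stands in for Surgical-DINO, in which only the LoRA layers and the depth decoder head are left trainable.

```python
import torch
import torch.nn as nn

# Minimal sketch of the optimizer setup described above; the placeholder module stands
# in for Surgical-DINO, where only the LoRA layers and the depth decoder head are trainable.
model = nn.Linear(8, 8)                                   # placeholder for Surgical-DINO
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=1e-4)
# Batch size 8, 50 epochs; loss weights lambda_1 = 1.0, lambda_2 = 0.85, lambda_3 = 0.5.
```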

Performance metrics

We evaluate our method with five common depth estimation metrics: Abs Rel, Sq Rel, RMSE, RMSE log and \(\delta \), where lower is better for the first four metrics and higher is better for the last. During evaluation, we re-scale the predicted depth map with the median scaling method introduced by SfMLearner [19], which can be expressed as

$$\begin{aligned} {\textbf{D}}_{\text {scaled }}=\left( {\textbf{D}}_{\text {pred }} *\left( \text { median }\left( {\textbf{D}}_{\text {gt }}\right) / \text { median }\left( {\textbf{D}}_{\text {pred }}\right) \right) \right) . \end{aligned}$$
(6)

We cap the depth map at 150 mm, which covers almost all depth values.
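A minimal sketch of this evaluation protocol, combining the median scaling of Eq. (6) with the 150 mm cap and including the Abs Rel metric as a usage example, is:

```python
import numpy as np


def evaluate_depth(pred, gt, max_depth=150.0):
    """Sketch of the evaluation protocol: median scaling (Eq. (6)), a 150 mm cap,
    and the Abs Rel metric as a usage example; pred and gt are same-shaped arrays."""
    mask = gt > 0                                              # valid ground-truth pixels
    scaled = pred * (np.median(gt[mask]) / np.median(pred[mask]))
    scaled = np.clip(scaled, 1e-3, max_depth)                  # cap the depth at 150 mm
    abs_rel = np.mean(np.abs(scaled[mask] - gt[mask]) / gt[mask])
    return abs_rel
```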

Results

Quantitative results on SCARED. We compare our proposed method with several SOTA self-supervised methods [16, 19,20,21,22,23,24], as well as with DINOv2 in zero-shot, self-supervised and supervised settings; the results are shown in Table 1. All baseline methods were reproduced with their original implementations under the dataset splits mentioned above. The zero-shot performance of pre-trained DINOv2 is evaluated with the ViT-Base model and the same depth decoder head fine-tuned on NYU Depth V2 [25]. Our method obtains superior performance on all evaluation metrics compared to all baselines. It is worth noting that zero-shot DINOv2 yields the worst results, indicating that vision features and a depth decoder that are highly effective on natural images are unsuitable for medical images due to the large domain gap. While the fine-tuned DINOv2 exceeds other SOTA self-supervised methods in RMSE and RMSE log, it does not perform better on the other three metrics, suggesting that its predictions contain more large depth errors. Fine-tuning only a depth decoder head is not enough to transfer the vision features to geometric relations within medical images. With LoRA adaptation, the network is able to learn medical domain-specific vision features and relate them to depth information, resulting in an improvement in estimation accuracy.

Quantitative results on Hamlyn. We perform zero-shot validation on the Hamlyn dataset for our model trained on SCARED, without any fine-tuning. For comparison, we zero-shot validate AF-SfMLearner with its best model and obtain the results of Endo-Depth-and-Motion [17] by averaging its 21-fold cross-validation results trained on Hamlyn. As presented in Table 2, our method achieves superior performance against the other methods, revealing good generalization across different cameras and surgical scenes.

Table 3 Comparison of encoder parameters, trainable parameters, trainable parameters’ ratio and full model inference speed
Fig. 3

Qualitative depth comparison on the SCARED dataset

Table 4 Ablation study on the rank size of the LoRA layer
Table 5 Ablation study on the size of pre-trained foundation model

Model complexity and speed evaluation. The proposed model's parameters, trainable parameters, trainable parameter ratio and inference speed are evaluated on an NVIDIA RTX 3090 GPU and compared with AF-SfMLearner. Table 3 shows that while Surgical-DINO has a larger number of parameters, only a very small fraction is trainable, making it faster to train and converge. The inference speed of Surgical-DINO is slower than AF-SfMLearner but still within an acceptable range for real-time applications.

Qualitative results. We also show qualitative results in Fig. 3. Our method depicts anatomical structure well compared with other methods. Nevertheless, the qualitative results of our proposed Surgical-DINO also show drawbacks, such as a lack of continuity, which motivates future improvements.

Ablation studies

Effects of the rank size of the LoRA layer. A set of comparative experiments is performed to evaluate the effect of the rank size of the LoRA layer. We evaluate four different rank sizes, and the results are shown in Table 4. We observe that the performance of Surgical-DINO increases with the rank size within a certain low range and starts to drop once the rank exceeds a certain value. This implies that, despite being designed around low-rank decomposition, LoRA still requires a certain number of trainable parameters to fit downstream tasks. However, too many trainable parameters may mislead the original weights, resulting in performance degradation.

Effects of the size of the pre-trained foundation model. DINOv2 provides four pre-trained ViT foundation models named by their size. Table 5 presents the ablation study investigating the effect of the size of the pre-trained foundation model. We find that performance increases with the size of the pre-trained model. Larger models inherently have better integration and generalization of vision features and thus fit downstream tasks better. However, larger models also come with larger memory footprints and training costs, so we choose ViT-Base for our depth estimation method as a compromise between performance and cost.

Conclusions

Depth estimation is a vital task in robotic surgery and benefits many downstream tasks such as surgical navigation and 3D reconstruction. Vision foundation models that capture universal vision features have proven both effective and convenient in many vision tasks but still need more exploration in the surgical domain. We have presented Surgical-DINO, an adapter learning method that utilizes the vision foundation model DINOv2 for surgical scene depth estimation. We design LoRA layers to fine-tune the network with a small number of additional parameters to adapt it to the surgical domain. Experiments on publicly available datasets demonstrate the superior performance of the proposed Surgical-DINO. We are the first to explore deploying a vision foundation model for surgical depth estimation and reveal its enormous potential. Future work could explore the foundation model in supervised, self-supervised and unsupervised manners to investigate its robustness and reliability in the surgical domain.