Introduction

Camera pose estimation is a well-established computer vision problem at the core of numerous applications of medical robotic systems for minimally invasive surgery (MIS). Among the variety of methods proposed in recent years, most approaches rely on Simultaneous Localization and Mapping (SLAM) and Visual Odometry (VO) frameworks to solve the pose estimation problem. Well-established techniques such as ORB-SLAM2 [1] and ElasticFusion [2] have shown great promise in rigid scenes. More recently, non-rigid cases in MIS using monocular [3,4,5] and stereoscopic cameras [6,7,8] have also been studied. Yet pose estimation in typical MIS settings remains particularly difficult due to deformations caused by instruments and breathing, self- or instrument-based occlusions, textureless surfaces, and tissue specularities.

In this work, we tackle the problem of pose estimation in such difficult cases using a stereo endoscopic camera system. Stereo allows depth to be estimated from parallax, which has been shown to improve the robustness of SLAM methods [1]. In contrast to [6, 7], which assume the tissue is smooth and locally rigid, respectively, we avoid making assumptions on the tissue deformation and topology. Instead, we propose a dense stereo VO framework that handles tissue deformations and the complexity of surgical scenes. To do this, our approach leverages geometric pose optimization by inferring where to look in the scene. At its core, our method uses a Deep Declarative Network (DDN) [9] to enable backpropagation of gradients through the pose optimization.

More specifically, we propose to integrate two adaptive weight maps whose role is to balance the contribution of two geometric losses as well as the contribution of each pixel to each of these losses. We learn these adaptive weight maps using a DDN with the goal of solving the pose estimation problem, inspired by the recent DiffPoseNet [10] approach. Like theirs, our method exploits the expressiveness of neural networks while retaining the robustness of geometric optimization. This allows our method to adapt the contribution of each image region to the image content, both within each loss and between the two losses. We thoroughly evaluate our approach by characterizing its performance in comparison with the state-of-the-art on various practical scenarios: rigid scenes, breathing, scanning, and deforming sequences. This validation is performed on two different datasets, and we show that our method allows for more precise pose estimation in a wide range of cases.

Method

In the following, we present our depth-based pose estimation approach from an optimization perspective. We first derive our method in terms of context-specific adaptive weight maps in the pose estimation optimization and then show how to learn these from data in an end-to-end way using DDNs to facilitate differentiation [9]. Our proposed method is illustrated in Fig. 1, and we provide a notation overview in Table 1.

Fig. 1

Overview of our proposed VO framework, which can be sub-divided into 3 parts (from left to right): (1) optical flow and depth are computed using RAFT [11], (2) weight maps are computed and (3) used to estimate the camera pose \(\textbf{p}^\star _t\). Weight maps \((\omega _{\text {2D}}(\textbf{x})\), \(\omega _{\text {3D}}(\textbf{x}))\) are learned via backpropagation using the Deep Declarative Network (DDN) [9]

For all images in an image sequence, we first employ RAFT [11] to establish correspondences between frames in both the stereo and temporal domains. This model allows disparity and optical flow to be estimated simultaneously, based on the observation that both share similar constraints on relative pixel displacements. From these estimates, we extract the horizontal component of the parallax flow \(\mathcal {F}_t'\) at time t as the stereo disparity to compute depth maps \(\mathcal {D}_t\). As we would typically expect large vertical disparities in areas of low texture or for de-calibrated stereo endoscopes, we use this parallax flow \(\mathcal {F}_t'\) as input for the weight map estimation described in the following.
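To make the depth computation concrete, the following minimal sketch converts the horizontal parallax component into a depth map via the standard rectified-stereo pinhole relation and normalizes it by the maximum expected depth (see "Pose estimation"). All names and the clamping constants are our own assumptions, not taken from the published implementation.

```python
import torch

def depth_from_parallax(parallax_flow: torch.Tensor,
                        focal_px: float,
                        baseline_m: float,
                        max_depth_m: float) -> torch.Tensor:
    """Convert RAFT parallax flow (2, H, W) into a normalized depth map (H, W).

    Only the horizontal component is treated as stereo disparity; the full
    parallax flow (including its vertical component) is kept as an input cue
    for the weight-map networks.
    """
    disparity = parallax_flow[0].abs().clamp(min=1e-3)   # horizontal component
    depth = focal_px * baseline_m / disparity            # rectified pinhole stereo relation
    return (depth / max_depth_m).clamp(max=1.0)          # normalize by the expected maximum depth
```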

Pose estimation

In contrast to most existing VO methods, we estimate the camera motion exclusively from a geometric loss function, since photometric consistency is already accounted for by the optical flow estimation. We thus compute a 2D residual function based on a single depth map as,

$$\begin{aligned} r_{\text {2D}}(\textbf{p}_t,\textbf{x})&= \Big \Vert \big (\pi _{\text {2D}}\big (\exp (\textbf{p}_t)\,\pi _{\text {3D}}(\mathcal {D}_{t},\textbf{x})\big )\nonumber \\&\quad -\big (\textbf{x}+\mathcal {F}_t(\textbf{x})\big )\big ) \oslash (X, Y)^\top \Big \Vert _2 \,\, , \end{aligned}$$

(1)

where \(\pi _{\text {2D}}(\exp (\textbf{p}_t)\,\pi _{\text {3D}}(\mathcal {D}_{t},\textbf{x}))\) is the pixel location obtained by projecting the depth map under the relative camera pose \(\textbf{p}_t\) that aligns view \(\mathcal {I}^{(l)}_t\) to \(\mathcal {I}^{(l)}_{t-1}\). We scale these residuals by the image dimensions X and Y to make values independent of the image size. Note that we normalize depth maps by the maximum expected depth value, such that the rotation and translation components of \(\textbf{p}_t\) have the same order of magnitude and thus contribute equally to the residuals, which is important for a well-conditioned optimization. Ideally, the projected pixel position coincides with the position given by the optical flow when the observed scene is rigid and the flow and depth maps are correct.
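As an illustration, a minimal PyTorch sketch of this 2D residual is given below. The helper `se3_exp` (exponential map from \(\mathfrak {se}(3)\) to a \(4\times 4\) transform) and all variable names are our own assumptions; the published implementation may organize this differently.

```python
import torch

def residual_2d(p_t, depth_t, flow_t, K, grid_xy, img_size):
    """Per-pixel 2D residual of Eq. (1): reprojection vs. flow correspondence.

    p_t      : (6,) se(3) pose vector; se3_exp(p_t) is an assumed helper
               returning the 4x4 transform (e.g. from a Lie-group library).
    depth_t  : (H, W) depth map at time t
    flow_t   : (2, H, W) temporal optical flow from t to t-1
    K        : (3, 3) camera intrinsics
    grid_xy  : (2, H, W) pixel coordinate grid
    img_size : (X, Y) image width and height used for normalization
    """
    T = se3_exp(p_t)                                          # assumed helper
    # pi_3D: back-project pixels to 3D points using the depth map
    ones = torch.ones_like(depth_t)[None]
    pix_h = torch.cat([grid_xy, ones], dim=0).reshape(3, -1)
    pts = torch.linalg.inv(K) @ pix_h * depth_t.reshape(1, -1)
    pts_h = torch.cat([pts, torch.ones_like(pts[:1])], dim=0)
    # pi_2D: transform the points and project back to the image plane
    proj = K @ (T @ pts_h)[:3]
    proj_xy = proj[:2] / proj[2:].clamp(min=1e-6)
    # flow-based correspondence x + F_t(x), scaled by the image dimensions
    corr_xy = (grid_xy + flow_t).reshape(2, -1)
    scale = torch.tensor([[img_size[0]], [img_size[1]]], dtype=proj_xy.dtype)
    return ((proj_xy - corr_xy) / scale).norm(dim=0)          # (H*W,) residuals
```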

Table 1 Summary of used notation

While minimizing \(r_{\text {2D}}(\textbf{p}_t,\textbf{x})\) helps to reliably estimate the camera motion in rigid scenes, deformations are most effectively detected by examining the displacement of points in 3D space. To address this need, we propose to leverage the depth map at \(t-1\) and introduce a 3D residual function,

$$\begin{aligned} r_{\text {3D}}(\textbf{p}_t,\textbf{x})&= \Big \Vert \exp (\textbf{p}_t)\,\pi _{\text {3D}}\big (\mathcal {D}_t,\textbf{x}\big )\nonumber \\&\quad -\pi _{\text {3D}}\big (\mathcal {D}_{t-1},\textbf{x}+\mathcal {F}_t(\textbf{x})\big )\Big \Vert _2 \,\, , \end{aligned}$$
(2)

which measures the point-to-point alignment of the re-projected depth maps. As opposed to [2], we avoid using a point-to-plane distance as it is less constrained on planar surfaces such as organs (e.g., liver). While a known weakness of the point-to-point distance is its sensitivity to noise in regions with large perspectives, we mitigate this effect by combining 2D and 3D residuals. Intuitively, we expect \(r_{\text {2D}}(\textbf{p}_t,\textbf{x})\) to be most accurate when camera motion is large and \(r_{\text {3D}}(\textbf{p}_t,\textbf{x})\) when deformations are significant. Similar to [11], we use bilinear sampling to warp point clouds from \(\pi _{\text {3D}}\big (\mathcal {D}_{t-1},\textbf{x}\big )\) to \(\pi _{\text {3D}}\big (\mathcal {D}_{t-1},\textbf{x}+\mathcal {F}_t(\textbf{x})\big )\), using our optical flow estimates.
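For concreteness, a minimal sketch of this 3D residual, including the bilinear warping of the \(t-1\) point cloud, is shown below. As before, `se3_exp` and all names are our own assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def residual_3d(p_t, points_t, points_tm1, flow_t):
    """Per-pixel 3D point-to-point residual of Eq. (2).

    points_t   : (3, H, W) back-projected points pi_3D(D_t, x)
    points_tm1 : (3, H, W) back-projected points pi_3D(D_{t-1}, x)
    flow_t     : (2, H, W) temporal optical flow from t to t-1
    """
    _, H, W = points_t.shape
    T = se3_exp(p_t)                                               # assumed helper, 4x4
    pts_h = torch.cat([points_t.reshape(3, -1),
                       torch.ones(1, H * W)], dim=0)
    transformed = (T @ pts_h)[:3].reshape(3, H, W)

    # Warp the t-1 point cloud to x + F_t(x) via bilinear sampling
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xx + flow_t[0], yy + flow_t[1]], dim=-1)   # (H, W, 2) in pixels
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0              # normalize to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    warped = F.grid_sample(points_tm1[None], grid[None],
                           mode="bilinear", align_corners=True)[0]

    return (transformed - warped).norm(dim=0)                      # (H, W) residuals
```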

In contrast to a conventional scalar-weighted sum of residuals, we propose to weigh each residual using a dedicated weight map inferred from the image data. The final residual is computed as,

$$\begin{aligned} r(\textbf{p}_t,\textbf{x})= \omega _{\text {2D}}(\textbf{x})\,r_{\text {2D}}(\textbf{p}_t,\textbf{x}) + \omega _{\text {3D}}(\textbf{x})\,r_{\text {3D}}(\textbf{p}_t,\textbf{x}) \,\, , \end{aligned}$$
(3)

where \(\omega _{\text {2D}}(\textbf{x})\) and \(\omega _{\text {3D}}(\textbf{x})\) are the per-pixel weight maps for the 2D and 3D residuals, respectively.

At its core, our hypothesis is that we can learn weight maps that appropriately combine the contributions of the 2D and 3D residuals, even in challenging situations where tissue deformations take place. That is, the role of (\(\omega _{\text {2D}}(\textbf{x})\), \(\omega _{\text {3D}}(\textbf{x})\)) is to (1) shift focus across image regions depending on the context of tissue deformations, (2) favor the more reliable residual function (2D vs. 3D) given a motion pattern and (3) balance the scale of the two losses. In “Learning the weight maps” section, we detail how we learn a model to infer these weight maps.

Optimization: To compute the relative pose \(\textbf{p}_t^{\star } \in \mathfrak {se}(3)\), we then optimize,

$$\begin{aligned} \textbf{p}_t^{\star }=\underset{\textbf{p}_t}{{\text {arg min}}} \, \left\{ \sum _{\textbf{x}\in \varvec{\Omega }} r(\textbf{p}_t,\textbf{x})^2\right\} , \end{aligned}$$
(4)

as a Nonlinear Least-Squares (NLLS) problem. Here, \(\varvec{\Omega }\) is the set of all spatial image coordinates \(\textbf{x}\). We choose to optimize the pose in the Lie algebra vector space \(\mathfrak {se}(3)\) because it is a unique representation of the pose with the same number of parameters as degrees of freedom. NLLS problems are typically solved iteratively using a second-order optimizer; we use the quasi-Newton method L-BFGS [12] due to its fast convergence and computational efficiency. As in [10], we simply chain relative camera poses to obtain the full trajectory.
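The inner optimization can be sketched with PyTorch's built-in L-BFGS optimizer as follows; the closure form, iteration count, and line search are our own illustrative choices, not the authors' exact settings.

```python
import torch

def estimate_pose(weighted_residual_fn, p_init=None, iters=20):
    """Solve Eq. (4): minimize the sum of squared weighted residuals over p_t.

    weighted_residual_fn : callable mapping a (6,) se(3) vector to the
                           per-pixel weighted residuals r(p_t, x).
    """
    p = torch.zeros(6) if p_init is None else p_init.clone().detach()
    p.requires_grad_(True)
    opt = torch.optim.LBFGS([p], max_iter=iters, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = (weighted_residual_fn(p) ** 2).sum()   # objective of Eq. (4)
        loss.backward()
        return loss

    opt.step(closure)
    return p.detach()                                 # relative pose p_t* in se(3)
```

The full trajectory is then recovered by chaining the exponentials of the successive relative poses.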

Learning the weight maps

In Eq. (3), we propose to learn the residual weight maps \(\omega _{\text {2D}}(\textbf{x})\) and \(\omega _{\text {3D}}(\textbf{x})\), as determining them by hand is not trivial. To this end, we train a separate encoder–decoder network, denoted by \(g(\cdot )\), for each weight map. The inputs to these networks are all the elements used to compute the residuals,

$$\begin{aligned} \omega _{\text {2D}}(\textbf{x}) =&g\big (\textbf{x},\mathcal {I}^{(l)}_t, \mathcal {D}_t, \mathcal {F}_t, \mathcal {F}'_t, \varvec{\theta }_{\text {2D}}\big ), \end{aligned}$$
(5)
$$\begin{aligned} \omega _{\text {3D}}(\textbf{x}) =&g\big (\textbf{x}, \mathcal {I}^{(l)}_t, \mathcal {D}_t, \mathcal {F}_t, \mathcal {F}'_t, \mathcal {I}^{(l)}_{t-1}, \mathcal {D}_{t-1},\mathcal {F}'_{t-1}, \varvec{\theta }_{\text {3D}}\big ) \, , \end{aligned}$$
(6)

where \(\varvec{\theta }_{\text {2D}}\) and \(\varvec{\theta }_{\text {3D}}\) are the network parameters learned at training time. For \(g(\cdot )\), we employ a 3-layer UNet [13] with a sigmoid output activation to ensure outputs in [0, 1].
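A minimal sketch of such a weight-map network is shown below, with three resolution levels and a sigmoid head. Channel widths, skip connections, and other architectural details are our own assumptions; the input is the channel-wise concatenation of the quantities in Eq. (5) or Eq. (6).

```python
import torch
import torch.nn as nn

class WeightMapNet(nn.Module):
    """Small encoder-decoder g(.) with a Sigmoid head, a sketch of the
    UNet used for the weight maps (channel sizes are our own choice)."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.enc1 = self._block(in_channels, 16)
        self.enc2 = self._block(16, 32)
        self.enc3 = self._block(32, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = self._block(64, 32)
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = self._block(32, 16)
        self.head = nn.Sequential(nn.Conv2d(16, 1, 1), nn.Sigmoid())
        self.pool = nn.MaxPool2d(2)

    @staticmethod
    def _block(cin, cout):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):                       # x: concatenated inputs of Eq. (5)/(6)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                    # per-pixel weights in [0, 1]
```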

To train \(g(\cdot )\), we aim to learn weight maps that lead to improved pose estimation by minimizing the \(\ell ^1\) supervised training loss,

$$\begin{aligned} \mathcal {L}_{\text {train}}=\Vert \textbf{p}^{\star }_{t}-\textbf{p}^{(\text {gt})}_t\Vert _1, \end{aligned}$$
(7)

where \(\textbf{p}^{(\text {gt})}_t\) is the ground-truth pose. Because the pose optimization in Eq. (4) is not directly differentiable, we leverage a DDN [9] to enable end-to-end learning. This approach uses implicit differentiation of Eq. (4) to compute the gradients of \(\mathcal {L}_{\text {train}}\) with respect to the weight map parameters \((\varvec{\theta }_{\text {2D}}\), \(\varvec{\theta }_{\text {3D}})\). The only requirements are that (1) the optimized function \(\sum _{\textbf{x}\in \varvec{\Omega }} r(\textbf{p}_t,\textbf{x})^2\) is twice differentiable and (2) the forward pass finds a local or global minimum.
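To illustrate the mechanism, the sketch below computes the implicit gradient of the training loss with respect to the weight maps via the implicit function theorem, assuming the forward pass returned a minimizer with an invertible Hessian. It is a condensed illustration written with plain autograd; the actual implementation relies on the DDN framework [9] and may differ.

```python
import torch

def ddn_backward(p_star, weights, objective_fn, grad_loss_p):
    """Implicit gradient dL_train/d(weights) through the arg-min of Eq. (4).

    objective_fn(p, w) : inner objective sum_x r(p, x; w)^2
    p_star             : (6,) minimizer found in the forward pass
    grad_loss_p        : (6,) gradient dL_train/dp_star
    """
    p = p_star.clone().requires_grad_(True)
    w = weights.clone().requires_grad_(True)

    f = objective_fn(p, w)
    grad_p, = torch.autograd.grad(f, p, create_graph=True)      # df/dp at p*

    # Hessian H = d^2f/dp^2 (6x6), built row by row
    H = torch.stack([torch.autograd.grad(grad_p[i], p, retain_graph=True)[0]
                     for i in range(6)])

    # Implicit function theorem: dp*/dw = -H^{-1} d^2f/(dp dw), hence
    # dL/dw = -(H^{-1} dL/dp*)^T d^2f/(dp dw), evaluated as a vector-Jacobian product.
    v = torch.linalg.solve(H, grad_loss_p)
    grad_w, = torch.autograd.grad(grad_p @ v, w)
    return -grad_w
```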

Experiments

Datasets

We evaluate our method on two separate stereo video datasets: one containing rigid MIS scenes and another containing non-rigid scenes:

SCARED dataset [14]: consists of 9 in vivo porcine subjects with 4 sequences for each subject. The dataset contains a video stream captured using a da Vinci Xi surgical robot and camera forward kinematics. All sequences show rigid scenes without breathing motion or surgical instruments. We split the dataset into training (d2, d3, d6, d7) and testing sequences (d1, d8, and d9) where we exclude d4 and d5 due to bad camera calibrations.

StereoMIS: Additionally, we introduce a new in vivo dataset also recorded using a da Vinci Xi surgical robot. Similarly to [14], ground-truth camera poses are generated from the endoscope forward kinematics and synchronized with the video feed. While we expect errors in the absolute camera pose due to accumulated errors in the forward kinematics, relative camera motions are expected to be accurate. It consists of 3 porcine (P1, P2, and P3) and 3 human subjects (H1, H2, and H3) with a total of 16 recorded sequences. We denote sequences with Px_y where Px is the subject and y the sequence number. Sequence durations range from 50 s to 30 min. They contain challenging scenes with breathing motions, tissue deformations, resections, bleeding, and presence of smoke. We assign P1 and H1 to the training set and the rest is kept for testing.

To provide a finer-grained performance characterization of the methods on this data, we extract from each video a number of short sequences that visually depict one of several possible settings:

  • breathing: only depicts breathing deformations and contains no camera or tool motion,

  • scanning: includes camera motion in addition to breathing deformations,

  • deforming: comprises tissue deformations due to breathing and manipulation or resection of tissue, while the camera is static.

In practice, we select 88 distinct, non-overlapping sequences of 150 frames each from P2, P3, H2, H3 and assign each of them to one of the above categories or surgical scenarios (see supplementary material for more information).

Implementation details

Segmentation of surgical instruments

For all experiments, we mask out surgical instrument pixels by setting corresponding residuals to 0. To do this, we use the DeepLabv3+ architecture [15] trained on the EndoVis2018 segmentation dataset [16] to generate instrument masks for each frame. Additionally, we mask out specularities, by means of maximum intensity detection, as they cause optical flow estimations to be ill-defined.
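As a simple illustration of the specularity masking, the sketch below flags pixels whose maximum channel intensity exceeds a threshold; the threshold value and function name are our own choices. Residuals at masked pixels (instrument or specular) are set to 0 before the optimization.

```python
import torch

def specularity_mask(image: torch.Tensor, threshold: float = 0.98) -> torch.Tensor:
    """Boolean mask of specular highlights via maximum-intensity detection.

    image : (3, H, W) RGB image with values in [0, 1].
    Returns True where the pixel is considered specular (residuals zeroed).
    """
    intensity = image.max(dim=0).values   # per-pixel maximum over color channels
    return intensity >= threshold         # threshold is an illustrative choice
```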

Training and inference

First, we classify all training frames from the SCARED and StereoMIS training sequences into "moving" and "static" based on the camera forward kinematics. We then randomly sample 4000 frames from each sequence, keeping a balance between moving and static frames. For each sampled frame, we generate a sample pair with an interval of 1 to 5 frames. We use the forward kinematics of the camera as the ground-truth pose change between the two frames of a sample pair. Note that the forward kinematics entail minor deviations that may propagate during training. We randomly assign \(80\%\) of the sample pairs to the training set and \(20\%\) to the validation set.
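A minimal sketch of how such a ground-truth relative pose can be derived from two forward-kinematics poses is given below; `se3_log` denotes an assumed Lie-group log-map helper, and the convention (camera-to-world transforms, motion expressed in the \(t-1\) frame) is our assumption.

```python
import torch

def relative_gt_pose(T_tm1: torch.Tensor, T_t: torch.Tensor) -> torch.Tensor:
    """Ground-truth relative pose p_t^(gt) for a sample pair (see Eq. (7)).

    T_tm1, T_t : (4, 4) camera-to-world transforms from the forward kinematics.
    """
    T_rel = torch.linalg.inv(T_tm1) @ T_t   # camera motion expressed in the t-1 frame
    return se3_log(T_rel)                   # assumed helper: 4x4 transform -> (6,) se(3) vector
```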

For all experiments, we resize images to half resolution (512 × 640 pixels). We use a batch size of 8 and the Adam optimizer with learning rate \(10^{-5}\). We train for 200 epochs and perform early stopping on the validation loss. We implement our method in PyTorch and train on an NVIDIA RTX 3090 GPU, reaching 6.5 frames per second at test time. RAFT is trained on the FlyingThings3D dataset, and we do not perform any fine-tuning.

Metrics and baseline methods

We use trajectory error metrics as defined in [17], namely the absolute trajectory error ATE-RMSE to evaluate the overall shape of the trajectory and the relative pose errors, RPE-trans and RPE-rot, to evaluate relative pose changes from frame to frame. The ATE-RMSE is sensitive to drift and depends on the length of the sequence, whereas the RPE measures the average performance for frame-to-frame pose estimation.
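For illustration, a minimal sketch of the ATE-RMSE computation after a rigid (rotation plus translation, no scale) alignment of the estimated trajectory to the ground truth is shown below; the reference evaluation in [17] may differ in details such as time association and alignment options.

```python
import numpy as np

def ate_rmse(est_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """ATE-RMSE in the units of the input positions (here mm).

    est_xyz, gt_xyz : (N, 3) estimated and ground-truth camera positions,
                      assumed to be time-associated one-to-one.
    """
    mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)                        # Kabsch-style rigid alignment
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = Vt.T @ S @ U.T
    aligned = (R @ E.T).T + mu_g
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))
```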

As no stereo SLAM method dedicated to MIS has open-source code or is evaluated on a public dataset with trajectory ground truth, we compare our method to two general-purpose state-of-the-art rigid SLAM methods that include loop closure and rely on the rigid-scene assumption:

  • ORB-SLAM2 [1], a sparse SLAM method that leverages bundle adjustment to compensate for drift,

  • ElasticFusion [2], a dense SLAM method and, as such, closer to our proposed approach.

In addition, we compare our method to [8] on the frames of the SCARED dataset for which they reported performances. For fair comparison, we use the same input depth maps for all methods.

Results

Surgical scenarios and ablation study: Table 2 reports the performance of our approach on the StereoMIS surgical scenarios. To show the importance of learning the weight maps, we perform an ablation study where we evaluate the impact of (1) constant weights, denoted by ours (w/o weight), where \(\omega _{\text {2D}}(\textbf{x})=\omega _{\text {3D}}(\textbf{x})=1\) for each \(\textbf{x}\); (2) our method with only 2D residuals, denoted by ours (only 2D); and (3) using only 3D residuals, denoted by ours (only 3D).

Table 2 The ATE-RMSE (mean±std mm) for the different scenarios from the StereoMIS dataset with average over sequences (microavg.) and scenarios (macroavg.)

Our proposed method outperforms the baselines in all scenarios. Improvements in breathing and scanning are partly due to correct identification of errors in the optical flow and depth estimation, as well as optimal balancing of the 2D and 3D residuals. Indeed, exploiting the complementary properties of the 2D and 3D residuals improves the average performance. The fact that ours (only 3D) outperforms ours (only 2D) in breathing and deforming supports our intuition that tissue deformations are easier to learn from the 3D residuals. Conversely, in scanning, where the optical flow is dominated by the camera motion, the 2D residuals lead to a more accurate pose estimation.

In general, it is not possible to detect or completely compensate the breathing motion on a frame-to-frame basis in our proposed optimization scheme as we cannot completely disambiguate the camera and tissue motion. However, the method learns which regions are more affected by breathing deformations and consequently assigns a smaller weight to those regions.

The weight maps in Fig. 2 (see breathing rows) support our claims: they have low values in the dark regions (A), where we expect the optical flow to be inaccurate, and where the tissue moves most (B). The scanning example also illustrates that the weight maps respond differently depending on the motion pattern and deformation. Note that the presence of surgical instruments has no influence on the weight maps in scanning, as no tissue interaction takes place. As expected, the largest improvements can be seen in the deforming scenario. Inspecting the last two rows in Fig. 2 reveals that regions where the instruments deform tissue (C) are correctly ignored in the pose estimation. Similarly, the region occluded by smoke (D) has low values in the weight maps. Additionally, we observe that \(\omega _{\text {2D}}\) usually has a magnitude roughly 100 times larger than \(\omega _{\text {3D}}\), compensating for the different scales of the 2D and 3D residuals.

Fig. 2

Exemplary results for 5 different scenarios. Surgical instruments and specularities are masked out. From left to right: left image, its depth map, its optical flow displacement as well as its weights \(\omega _{\text {2D}}(\textbf{x})\) and \(\omega _{\text {3D}}(\textbf{x})\). Weight maps are normalized (lowest value in dark blue and highest value in yellow). Best viewed in color

Results on full test StereoMIS sequences: Table 3 shows the pose estimation performance on the complete sequences of the StereoMIS test set. As the sequences are much longer than in the scenario experiment, accumulated drift results in a large ATE-RMSE for all methods. Even though our frame-to-frame approach does not include any bundle adjustment or regularization over time, it still has the lowest ATE-RMSE on average. The reason for this good performance is reflected in the relative metrics RPE-trans and RPE-rot, where our method outperforms all others by almost a factor of three and five, respectively. Our method robustly estimates the pose in challenging situations, whereas ORB-SLAM2 fails in two sequences (H2_0, P2_5). Figure 3 shows two example trajectories. P2_7 does not include any tool–tissue interactions and consists of smooth camera motions. Its trajectory illustrates the drift of our method, which results in an ATE-RMSE of 9.28 mm versus 3.76 mm for ORB-SLAM2. On the other hand, P3_0 contains strong tissue deformations and abrupt camera movements. Despite visible drift, our method is able to follow the abrupt movements. The small-scale oscillations in the trajectories are due to breathing motion. The trajectories of all test sequences and evaluation results excluding frames where the SLAM methods fail can be found in the supplementary materials.

Table 3 Pose estimation results on full StereoMIS test sequences for ORB-SLAM2 [1], ElasticFusion [2], and ours. Metrics are reported in (mean±std) when applicable
Fig. 3

Two example trajectories of the StereoMIS test set. Top: trajectory of P2_7. Bottom: trajectory of P3_0

Results on SCARED dataset: Wei et al. reported ATE-RMSE results on rigid surgical scenes of the SCARED dataset using a frame-to-model approach [8]. For a fair comparison, we extend our method to SLAM by incorporating a surfel map model, denoted by ours (frame2model), which is equivalent to that used in [8]. Specifically, we replace the input images \(\mathcal {I}^{(l)}_{t-1}, \mathcal {I}^{(r)}_{t-1}\) with images rendered from the surfel map. Note that we can only adopt this approach for the SCARED dataset, as the surfel map model assumes scene rigidity. Results are provided in Table 4.

Table 4 The ATE-RMSE in mm for SCARED sequences reported by [8] and microaverage over all SCARED test sequences (SCARED avg.) using surfel maps

Conclusion

We proposed a visual odometry method for robust pose estimation in the challenging context of endoscopic surgery. To do so, we learn adaptive weight maps for two geometric residuals to improve pose estimation performance on common surgical scenes, including breathing motion and tissue deformations. Through a performance analysis of common scenarios, we observed the complementary behavior of the 2D and 3D residuals and the strong contribution of their pixel-level weighting. This results in better performance than state-of-the-art methods, on average and in the most challenging cases. We believe that our contributions are beneficial for some SLAM components, e.g., map building, and therefore for downstream applications in MIS. Future work will focus on drift and breathing compensation.

Supplementary information Appendix A: details on StereoMIS. Appendix B: trajectories of the StereoMIS test set. Appendix C: results on full StereoMIS sequences. Appendix D: trajectories of the SCARED test set.