Learning how to robustly estimate camera pose in endoscopic videos

Purpose Surgical scene understanding plays a critical role in the technology stack of tomorrow’s intervention-assisting systems in endoscopic surgeries. For this, tracking the endoscope pose is a key component, but remains challenging due to illumination conditions, deforming tissues and the breathing motion of organs. Method We propose a solution for stereo endoscopes that estimates depth and optical flow to minimize two geometric losses for camera pose estimation. Most importantly, we introduce two learned adaptive per-pixel weight mappings that balance contributions according to the input image content. To do so, we train a Deep Declarative Network to take advantage of the expressiveness of deep learning and the robustness of a novel geometric-based optimization approach. We validate our approach on the publicly available SCARED dataset and introduce a new in vivo dataset, StereoMIS, which includes a wider spectrum of typically observed surgical settings. Results Our method outperforms state-of-the-art methods on average and, more importantly, in difficult scenarios where tissue deformations and breathing motion are visible. We observed that our proposed weight mappings attenuate the contribution of pixels in ambiguous regions of the images, such as deforming tissues. Conclusion We demonstrate the effectiveness of our solution to robustly estimate the camera pose in challenging endoscopic surgical scenes. Our contributions can be used to improve related tasks like simultaneous localization and mapping (SLAM) or 3D reconstruction, therefore advancing surgical scene understanding in minimally invasive surgery. Supplementary Information The online version contains supplementary material available at 10.1007/s11548-023-02919-w.


Introduction
Camera pose estimation is a well-established computer vision problem at the core of numerous applications of medical robotic systems for minimally invasive surgery (MIS). With a variety of methods proposed in recent years, most approaches have focused on Simultaneous Localization and Mapping (SLAM) and Visual Odometry (VO) frameworks to solve the pose estimation problem. Well-established techniques such as ORB-SLAM2 [1] and ElasticFusion [2] have shown great promise in rigid scenes. More recently, non-rigid cases in MIS using monocular [3][4][5] and stereoscopic cameras [6][7][8] have also been studied. Yet to this day, pose estimation in typical MIS settings remains particularly difficult due to deformations caused by instruments and breathing, self- or instrument-based occlusions, textureless surfaces and tissue specularities.
In this work, we tackle the problem of pose estimation in such difficult cases when using a stereo endoscopic camera system. This allows depth to be computed from parallax, which has been shown to improve the robustness of SLAM methods [1]. In contrast to [6,7], which assume the tissue is smooth and locally rigid, respectively, we avoid making assumptions on tissue deformation and topology. Instead, we propose a dense stereo VO framework that handles tissue deformations and the complexity of surgical scenes. To do this, our approach leverages geometric pose optimization by inferring where to look in the scene. At its core, our method uses a Deep Declarative Network (DDN) [9] to enable back-propagation of gradients through the pose optimization.
More specifically, we propose to integrate two adaptive weight maps whose role is to balance the contribution of two geometric losses and the contribution of each pixel towards each of these losses. We learn these adaptive weight maps using a DDN with the goal of solving the pose estimation problem, inspired by the recent DiffPoseNet [10] approach. Similar to theirs, our method exploits the expressiveness of neural networks while leveraging the robustness of the geometric optimization approach. This allows our method to adapt the contribution of each image region depending on the image content, within each loss but also between the two losses. We thoroughly evaluate our approach by characterizing its performance in comparison to the state-of-the-art on various practical scenarios: rigid scenes, breathing, scanning and deforming sequences. This validation is performed on two different datasets, and we show that our method allows for more precise pose estimation in a wide range of cases.

Method
In the following, we present our depth-based pose estimation approach from an optimization perspective. We first derive our method in terms of context-specific adaptive weight maps in the pose estimation optimization, and then show how to learn these from data in an end-to-end way using DDNs to facilitate differentiation [9]. Our proposed method is illustrated in Fig. 1 and we provide a notation overview in Tab. 1. For all images in an image sequence, we first employ RAFT [11] to establish correspondences between frames in both the stereo and temporal domains. This model allows disparity and optical flow estimation to be computed simultaneously, based on the observation that both share similar constraints on relative pixel displacements. From these estimates, we extract the horizontal component of the parallax flow F_t at time t as the stereo disparity to compute depth maps D_t. As we would typically expect large vertical disparities in areas of low texture or for de-calibrated stereo endoscopes, we use this parallax flow F_t as input for the weight map estimation described in the following.
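The conversion from the horizontal parallax component to a depth map follows the standard pinhole stereo relation depth = f·b/disparity. The following is a minimal sketch of this step; the function name, array layout and the small epsilon guard are our own assumptions, as the paper does not specify implementation details:

```python
import numpy as np

def depth_from_parallax(parallax_flow, focal_px, baseline_m, eps=1e-6):
    """Convert the horizontal component of the stereo parallax flow into a
    depth map via the pinhole relation depth = f * b / disparity.

    parallax_flow: (H, W, 2) left-to-right flow of a rectified stereo pair;
    its x-component is the (signed) stereo disparity.
    """
    disparity = np.abs(parallax_flow[..., 0])          # horizontal component
    depth = focal_px * baseline_m / np.maximum(disparity, eps)
    return depth
```

For a rectified pair, the vertical flow component should be near zero; large vertical values flag low-texture regions or de-calibration, which is why the full parallax flow (not just the disparity) is fed to the weight-map networks.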

Pose estimation
In contrast to most existing VO methods, we estimate the camera motion exclusively from a geometric loss function, since photometric consistency is already entailed in the optical flow estimation. We thus compute a 2D residual function based on a single depth map by

r_2D(p_t, x) = π_2D(exp(p_t) π_3D(D_t, x)) − (x + O_t(x)),   (1)

where π_2D(exp(p_t) π_3D(D_t, x)) is the pixel location based on depth projection and the relative camera pose p_t that aligns views I^(l)_t and I^(l)_{t−1}, and O_t denotes the temporal optical flow. We scale these residuals by the image dimensions X and Y to make values independent of the image size. Note that we normalize depth maps by the maximum expected depth value, such that the rotation and translation components of p_t have the same order of magnitude and thus contribute equally to the residuals, which is important for a well-conditioned optimization.

Fig. 1 Overview of our proposed VO framework, which can be sub-divided into three parts (from left to right): (1) optical flow and depth are computed using RAFT [11], (2) weight maps are computed and used in (3) to estimate the camera pose p_t. The weight maps (ω_2D(x), ω_3D(x)) are learned via back-propagation using the Deep Declarative Network (DDN) [9].

Table 1 Summary of used notation:
x — pixel index in 2D Cartesian coordinate system
p_t ∈ se(3) ⊂ R^6 — relative pose from t to t−1 in Lie algebra space
p*_t ∈ se(3) ⊂ R^6 — relative pose solution in Lie algebra space
exp(p) : se(3) → SE(3) — matrix exponential from Lie algebra to Lie group
ω_3D(x) — learned per-pixel weight for 3D residuals

Ideally, a projected pixel position coincides with the optical flow when the observed scene is rigid and when flow and depth maps are correct. While minimizing r_2D(p_t, x) helps to reliably estimate the camera motion in rigid scenes, deformations are detected most effectively by looking at the displacement of points in 3D space. To address this need, we leverage the depth map at t−1 and introduce a 3D residual function by

r_3D(p_t, x) = exp(p_t) π_3D(D_t, x) − π_3D(D_{t−1}, x + O_t(x)),   (2)

which measures the point-to-point alignment of the re-projected depth maps. As opposed to [2], we avoid using a point-to-plane distance as it is less constrained on planar surfaces such as organs (e.g., liver). While a known weakness of the point-to-point distance is its sensitivity to noise in regions with strong perspective, we mitigate this effect by combining 2D and 3D residuals. Intuitively, we expect r_2D(p_t, x) to be most accurate when camera motion is large and r_3D(p_t, x) when deformations are significant. Similar to [11], we use bilinear sampling to warp point clouds from t−1 using our optical flow estimates.
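As a rough illustration of the two residuals, the sketch below evaluates them at sparse pixel locations. It is a simplification of the paper's formulation: it takes an SE(3) matrix directly instead of the Lie algebra parameterization, assumes a rectified pinhole camera with intrinsics K, and assumes the depth map at t−1 has already been bilinearly sampled at the flow correspondences.

```python
import numpy as np

def project(K, pts):           # pi_2D: 3D camera points -> pixel coordinates
    uv = (K @ pts.T).T
    return uv[:, :2] / uv[:, 2:3]

def backproject(K, x, depth):  # pi_3D: pixels + depth -> 3D camera points
    ones = np.ones((x.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([x, ones]).T).T
    return rays * depth[:, None]

def residuals(T, K, x, depth_t, flow_t, depth_tm1):
    """2D residual: reprojection vs. optical-flow correspondence (cf. Eq. 1).
    3D residual: point-to-point distance of warped point clouds (cf. Eq. 2).
    T is the SE(3) matrix exp(p_t); depth_tm1 holds depth values already
    sampled at the correspondences x + flow_t."""
    P_t = backproject(K, x, depth_t)                 # points observed at t
    P_warp = (T[:3, :3] @ P_t.T).T + T[:3, 3]        # transformed by exp(p_t)
    r2d = project(K, P_warp) - (x + flow_t)          # pixel misalignment
    P_tm1 = backproject(K, x + flow_t, depth_tm1)    # corresponding cloud at t-1
    r3d = P_warp - P_tm1                             # 3D misalignment
    return r2d, r3d
```

With a perfectly rigid scene, correct flow and correct depth, both residuals vanish at the true relative pose, which is the fixed point the optimization seeks.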
In contrast to a conventional scalar-weighted sum of residuals, we propose to weigh each residual using a dedicated weight map that is inferred from the image data. The final residual is computed as

r(p_t, x)^2 = ω_2D(x) ‖r_2D(p_t, x)‖^2 + ω_3D(x) ‖r_3D(p_t, x)‖^2,   (3)

where ω_2D(x) and ω_3D(x) are the per-pixel weight maps for the 2D and 3D residuals, respectively.
At its core, our hypothesis is that we can learn how the weight maps should combine the contributions of both residuals, even in challenging situations where tissue deformations are taking place. That is, the role of (ω_2D(x), ω_3D(x)) is to (1) weigh relative focus based on the context of tissue deformations, (2) weigh the reliability of the residual functions (2D vs. 3D) given a motion pattern and (3) balance the scale of the losses. In Sec. 2.2, we detail how we learn a model to infer these weight maps.
Optimization: To compute the relative pose p*_t ∈ se(3), we then optimize

p*_t = argmin_{p_t} Σ_{x∈Ω} r(p_t, x)^2,   (4)

a Non-Linear Least-Squares (NLLS) problem. Here, Ω is the set containing all spatial image coordinates x. We choose to optimize the pose in the Lie algebra vector space se(3) because it is a unique representation of the pose and has the same number of parameters as degrees of freedom. NLLS problems are typically solved iteratively using a second-order optimizer. To do this, we use the quasi-Newton method L-BFGS [12] due to its fast convergence properties and computational efficiency. Identical to [10], we simply chain relative camera poses to obtain the full trajectory.

Learning the weight maps
In Eq. (3), we propose to learn the residual weight maps ω_2D(x) and ω_3D(x), as determining these otherwise is not trivial. To this end, we train a separate encoder-decoder network, denoted by g(·), for each weight map. The inputs to these networks are all the elements used to compute the residuals, and θ_2D and θ_3D are the network parameters that are learned at training time. For g(·), we employ a 3-layer UNet [13] with a Sigmoid activation function to ensure outputs in [0, 1].
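A compact stand-in for g(·) is sketched below. It keeps the UNet ingredients named in the text (encoder-decoder with a skip connection, Sigmoid output in [0, 1]), but the channel widths, depth and input stacking are our own assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class WeightMapNet(nn.Module):
    """Small encoder-decoder g(.) predicting a per-pixel weight map in [0, 1].
    A simplified stand-in for the paper's 3-layer UNet [13]."""
    def __init__(self, in_ch):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 3, padding=1)   # 32 = 16 dec + 16 skip

    def forward(self, x):
        e1 = self.enc1(x)                 # full-resolution features
        e2 = self.enc2(e1)                # downsampled bottleneck
        d = self.dec(e2)                  # upsample back to input size
        d = torch.cat([d, e1], dim=1)     # UNet skip connection
        return torch.sigmoid(self.head(d))  # weights constrained to [0, 1]
```

Two such networks are trained, one per weight map, each fed with the stacked images, depth and flow estimates.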
To train g(·), we aim to learn weight maps that lead to improved pose estimation by minimizing the ℓ1 supervised training loss

L_train = ‖p*_t − p^(gt)_t‖_1,   (5)

where p^(gt)_t is the ground-truth pose. Because the pose optimization in Eq. (4) is not directly differentiable, we leverage a DDN [9] to enable end-to-end learning. This approach uses implicit differentiation of Eq. (4) to compute the gradients of L_train with respect to the weight-map parameters (θ_2D, θ_3D). Therefore, the only requirements are that (1) the function to be optimized, Σ_{x∈Ω} r(p_t, x)^2, is twice differentiable and (2) we find a local or global minimum in the forward pass.
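The implicit gradient used by DDNs follows from the stationarity of the solved optimization. Writing f(p, θ) = Σ_{x∈Ω} r(p, x)^2 for the objective and θ for the weight-map parameters, the standard implicit-function-theorem result reads:

```latex
% Stationarity of the declarative node's forward pass:
%   p^\ast(\theta) = \arg\min_{p} f(p, \theta)
%   \;\Rightarrow\; \frac{\partial f}{\partial p}\big(p^\ast(\theta), \theta\big) = 0.
% Differentiating this identity with respect to \theta gives
\frac{\mathrm{d}p^\ast}{\mathrm{d}\theta}
  = -\left(\frac{\partial^2 f}{\partial p^2}\right)^{-1}
    \frac{\partial^2 f}{\partial p\,\partial\theta},
\qquad
\frac{\mathrm{d}\mathcal{L}_{\mathrm{train}}}{\mathrm{d}\theta}
  = \frac{\partial \mathcal{L}_{\mathrm{train}}}{\partial p^\ast}\,
    \frac{\mathrm{d}p^\ast}{\mathrm{d}\theta}.
```

This makes the two requirements explicit: twice differentiability is needed for the Hessian term, and the forward pass must actually reach a (local) minimum for the stationarity condition to hold.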

Datasets
We evaluate our method on two separate stereo video datasets: one containing rigid MIS scenes and another containing non-rigid scenes. SCARED dataset [14]: consists of 9 in-vivo porcine subjects with 4 sequences per subject. The dataset contains a video stream captured using a da Vinci Xi surgical robot together with the camera forward kinematics. All sequences show rigid scenes without breathing motion or surgical instruments. We split the dataset into training (d2, d3, d6, d7) and testing sequences (d1, d8 and d9), where we exclude d4 and d5 due to bad camera calibrations.
StereoMIS: Additionally, we introduce a new in-vivo dataset, also recorded using a da Vinci Xi surgical robot. Similar to [14], ground-truth camera poses are generated from the endoscope forward kinematics and synchronized with the video feed. While we expect errors in the absolute camera pose due to accumulated errors in the forward kinematics, relative camera motions are expected to be accurate. The dataset consists of 3 porcine (P1, P2 and P3) and 3 human subjects (H1, H2 and H3) with a total of 16 recorded sequences. We denote sequences with Px y, where Px is the subject and y the sequence number. Sequence durations range from 50 seconds to 30 minutes. The sequences contain challenging scenes with breathing motion, tissue deformations, resections, bleeding and presence of smoke. We assign P1 and H1 to the training set; the rest is kept for testing.
To provide a finer-grained performance characterization of methods with this data, we extract from each video a number of short sequences that visually depict one of several possible settings: • breathing: only depicts breathing deformations and contains no camera or tool motion, • scanning: includes camera motion in addition to breathing deformations, • deforming: comprises tissue deformations due to breathing and manipulation or resection of tissue while the camera is static.
In practice, we select 88 distinct, non-overlapping, 150-frame-long sequences from P2, P3, H2 and H3, and assign each to one of the above categories or surgical scenarios (see supplementary material for more information).

Segmentation of surgical instruments
For all experiments, we mask out surgical instrument pixels by setting the corresponding residuals to 0. To do this, we use the DeepLabv3+ architecture [15], trained on the EndoVis2018 segmentation dataset [16], to generate instrument masks for each frame. Additionally, we mask out specularities by means of maximum-intensity detection, as they cause optical flow estimates to be ill-defined.
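Maximum-intensity detection for specularities can be as simple as thresholding near-saturated pixels. The sketch below illustrates this; the threshold value and image normalization are assumptions, since the paper only names the detection principle:

```python
import numpy as np

def specularity_mask(img, thresh=0.98):
    """Mask near-saturated pixels as specular highlights.

    img: float image in [0, 1], either (H, W) grayscale or (H, W, 3) color.
    Returns a boolean mask; residuals at True pixels would be set to 0.
    The threshold is an illustrative assumption."""
    intensity = img.max(axis=-1) if img.ndim == 3 else img
    return intensity >= thresh
```

The resulting mask is combined with the instrument segmentation so that neither specular nor tool pixels contribute to the pose residuals.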

Training and inference
First, we classify all training frames from the SCARED and StereoMIS training sequences into "moving" and "static" based on the camera forward kinematics.
We then randomly sample 4,000 frames from each sequence, keeping a balance between moving and static frames. For each sampled frame, we generate a sample pair with an interval of 1 to 5 frames. We use the forward kinematics of the camera as the ground-truth pose change between the two frames in a sample pair. Note that the forward kinematics entail minor deviations that may propagate during training. We randomly assign 80% of the sample pairs to the training set and 20% to the validation set.
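The pair-sampling procedure can be sketched as follows; the function name and seeding are our own, and the moving/static balancing step is omitted for brevity:

```python
import random

def make_sample_pairs(frame_ids, n_samples, max_interval=5, seed=0):
    """Draw frame pairs (t, t+k) with k in 1..max_interval, then split them
    80/20 into training and validation sets, mirroring the procedure in the
    text. Balancing of moving vs. static frames is not shown here."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        t = rng.randrange(0, len(frame_ids) - max_interval)
        k = rng.randint(1, max_interval)          # temporal interval 1..5
        pairs.append((frame_ids[t], frame_ids[t + k]))
    rng.shuffle(pairs)
    split = int(0.8 * len(pairs))
    return pairs[:split], pairs[split:]           # 80% train / 20% validation
```

Varying the interval exposes the network to both small and moderate ground-truth pose changes between the two frames of a pair.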
For all experiments, we resize images to half resolution (512×640 pixels). We use a batch size of 8 and the Adam optimizer with a learning rate of 10^−5. We train for 200 epochs and perform early stopping on the validation loss. We implement our method using PyTorch and train on an NVIDIA RTX3090 GPU. We reach 6.5 frames per second at test time. RAFT is trained on the FlyingThings3D dataset and we do not perform any fine-tuning.

Metrics and baseline methods
We use the trajectory error metrics defined in [17], namely the absolute trajectory error (ATE-RMSE) to evaluate the overall shape of the trajectory, and the relative pose errors (RPE-trans and RPE-rot) to evaluate relative pose changes from frame to frame. The ATE-RMSE is sensitive to drift and depends on the length of the sequence, whereas the RPE measures the average performance of frame-to-frame pose estimation.
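For concreteness, the two translational metrics can be sketched as below on trajectories given as arrays of 3D camera positions. Note that [17] additionally rigidly aligns the trajectories before computing the ATE; here they are assumed pre-aligned, and the rotational RPE is omitted:

```python
import numpy as np

def ate_rmse(traj_est, traj_gt):
    """Absolute trajectory error: RMSE over per-frame 3D position errors.
    Trajectories are (N, 3) arrays, assumed already aligned."""
    err = np.linalg.norm(traj_est - traj_gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def rpe_trans(traj_est, traj_gt, delta=1):
    """Relative (translational) pose error: mean error of frame-to-frame
    displacements, insensitive to accumulated drift."""
    d_est = traj_est[delta:] - traj_est[:-delta]
    d_gt = traj_gt[delta:] - traj_gt[:-delta]
    return float(np.mean(np.linalg.norm(d_est - d_gt, axis=1)))
```

A constant offset between the trajectories inflates the ATE but leaves the RPE untouched, which is why the RPE better reflects frame-to-frame estimation quality.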
As no stereo SLAM method dedicated to MIS has open-source code or is evaluated on a public dataset with trajectory ground truth, we compare our method to two general state-of-the-art rigid SLAM methods that include loop closure and are based on the rigid-scene assumption: • ORB-SLAM2 [1], a sparse SLAM method leveraging bundle adjustment to compensate drift, • ElasticFusion [2], a dense SLAM method and as such closer to our proposed method.
In addition, we compare our method to [8] on the frames of the SCARED dataset for which they reported performance. For a fair comparison, we use the same input depth maps for all methods.

Results
Surgical scenarios and ablation study: Tab. 2 reports the performance of our approach on the StereoMIS surgical scenarios. To show the importance of learning the weight maps, we perform an ablation study where we evaluate the impact of (1) constant weights, denoted by ours (w/o weight), where ω_2D(x) = ω_3D(x) = 1 for each x; (2) our method with only 2D residuals, denoted by ours (only 2D); and (3) using only 3D residuals, denoted by ours (only 3D).
Our proposed method outperforms the baselines in all scenarios. Improvements in breathing and scanning are partly due to the correct identification of errors in the optical flow and depth estimation, as well as the optimal balancing of the 2D and 3D residuals. Indeed, exploiting the complementary properties of the 2D and 3D residuals improves the average performance. The fact that ours (only 3D) outperforms ours (only 2D) in breathing and deforming supports our intuition that it is easier to learn tissue deformations from the 3D residuals. Conversely, in scanning, where the optical flow is dominated by the camera motion, the 2D residuals lead to a more accurate pose estimation.
In general, it is not possible to detect or completely compensate for the breathing motion on a frame-to-frame basis in our proposed optimization scheme, as we cannot fully disambiguate camera and tissue motion. However, the method learns which regions are more affected by breathing deformations and consequently assigns smaller weights to those regions.
We note that the weight maps in Fig. 2 (see breathing rows) support our claims: the weight maps have low values in the dark regions (A), where we expect the optical flow to be inaccurate, and where the tissue moves most (B). The scanning example also illustrates that the weight maps respond differently depending on the motion pattern and deformation. Note that the presence of surgical instruments has no influence on the weight maps in scanning, as no tissue interaction takes place. As expected, the largest improvements can be seen in the deforming scenario. Inspecting the last two rows of Fig. 2 reveals that regions where the instrument deforms tissue (C) are correctly ignored in the pose estimation. Similarly, the region occluded by smoke (D) has low values in the weight maps. Additionally, we observe that ω_2D usually has a 100 times larger magnitude than ω_3D, compensating for the different scales of the 2D and 3D residuals.

Full trajectories: As the full test sequences are much longer than those in the scenario experiment, accumulated drift results in a large ATE-RMSE for all methods. Even though our frame-to-frame approach does not include any bundle adjustment or regularization over time, it still has the lowest ATE-RMSE on average. The reason for this good performance is reflected in the relative metrics RPE-trans and RPE-rot, where our method outperforms all others by almost a factor of three and five, respectively. Our method robustly estimates the pose in challenging situations, whereas ORB-SLAM2 fails in two sequences (H2 0, P2 5). Fig. 3 shows two example trajectories. P2 7 does not include any tool-tissue interactions and consists of smooth camera motions. Its trajectory illustrates the drift of our method, which results in an ATE-RMSE of 9.28 mm versus 3.76 mm for ORB-SLAM2. On the other hand, P3 0 consists of strong tissue deformations and abrupt camera movements. Despite visible drift, we can see that our method is able to follow the abrupt movements. The small-scale oscillations in the trajectories are due to breathing motion. The trajectories of all test sequences, as well as evaluation results excluding frames where the SLAM methods fail, can be found in the supplementary materials.

Results on SCARED dataset: Wei et al. reported ATE-RMSE results for rigid surgical scenes of the SCARED dataset using a frame-to-model approach [8]. For the sake of fair comparison, we extend our method to SLAM by accommodating a surfel map model, denoted by ours (frame2model), which is equivalent to that used in [8]. Specifically, we replace the input images I^(l)_{t−1}, I^(r)_{t−1} by images rendered from the surfel map. Note that we can only adopt this approach for the SCARED dataset, as the surfel map model assumes scene rigidity. Results are provided in Tab. 4.

Conclusion
We proposed a visual odometry method for robust pose estimation in the challenging context of endoscopic surgeries. To do so, we learn adaptive weight maps that balance the per-pixel contributions of two geometric losses according to the image content.

Table 4 The ATE-RMSE in mm for SCARED sequences reported by [8] and the micro average over all SCARED test sequences (SCARED avg.) using surfel maps. Columns: d1 k2, d8 k1, d9 k1, d9 k3, avg., SCARED avg.

Fig. 2
Exemplary results for 5 different scenarios. Surgical instruments and specularities are masked out. From left to right: left image, its depth map, its optical flow displacement, as well as its weight maps ω_2D(x) and ω_3D(x). Weight maps are normalized (lowest value in dark blue, highest value in yellow). Best viewed in color.

Table 1
Summary of used notation.

Table 2
The ATE-RMSE (mean±std mm) for the different scenarios from the StereoMIS dataset with average over sequences (micro avg.) and scenarios (macro avg.).