1 Introduction

Visual geometry is one of the few areas of computer vision where traditional approaches have partially resisted the advent of deep learning. However, the community has now developed several deep networks that are very competitive in problems such as ego-motion estimation, depth regression, 3D reconstruction, and mapping. While traditional approaches may still have better absolute accuracy in some cases, these networks have very interesting properties in terms of speed and robustness. Furthermore, they are applicable to cases such as monocular reconstruction where traditional methods cannot be used.

A particularly interesting aspect of the structure-from-motion problem is that it can be used to bootstrap deep neural networks without manual supervision. In fact, several recent papers have shown that it is possible to learn networks for ego-motion and monocular depth estimation only by watching videos from a moving camera (SfMLearner [1]) or a stereo camera pair (MonoDepth [2]). These methods rely mainly on low-level cues such as brightness constancy and only mild assumptions on the camera motion. This is particularly appealing as it allows models to be learned very cheaply, without requiring specialized hardware or setups. This can be used to deploy cheaper and/or more robust sensors, as well as to develop sensors that can automatically learn to operate in new application domains.

Fig. 1. (a) Depth and uncertainty prediction on the KITTI dataset: In addition to monocular depth prediction, we propose to predict photometric and depth uncertainty maps in order to facilitate training from monocular image sequences. (b) Overview of the training data flow: two convolutional neural networks are trained under the supervision of a traditional SfM method, and are combined via a joint loss including photo-consistency terms.

In this paper, we build on the SfMLearner approach and consider the problem of learning from scratch a neural network for ego-motion and monocular depth regression using only unlabelled video data from a single, moving camera. Compared to SfMLearner and similar approaches, we contribute three significant improvements to the learning formulation that allow the method to learn better models.

Our first and simplest improvement is to strengthen the brightness constancy loss by importing into the SfMLearner setup the structural similarity loss used in MonoDepth. Despite its simplicity, this change does improve results.

Our second improvement is to incorporate an explicit model of confidence in the neural network. SfMLearner predicts an “explainability map” whose goal is to identify regions in an image where the brightness constancy constraint is likely to be well satisfied. However, the original formulation is heuristic. For example, the explainability maps must be regularized ad hoc to avoid becoming degenerate. We show that much better results can be obtained by turning explainability into a proper probabilistic model, yielding a self-consistent formulation which measures the likelihood of the observed data. In order to do so, we predict for each pixel a distribution over possible brightnesses, which allows the model to express a degree of confidence in how accurately brightness constancy will be satisfied at a certain image location. For example, this model can learn to expect slight misalignments on objects such as tree branches and cars that could move independently of the camera.

Our third improvement is to integrate another form of cheap supervision in the process. We note that, over the past 20 years, the computer vision community has developed a treasure trove of high-quality handcrafted structure-from-motion (SFM) methods. Thus, it is natural to ask whether these algorithms can be used to teach better deep neural networks. In order to do so, during training we propose to run, in parallel with the forward pass of the network, a standard SFM method. We then require the network to optimize the brightness constancy equation as before and to match motion and depth estimates from the SFM algorithm, in a multi-task setting.

Ideally, we would like the network to ultimately perform better than traditional SFM methods. The question, then, is how such an approach can train a model that outperforms the teacher. There is clearly an opportunity to do so because, while SFM can provide very high-quality supervision when it works, it can also fail badly. For example, feature triangulation may fail near reflections, resulting in inconsistent depth values for certain pixels. Thus, we adopt a probabilistic formulation for the SFM supervisory signal as well. This has the important effect of allowing the model to learn when, and to what extent, it can trust the SFM supervision. In this manner, the deep network can learn failure modalities of traditional SFM, and discount them appropriately while learning.

While we present such improvements in the specific context of 3D reconstruction, we note that the idea of using probabilistic predictions to integrate information from a collection of imperfect supervisory signals is likely to be broadly applicable.

We test our method against SfMLearner, the state of the art in this setting, and show convincing improvements due to our three modifications. The end result is a system that can learn an excellent monocular depth and ego-motion predictor, all without any manual supervision.

2 Related Work

Structure from motion is a well-studied problem in Computer Vision. Traditional approaches such as ORB-SLAM2 [3, 4] are based on a pipeline of matching feature points, selecting a set of inlier points, and optimizing with respect to 3D points and camera positions on these points. Typically, the crucial part of these methods is a careful selection of feature points [5,6,7,8].

More recently, deep learning methods have been developed for learning 3D structure and/or camera motion from image sequences. In [9] a supervised learning method for estimating depth from a single image has been proposed. For supervision, additional information is necessary, either in the form of manual input or, as in [9], laser scanner measurements. Supervised approaches for learning camera poses include [10,11,12].

Unsupervised learning avoids the need for additional input by learning from RGB image sequences only. The training is guided by geometric and photometric consistency constraints between multiple images of the same scene. It has been shown that dense depth maps can be robustly estimated from a single image by unsupervised learning [2, 13], and, furthermore, that depth and camera poses can be estimated jointly [14]. While these methods perform single-image depth estimation, they use stereo image pairs for training. This facilitates training, due to the fixed relative geometry of the two stereo cameras and their simultaneous image acquisition, which yields an effectively static scene.

A more difficult problem is learning structure from motion from monocular image sequences. Here, depth and camera position have to be estimated simultaneously, and moving objects in the scene can corrupt the overall consistency with respect to the world coordinate system. A method for estimating and learning structure from motion from monocular image sequences has been proposed in SfMLearner [1]. Unsupervised learning can be enhanced by supervision in cases where ground truth is partially available in the training data, as has been shown in [15]. Results from traditional SfM methods can be used to guide other methods like 3D localization [16] and prediction of occlusion models [17].

Uncertainty learning for depth and camera pose estimation has been investigated in [18, 19], where different types of uncertainty were studied for depth map estimation, and in [20], where uncertainties for partially reliable ground truth were learned.

3 Method

Let \(\mathbf {x}_t\in \mathbb {R}^{H \times W \times 3}\), \(t\in \mathbb {Z}\) be a video sequence consisting of RGB images captured from a moving camera. Our goal is to train two neural networks. The first, \(\mathbf {d}=\varPhi _\text {depth}(\mathbf {x}_t)\), is a monocular depth estimation network producing as output a depth map \(\mathbf {d}\in \mathbb {R}^{H\times W}\) from a single input frame. The second, \((R_t,T_t : t\in \mathcal {T}) = \varPhi _\text {ego}(\mathbf {x}_t : t\in \mathcal {T})\), is an ego-motion and uncertainty estimation network. It takes as input a short time sequence \(\mathcal {T}=(-T,\dots ,0,\dots ,T)\) and estimates 3D camera rotations and translations \((R_t,T_t),\) \(t\in \mathcal {T}\) for each of the images \(\mathbf {x}_t\) in the sequence. Additionally, it predicts the pose uncertainty, as well as photometric and depth uncertainty maps, which help the overall network to learn about outliers and noise caused by occlusions, specularities and other effects that are hard to handle.

Learning the neural networks \(\varPhi _\text {depth}\) and \(\varPhi _\text {ego}\) from a video sequence without any other form of supervision is a challenging task. However, methods such as SfMLearner [1] have shown that this task can be solved successfully using the brightness constancy constraint as a learning cue. We improve over the state of the art in three ways: by improving the photometric loss that captures brightness constancy (Sect. 3.1), by introducing a more robust probabilistic formulation for the observations (Sect. 3.2) and by using the latter to integrate cues from off-the-shelf SFM methods for supervision (Sect. 3.3).

3.1 Photometric Losses

The most fundamental supervisory signal to learn geometry from unlabelled video sequences is the brightness constancy constraint. This constraint simply states that pixels in different video frames that correspond to the same scene point must have the same color. While this is only true under certain conditions (Lambertian surfaces, constant illumination, no occlusions, etc.), SfMLearner and other methods have shown it to be sufficient to learn the ego-motion and depth reconstruction networks \(\varPhi _\text {ego}\) and \(\varPhi _\text {depth}\). In fact, the outputs of these networks can be used to put pixels in different video frames in correspondence and to test whether their colors match. This intuition can be easily captured in a loss, as discussed below.

Basic Photometric Loss. Let \(\mathbf {d}_0\) be the depth map corresponding to image \(\mathbf {x}_0\). Let \((u,v) \in \mathbb {R}^2\) be the calibrated coordinates of a pixel in image \(\mathbf {x}_0\) (so that (0, 0) is the optical centre and the focal length is unity). Then the coordinates of the 3D point that projects onto (u, v) are given by \(\mathbf {d}(u,v) \cdot (u,v,1)\). If the roto-translation \((R_t,T_t)\) is the motion of the camera from time 0 to time t and \(\pi (q_1,q_2,q_3)=(q_1/q_3,q_2/q_3)\) is the perspective projection operator, then the corresponding pixel in image \(\mathbf {x}_t\) is given by \((u',v') = g(u,v|\mathbf {d},R_t,T_t) = \pi (R_t\, \mathbf {d}(u,v)\,(u,v,1)^\top + T_t)\). Due to brightness constancy, the colors \(\mathbf {x}_0(u,v) = \mathbf {x}_t(g(u,v|\mathbf {d},R_t,T_t))\) of the two pixels should match. We then obtain the photometric loss:

$$\begin{aligned} \mathcal {L} = \sum _{t\in \mathcal {T}-\{0\}} \sum _{(u,v)\in \varOmega } |\mathbf {x}_t(g(u,v|\mathbf {d},R_t,T_t)) - \mathbf {x}_0(u,v)| \end{aligned}$$
(1)

where \(\varOmega \) is a discrete set of image locations (corresponding to the calibrated pixel centres). The absolute value is used for robustness to outliers.

All quantities in Eq. (1) are known except depth and camera motion, which are estimated by the two neural networks. This means that we can write the loss as a function:

$$ \mathcal {L}(\mathbf {x}_t:t\in \mathcal {T}|\varPhi _\text {depth},\varPhi _\text {ego}) $$

This expression can then be minimized w.r.t. \(\varPhi _\text {depth}\) and \(\varPhi _\text {ego}\) to learn the neural networks.
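To make this concrete, the following sketch shows one way to implement the warp \(g\) and the photometric loss (1) in PyTorch. It is a minimal illustration, not the authors' implementation: the function name, the explicit intrinsics matrix K (the text assumes calibrated pixels, i.e. K = I), and the bilinear sampling via `grid_sample` are our own choices.

```python
import torch
import torch.nn.functional as F

def photometric_l1_loss(x0, xt, depth, R, T, K):
    """L1 photometric loss between x0 and x_t warped into frame 0 (Eq. 1).

    x0, xt: (B, 3, H, W) images; depth: (B, 1, H, W) depth map for frame 0;
    R: (B, 3, 3) rotation and T: (B, 3) translation from frame 0 to frame t;
    K: (B, 3, 3) intrinsics (identity for the calibrated coordinates used in
    the text; kept explicit here for clarity).
    """
    B, _, H, W = x0.shape
    # Pixel grid in homogeneous coordinates (u, v, 1).
    v, u = torch.meshgrid(
        torch.arange(H, dtype=x0.dtype, device=x0.device),
        torch.arange(W, dtype=x0.dtype, device=x0.device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1)

    # Back-project: X = d(u, v) * K^{-1} (u, v, 1)^T.
    X = (torch.linalg.inv(K) @ pix) * depth.reshape(B, 1, -1)
    # Rigid motion into frame t, then perspective projection pi(.).
    p = K @ (R @ X + T.reshape(B, 3, 1))
    uv = p[:, :2] / p[:, 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] and warp x_t onto frame 0 by bilinear sampling.
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    xt_warped = F.grid_sample(xt, grid, align_corners=True)

    return (xt_warped - x0).abs().mean()
```

In practice the loss is accumulated over all source frames \(t\in \mathcal {T}-\{0\}\), and pixels that warp outside the image should be masked out.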

Structural-Similarity Loss. Comparing pixel values directly may be too fragile. Thus, we complement the simple photometric loss (1) with the more advanced image matching term used in [2] for the case of stereo camera pairs. Given a pair of image patches \(\mathbf {a}\) and \(\mathbf {b}\), their structural similarity [21] \({\text {SSIM}}(\mathbf {a},\mathbf {b})\in [0,1]\) is given by:

$$ {\text {SSIM}}(\mathbf {a},\mathbf {b}) = \frac{(2\mu _\mathbf {a}\mu _\mathbf {b})(\sigma _{\mathbf {a}\mathbf {b}}+\epsilon )}{(\mu _\mathbf {a}^2+\mu ^2_\mathbf {b})(\sigma _\mathbf {a}^2+\sigma ^2_\mathbf {b}+\epsilon )} $$

where \(\epsilon \) is a small constant to avoid division by zero for constant patches, \(\mu _\mathbf {a}= \frac{1}{n} \sum _{i=1}^n a_i\) is the mean of patch \(\mathbf {a}\), \(\sigma ^2_\mathbf {a}= \frac{1}{n-1} \sum _{i=1}^n (a_i-\mu _\mathbf {a})^2\) is its variance, and \(\sigma _{\mathbf {a}\mathbf {b}} = \frac{1}{n-1} \sum _{i=1}^n (a_i-\mu _\mathbf {a})(b_i-\mu _\mathbf {b})\) is the correlation of the two patches.

This means that the combined structural similarity and photometric loss can be written as \(\mathcal L = \sum _{(u,v)\in \varOmega }\ell (u,v|\mathbf {x},\mathbf {x}')\) where

$$\begin{aligned} \ell (u,v|\mathbf {x},\mathbf {x}') = \alpha \frac{1 - {\text {SSIM}}(\mathbf {x}|_{\varTheta (u,v)},\mathbf {x}'|_{\varTheta (u,v)})}{2} + (1-\alpha ) | \mathbf {x}(u,v) - \mathbf {x}'(u,v)|. \end{aligned}$$
(2)

The weighting parameter \(\alpha \) is set to 0.85.
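A compact sketch of the per-pixel matching term (2) is given below, again only as an illustration: the 3×3 averaging window used for the local statistics and the value of \(\epsilon \) are our own assumptions, and the SSIM expression mirrors the simplified form written above.

```python
import torch.nn.functional as F

def ssim(a, b, eps=1e-4):
    """Simplified SSIM with local statistics from 3x3 average pooling
    (window size and eps are illustrative choices).  A small constant could
    also be added to the mean term for all-black patches."""
    mu = lambda t: F.avg_pool2d(t, 3, stride=1, padding=1)
    mu_a, mu_b = mu(a), mu(b)
    var_a, var_b = mu(a * a) - mu_a ** 2, mu(b * b) - mu_b ** 2
    cov = mu(a * b) - mu_a * mu_b
    return (2 * mu_a * mu_b) * (cov + eps) / \
           ((mu_a ** 2 + mu_b ** 2) * (var_a + var_b + eps))

def matching_term(x0, xt_warped, alpha=0.85):
    """Per-pixel combined SSIM + L1 term of Eq. (2)."""
    return alpha * (1 - ssim(x0, xt_warped)) / 2 \
         + (1 - alpha) * (x0 - xt_warped).abs()
```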

Fig. 2. Image matching: the photometric loss terms penalize high values in the \(\ell _1\) difference (d) and SSIM image matching (e) of the target image (a) and the warped source image (c).

Multi-scale Loss and Regularization. Figure 2 shows an example of \(\ell _1\) and SSIM image matching, computed from ground truth depth and poses for two example images of the Virtual KITTI dataset [22]. Even with ground truth depth and camera poses, perfect image matching cannot be guaranteed.

Hence, for added robustness, Eq. (2) is computed at multiple scales. Further robustness is achieved by a suitable smoothness term for regularizing the depth map which is added to the loss function, as in [2].
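The paper only states that a suitable smoothness term in the spirit of [2] is added; the sketch below shows one common edge-aware variant, as an illustration of what such a regularizer can look like rather than the exact term used here.

```python
import torch

def smoothness_loss(depth, image):
    """Edge-aware first-order smoothness: depth gradients are penalized
    less across strong image edges.  depth: (B, 1, H, W), image: (B, 3, H, W)."""
    dx_d = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy_d = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```

For the multi-scale version, such terms are typically evaluated on progressively downsampled copies of the images and depth maps and the resulting losses are summed.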

3.2 Probabilistic Outputs

The brightness constancy constraint fails whenever one of its several assumptions is violated. In practice, common failure cases include occlusions, changes in the field of view, moving objects in the scene, and reflective materials. The key idea to handle such issues is to allow the neural network to learn to predict such failure modalities. If done properly, this has the important benefit of extracting as much information as possible from the imperfect supervisory signal while avoiding being disrupted by outliers and noise.

General Approach. Consider at first a simple case in which a predictor estimates a quantity \(\hat{y} = \varPhi (x)\), where x is a data point and y its corresponding “ground-truth” label. In a standard learning formulation, the predictor \(\varPhi \) would be optimized to minimize a loss such as \(\ell = |\hat{y}- y|\). However, if we knew that for this particular example the ground truth is not reliable, we could down-weight the loss as \(\ell /\sigma \) by dividing it by a suitable coefficient \(\sigma \). In this manner, the model would be less affected by such noise.

The problem with this idea is how to set the coefficient \(\sigma \). For example, optimizing it to minimize the loss does not make sense as this has the degenerate solution \(\sigma =+\infty \).

An approach is to make \(\sigma \) one of the quantities predicted by the model and use it in a probabilistic output formulation. To this end, let the neural network output the parameters \((\hat{y},\sigma ) =\varPhi (x)\) of a posterior probability distribution \(p(y|\hat{y},\sigma )\) over possible “ground-truth” labels y. For example, using Laplace’s distribution:

$$ p(y|\hat{y},\sigma ) = \frac{1}{2\sigma } \exp \frac{-|y - \hat{y}|}{\sigma }. $$

The learning objective is then the negative log-likelihood arising from this distribution:

$$ - \log p(y|\hat{y},\sigma ) = \frac{|y - \hat{y}|}{\sigma } + \log \sigma + \text {const.} $$

A predictor that minimises this quantity will try to guess \(\hat{y}\) as close as possible to y. At the same time, it will try to set \(\sigma \) to the fitting error it expects. In fact, it is easy to see that, for a fixed \(\hat{y}\), the loss is minimised when \(\sigma = |y - \hat{y}|\), resulting in a log-likelihood value of

$$ - \log p(y|\hat{y},|y - \hat{y}|) = \log |y - \hat{y}| + \text {const.} $$

Note that the model is incentivized to learn \(\sigma \) to reflect as accurately as possible the prediction error. Note also that \(\sigma \) may resemble the threshold in a robust loss such as Huber’s. However, there is a very important difference: it is the predictor itself that, after having observed the data point x, estimates on the fly an optimal data-dependent “threshold” \(\sigma \). This allows the model to perform introspection, thus potentially discounting cases that are too difficult to fit. It also allows the model to learn, and compensate for, cases where the supervisory signal y itself may be unreliable. Furthermore this probabilistic formulation does not have any tunable parameter.
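In code, this negative log-likelihood is simply the \(\ell _1\) residual divided by the predicted scale plus a log term. The sketch below is a minimal version; predicting \(\log \sigma \) instead of \(\sigma \) is a standard numerical convenience assumed here, not something prescribed by the text.

```python
def laplace_nll(y_hat, log_sigma, y):
    """Negative log-likelihood of a Laplace observation model for torch
    tensors (the constant log 2 is dropped)."""
    return (y - y_hat).abs() / log_sigma.exp() + log_sigma
```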

Implementation for the Photometric Loss. For the photometric loss (2), the model above is applied by adding to the network \(\varPhi _\text {ego}\) an additional output \((\sigma _t)_{t \in \mathcal {T}-\{0\}}\), so that, along with the depth map \(\mathbf {d}\) and poses \((R_t,T_t)\), an uncertainty map \(\sigma _t\) for photometric matching is predicted at each pixel. Then the loss is given by

$$ \sum _{t\in \mathcal {T}-\{0\}} \sum _{(u,v)\in \varOmega } \frac{\ell (u,v|\mathbf {x}_0,\mathbf {x}_t \circ g_t)}{\sigma _t(u,v)} + \log \sigma _t(u,v), $$

where \(\ell \) is given by Eq. 2 and \(g_t(u,v) = g(u,v|\mathbf {d},R_t,T_t)\) is the warp induced by the estimated depth and camera pose.
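Combining this with the matching term of Eq. (2) gives a per-pixel uncertainty-weighted photometric loss, sketched below under the same assumptions as above (the uncertainty map is predicted in log-space; `matching_term` refers to the earlier sketch).

```python
def uncertain_photometric_loss(x0, xt_warped, log_sigma_t):
    """Photometric loss of Sect. 3.2 for one source frame t: the residual of
    Eq. (2) is divided by the predicted per-pixel uncertainty, plus log sigma.
    In the full loss this is summed over all t in T \ {0}."""
    resid = matching_term(x0, xt_warped)          # from the sketch above
    return (resid / log_sigma_t.exp() + log_sigma_t).mean()
```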

3.3 Learning SFM from SFM

In this section, we describe our third contribution: learning a deep neural network that distills as much information as possible from a classical (handcrafted) method for SFM. To this end, for each training subsequence \((\mathbf {x}_t:t\in \mathcal {T})\) a standard high-quality SFM pipeline such as ORB-SLAM2 is used to estimate a depth map \(\bar{\mathbf {d}}\) and camera motions \((\bar{R}_t,\bar{T}_t)\). This information can be easily used to supervise the deep neural network by adding suitable losses:

$$\begin{aligned} \mathcal {L}_\text {SFM} = \Vert \bar{\mathbf {d}} - \mathbf {d}\Vert _1 + \Vert \ln \bar{R}_t R_t^\top \Vert _F + \Vert \bar{T}_t - T_t\Vert _2 \end{aligned}$$
(3)

Here \(\ln \) denotes the principal matrix logarithm, which maps the residual rotation to its Lie-algebra (exponential) coordinates, providing a natural metric for small rotations.
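Because the matrix logarithm of a rotation by angle \(\theta \) has Frobenius norm \(\sqrt{2}\,|\theta |\), the rotation term in Eq. (3) can be computed from the trace of the residual rotation without an explicit matrix logarithm; the following sketch (our own formulation, not the authors' code) uses this identity.

```python
import math
import torch

def rotation_log_norm(R_bar, R):
    """||ln(R_bar R^T)||_F for batches of (B, 3, 3) rotation matrices,
    computed via the rotation angle theta = arccos((trace - 1) / 2)."""
    R_rel = R_bar @ R.transpose(-1, -2)
    trace = R_rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    theta = torch.acos(((trace - 1) / 2).clamp(-1 + 1e-7, 1 - 1e-7))
    return math.sqrt(2.0) * theta
```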

While standard SFM algorithms are usually reliable, they are far from perfect. This is particularly true for the depth map \(\bar{\mathbf {d}}\). First, since SFM is based on matching discrete features, \(\bar{\mathbf {d}}\) will not contain depth information for all image pixels. While missing information can easily be handled in the loss, a more challenging issue is that triangulation will sometimes produce incorrect depth estimates, due, for example, to highlights, objects moving in the scene, occlusions, and other challenging visual effects.

In order to address these issues, as well as to automatically balance the losses in a multi-task setting [19], we propose once more to adopt the probabilistic formulation of Sect. 3.2. Thus loss (3) is replaced with

$$\begin{aligned} \mathcal {L}^p_\text {SFM} = \chi _\text {SFM}\Bigg [ \sum _{t\in \mathcal {T}-\{0\}}\bigg [ \frac{\Vert \ln \bar{R}_t R_t^\top \Vert _F}{\sigma _\text {SFM}^{R_t}} + \log \sigma _\text {SFM}^{R_t} + \frac{\Vert \lambda _T\bar{T}_t - T_t\Vert _2}{\sigma _\text {SFM}^{T_t}} + \log \sigma _\text {SFM}^{T_t} \bigg ] \\ + \sum _{(u,v)\in S}\bigg [ \frac{|(\lambda _\mathbf {d}\bar{\mathbf {d}}(u,v))^{-1} - (\mathbf {d}(u,v))^{-1}|}{\sigma _\text {SFM}^\mathbf {d}(u,v)} + \log \sigma _\text {SFM}^\mathbf {d}(u,v)\bigg ] \Bigg ] \end{aligned}$$
(4)

where the pose uncertainties \(\sigma _\text {SFM}^{R_t},\sigma _\text {SFM}^{T_t}\) and the pixel-wise depth uncertainty map \(\sigma _\text {SFM}^\mathbf {d}\) are also estimated as outputs of the neural network \(\varPhi _\text {ego}\) from the video sequence. \(S\subset \varOmega \) is the sparse subset of pixels where depth supervision is available.

The translation and depth values from SFM are multiplied by the scalars \(\lambda _T=\sum _t\Vert T_t\Vert /\sum _t\Vert \bar{T}_t\Vert \) and \(\lambda _\mathbf {d}=\text {median}(\mathbf {d})/\text {median}(\bar{\mathbf {d}})\), respectively, because of the scale ambiguity inherent in monocular SFM. Furthermore, the binary variable \(\chi _\text {SFM}\) denotes whether a corresponding reconstruction from SFM is available. This makes it possible to include training examples for which traditional SFM fails to reconstruct pose and depth. Note that we measure the depth error using inverse depth, in order to obtain a suitable domain of error values. Thus, small depth values, which correspond to points that are close to the camera, receive higher importance in the loss function, while far-away points, which are often less reliable, are down-weighted.

Just as for supervision by the brightness constancy, this allows the neural network to learn about systematic failure modes of the SFM algorithm. Supervision can then avoid to be overly confident about this supervisory signal, resulting in a system which is better able to distill the useful information while discarding noise.
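The depth part of loss (4), including the median-based scale alignment and the inverse-depth residual, can be sketched as follows; the sparse SFM depth is assumed to come with a validity mask marking the set \(S\), and predicting \(\log \sigma \) is again our own convenience.

```python
def sfm_depth_loss(d_pred, d_sfm, log_sigma_d, valid):
    """Depth term of Eq. (4).  d_pred, d_sfm, log_sigma_d: (B, 1, H, W)
    torch tensors; valid: boolean mask of pixels with SFM depth (the set S)."""
    # Scale-align SFM depth to the prediction (monocular scale ambiguity);
    # for brevity the median is taken over the whole batch here.
    lam_d = d_pred.median() / d_sfm[valid].median()
    # Inverse-depth residual: far-away, less reliable points are down-weighted.
    resid = (1.0 / (lam_d * d_sfm[valid]) - 1.0 / d_pred[valid]).abs()
    return (resid / log_sigma_d[valid].exp() + log_sigma_d[valid]).mean()
```

The pose terms of Eq. (4) follow the same pattern, with \(\lambda _T\) computed from the ratio of the summed translation norms.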

4 Architecture Learning and Details

Section 3 discussed two neural networks, one for depth estimation (\(\varPhi _\text {depth}\)) and one for ego-motion and prediction confidence estimation (\(\varPhi _\text {ego}\)). This section provides the details of these networks. An overview of the network architecture and training data flow with combined pose and uncertainty networks is shown in Fig. 1(b). First, we note that, while two different networks are learned, in practice the pose and uncertainty nets share the majority of their parameters. As a trunk, we consider a U-net [23] architecture similar to the ones used in MonoDepth [2] and SfMLearner [1].

Figure 3(a) shows details of the layers of the deep network. The network consists of an encoder and a decoder. The input is a single RGB image, and the output is a map of depth values for each pixel. The encoder is a concatenation of convolutional layers followed by ReLU activations where layers’ resolution progressively decreases and the number of feature channels progressively increases. The decoder consists of concatenated deconvolution and convolution layers, with increasing resolution. Skip connections link encoder layers to decoder layers of corresponding size, in order to be able to represent high-resolution details. The last four convolution layers further have a connection to the output layers of the network, with sigmoid activations.

Fig. 3. Network architecture: (a) Depth network: the network takes a single RGB image as input and estimates pixel-wise depth through 29 layers of convolution and deconvolution. Skip connections between encoder and decoder allow fine-scale details to be recovered. (b) Pose and uncertainty network: input to the network is a short image sequence of variable length. The fourfold output shares a common encoder and splits into pose estimation, pose uncertainty and the two uncertainty maps afterwards. While the photometric uncertainty estimates confidence in the photometric image matching, the depth uncertainty estimates confidence in the depth supervision from SfM.

Figure 3(b) shows details of the pose and uncertainty network layers. The input of the network is an image sequence consisting of the target image \(I_t\), which is also the input of the depth network, and n neighboring views before and after \(I_t\) in the sequence \(\lbrace I_{t-n},\dots ,I_{t-1} \rbrace \) and \(\lbrace I_{t+1},\dots ,I_{t+n} \rbrace \), respectively. The output of the network is the relative camera pose for each neighboring view with respect to the target view, two uncertainty values for the rotation and translation, respectively, and pixel-wise uncertainties for photo-consistency and depth. The different outputs share a common encoder, which consists of convolution layers, each followed by a ReLU activation. The pose output is of size \(2n\times 6\), representing a 6 DoF relative pose for each source view, each consisting of a 3D translation vector and 3 Euler angles representing the camera rotation matrix, as in [1]. The uncertainty output is threefold, consisting of pose, photometric, and depth uncertainty. The pose uncertainty shares weights with the pose estimation, and yields a \(2n\times 2\) output representing translational and rotational uncertainty for each source view. The pixel-wise photometric and depth uncertainties each consist of a concatenation of deconvolution layers of increasing width. All uncertainties are activated by a sigmoid activation function.
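As a rough illustration of the output interface just described (not the actual architecture), the sketch below shows output heads with the stated shapes: a \(2n\times 6\) pose output and a \(2n\times 2\) pose-uncertainty output. The shared encoder, the deconvolutional decoders for the two uncertainty maps, and all layer sizes are omitted or invented for the example.

```python
import torch
import torch.nn as nn

class PoseAndUncertaintyHeads(nn.Module):
    """Output heads only: pose (2n x 6) and pose uncertainty (2n x 2) per
    sample, given encoder features with c channels (c and the global average
    pooling are illustrative assumptions)."""
    def __init__(self, n, c=256):
        super().__init__()
        self.n = n
        self.pose_head = nn.Conv2d(c, 2 * n * 6, kernel_size=1)
        self.pose_sigma_head = nn.Conv2d(c, 2 * n * 2, kernel_size=1)

    def forward(self, feat):                      # feat: (B, c, h, w)
        B = feat.shape[0]
        pose = self.pose_head(feat).mean(dim=(2, 3)).reshape(B, 2 * self.n, 6)
        sigma = torch.sigmoid(
            self.pose_sigma_head(feat).mean(dim=(2, 3))).reshape(B, 2 * self.n, 2)
        return pose, sigma
```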

A complete description of the network architecture is provided in the supplementary material.

5 Experiments

We compare results of the proposed method to SfMLearner [1], which is, to our knowledge, the only method that estimates monocular depth and relative camera poses from monocular training data only. The experiments show that our method achieves better results than SfMLearner.

Table 1. Depth evaluation in comparison to SfMLearner: we evaluate the three contributions image matching, photometric uncertainty, and depth and pose from SfM. Each of these shows an improvement over the current state of the art. Training datasets are KITTI (K), Virtual KITTI (VK) and Cityscapes (CS). Rows 1–7 are trained on KITTI.

5.1 Monocular Depth Estimation

For training and testing monocular depth we use the Eigen split of the KITTI raw dataset [24], as proposed by [9]. This yields a split of 39835 training, 4387 validation, and 697 test images. We only use monocular sequences for training. Training is performed on sequences of three images, where depth is estimated for the centre image.

The state of the art in learning depth maps from a single image using only monocular sequences for training is SfMLearner [1]; therefore we compare to this method in our experiments. The laser scanner measurements are used as ground truth for testing only. The predicted depth maps are multiplied by a scalar \(s = \text {median}(\mathbf {d}^*) / \text {median}(\mathbf {d})\) before evaluation. This is done in the same way as in [1], in order to resolve the scale ambiguity which is inherent to monocular SfM.

Table 1 shows a quantitative comparison of SfMLearner with the different contributions of the proposed method. We compute the error measures used in [9] to compare predicted depth \(\mathbf {d}\) with ground truth depth \(\mathbf {d}^*\):

  • Absolute relative difference (abs. rel.): \(\tfrac{1}{N}\sum _{i=1}^N|\mathbf {d}_i-\mathbf {d}^*_i| / \mathbf {d}^*_i\)

  • Squared relative difference (sq. rel.): \(\tfrac{1}{N}\sum _{i=1}^N|\mathbf {d}_i-\mathbf {d}^*_i|^2 / \mathbf {d}^*_i\)

  • Root mean square error (RMSE): \(\big (\tfrac{1}{N}\sum _{i=1}^N|\mathbf {d}_i-\mathbf {d}^*_i|^2\big )^{1/2}\)

The accuracy measures give the percentage of depth values \(\mathbf {d}_i\) for which \(\delta = \max \left( \mathbf {d}_i/\mathbf {d}^*_i,\mathbf {d}^*_i/\mathbf {d}_i\right) \) is less than a threshold, where we use the same thresholds as in [9].
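For reference, these measures (together with the median scaling described above) can be computed as in the following NumPy sketch; the accuracy thresholds 1.25, 1.25², 1.25³ are the customary ones from [9].

```python
import numpy as np

def depth_metrics(d_pred, d_gt):
    """Error and accuracy measures of [9] over valid (ground-truth) pixels."""
    # Median scaling to resolve the monocular scale ambiguity (Sect. 5.1).
    d_pred = d_pred * (np.median(d_gt) / np.median(d_pred))
    abs_rel = np.mean(np.abs(d_pred - d_gt) / d_gt)
    sq_rel = np.mean((d_pred - d_gt) ** 2 / d_gt)
    rmse = np.sqrt(np.mean((d_pred - d_gt) ** 2))
    delta = np.maximum(d_pred / d_gt, d_gt / d_pred)
    acc = [np.mean(delta < t) for t in (1.25, 1.25 ** 2, 1.25 ** 3)]
    return abs_rel, sq_rel, rmse, acc
```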

Fig. 4. Comparison to SfMLearner and ground truth on test images from KITTI.

We compare to the error measures given in [1], as well as to a newer version of SfMLearner provided on the website. We also compare to running the code downloaded from this website, as we obtained slightly different results; we use this as the baseline for our method. These evaluation results are shown in rows 1–3 of Table 1. Rows 4–7 refer to our implementation as described in Sect. 3, where the changes referred to in each row add to the previous row. The results show that structural similarity based image matching improves over the brightness constancy loss as used in SfMLearner. The photometric uncertainty is able to improve accuracy while giving slightly worse results on the RMSE, as the method is able to allow for higher errors in parts of the image domain. A more substantial improvement is obtained by adding pose and depth supervision from SFM; in these experiments we used predictions from ORB-SLAM2 [4]. Numbers in bold indicate best performance for training on KITTI. The last three rows show results on the same test set (KITTI Eigen split) for the final model with pose and depth from SfM, trained on Virtual KITTI (VK) [22], Cityscapes (CS) [25], and pre-training on Cityscapes with fine-tuning on KITTI (CS+K).

Fig. 5. Training on KITTI and testing on different datasets yields visually reasonable results.

Figure 4 shows a qualitative comparison of depth predicted by SfMLearner against ground truth measurements from a laser scanner. Since the laser scanner measurements are sparse, we densify them for better visualization. While SfMLearner robustly estimates depth, our proposed approach is able to recover many more small-scale details from the images. The last row shows a typical failure case, where the estimated depth is less accurate in regions like car windows. Figure 5 shows a qualitative evaluation of depth prediction for different datasets. The model trained on KITTI was tested on images from Cityscapes [25], Virtual KITTI [22], Oxford RobotCar [26] and Make3D [27]. Test images were cropped to match the aspect ratio of the KITTI training data. These results show that the method is able to generalize to unknown scenarios and camera settings.

5.2 Uncertainty Estimation

Figure 6 shows example visualizations of the photometric and depth uncertainty maps for some of the images from the KITTI dataset. The color bar indicates high uncertainty at the top and low uncertainty at the bottom. We observe that high photometric uncertainty typically occurs in regions with vegetation, where matching is hard due to repetitive structures, and in regions with specularities which corrupt the brightness constancy assumption, for example car windows or lens flares. High depth uncertainty typically occurs on moving objects such as cars. We further observe that the network often seems to be able to discern between moving and stationary cars.

Fig. 6. Prediction of uncertainty maps: the pixel-wise estimated uncertainty maps allow for higher errors in the image matching at regions with high uncertainty, leading to improved overall network performance. We observe that the photometric uncertainty maps (b) tend to predict high uncertainty for reflective surfaces, lens flares, vegetation, and at the image borders, as these induce high photometric errors when matching subsequent frames. The depth uncertainty maps (c) tend to predict high uncertainties for potentially moving objects, and the sky, where depth values are less reliable. The network seems to be able to discern between moving and stationary cars.

Figure 7 shows rotational, translational, depth and photometric uncertainty plotted against their respective errors. The plots show that the predicted uncertainties tend to be low where the corresponding errors are small and high where they are large.

5.3 Camera Pose Estimation

We trained and tested the proposed method on the KITTI odometry dataset [28], using the same split of training and test sequences as in [1]: sequences 00–08 for training and sequences 09–10 for testing, using the left camera images of all sequences only. This gives a split of 20409 training images and 2792 test images. The ground truth odometry provided in the KITTI dataset is used for evaluation purposes only. Again, depth and pose from SFM are obtained from ORB-SLAM2 [4].

Fig. 7. Uncertainty of rotation, translation, depth, and photo-consistency versus the respective error term. The plots show a correspondence between uncertainty and error.

Table 2. Left: Odometry evaluation in comparison to SfMLearner for the two test sequences 09 and 10. The proposed threefold contributions yield an improvement to the state of the art in Seq. 09 and comparable results in Seq. 10. Right: Concatenated poses with color coded pose uncertainty (green = certain, red = uncertain) for Seq. 09.

Table 2 shows a comparison to SfMLearner, with numbers as given in the paper and on the website, for the two test sequences 09 and 10. For odometry evaluation, a sequence length of 5 images has been used for training and testing. The error measure is the Absolute Trajectory Error (ATE) [29] on the 5-frame snippets, averaged over the whole sequence; the same error measure was used in [1]. We compare results from SfMLearner, as stated in the paper and on the website, to the proposed method with uncertainties and depth and pose supervision from SfM. Furthermore, we compare to the traditional methods ORB-SLAM (results as provided in [1]) and DSO [30]. “Full” refers to reconstruction from all images, and “short” refers to reconstruction from snippets of 5 frames. For DSO we were not able to obtain results for short sequences, as its initialization is based on 5–10 keyframes.
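For completeness, a minimal sketch of the snippet-wise ATE computation is given below. A single least-squares scale factor aligns each predicted 5-frame snippet to the ground truth before computing the RMSE of the positions; this mirrors the snippet protocol described above, though the exact alignment of [1, 29] may differ in detail.

```python
import numpy as np

def snippet_ate(p_pred, p_gt):
    """ATE for one snippet of camera positions, shape (5, 3), both expressed
    relative to the same reference frame of the snippet."""
    scale = np.sum(p_gt * p_pred) / max(np.sum(p_pred ** 2), 1e-12)
    err = scale * p_pred - p_gt
    return np.sqrt((err ** 2).sum(axis=1).mean())
```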

6 Conclusions

In this paper we have presented a new method for simultaneously estimating depth maps and camera positions from monocular image sequences. The method builds on SfMLearner and uses only monocular RGB image sequences for training.

We have improved this baseline in three ways: by improving the image matching loss, by incorporating a probabilistic model of observation confidence and, extending the latter, by leveraging a standard SFM method to help supervise the deep network. Experiments show that our contributions lead to substantial improvements over the current state of the art, both for the estimation of depth maps and for odometry from monocular image sequences.