1 Introduction

Triggered by the seminal work of Eigen et al. [6], the estimation of depth maps from just a single image has become a popular tool in computer vision. Depth estimation from a single image is well known to be highly ambiguous and only works due to strong conditional priors learned from previously seen data. The work of Eigen et al. [6] demonstrated the superiority of deep networks over previous attempts with hand-crafted features [23]. The priors learned by a deep network yield an unprecedented accuracy and generality compared to all previous approaches to single-view depth estimation. Of course, the accuracy is far from being competitive with multi-view reconstruction, which becomes evident when visualizing the depth maps in the form of a point cloud. Nevertheless, estimating the depth from single images is no longer a toy problem but is used in places where dense multi-view reconstruction is not directly applicable, for instance to initialize a monocular SLAM method [26] or for rough but dense depth estimates in autonomous driving [28].

Often in these applications, an image sequence rather than a single image is available, yet these additional images are typically not exploited. There are two sources of information that are inherent to a sequence of images: (1) the motion parallax due to a moving camera; (2) the temporal consistency of successive frames. Exploiting the motion parallax has been approached by Ummenhofer et al. [29] for two frames. The motion parallax was also used by Zhou et al. [32] for unsupervised learning of depth from a single image. However, there is no approach yet that exploits the temporal consistency of successive frames.

In this work, we propose a network architecture based on convolutional LSTMs to capture temporal information from previous frames and to enforce temporally consistent depth estimates in a video. We show that such a network improves depth estimation over independent frame processing, both relative to a single-frame baseline and relative to the state of the art. While it is well known that temporal consistency has only relatively small effects on standard performance metrics based on average statistics, the qualitative improvement is much larger due to more stable estimates that do not flicker; see the supplemental video. The temporal consistency is also advantageous when combining multiple depth maps into a joint point cloud. As the depth estimates of successive frames agree more, the resulting point cloud is more consistent, too.

2 Related Work

End-to-end depth estimation from a single image with convolutional neural networks was introduced by Eigen et al. [5, 6]. These works predate today's most common convolutional encoder-decoder architectures, such as FCN [19] and U-Net [22]. They use a multi-scale architecture for depth estimation at different spatial resolutions. Joint depth and normal prediction further improved the depth estimation results.

Liu et al. [18] combined convolutional networks with superpixel-based conditional random fields. Chakrabarti et al. [3] generate a mid-level representation of the depth with a deep network and find the best matching depth values in a post-processing step. At present, Laina et al. [16] yield the best results on benchmarks. This was achieved by replacing the standard convolutional encoder with a ResNet-50 architecture [11] and by a set of computationally efficient up-sampling blocks.

The methods above process each frame independently, even when processing a video. We propose the use of LSTM units [9, 12] to relate the intermediate representation in the network across frames in order to obtain temporally consistent outputs. In particular, we use a convolutional LSTM architecture [31]. The convolutional version of LSTMs captures the spatial context of the input tensor and keeps the number of parameters limited.

Fig. 1. Overview of the multi-resolution recurrent network architecture. The architecture consists of three encoder-decoder networks that predict the depth at different resolution levels. The low-resolution prediction is fed as input to the next, higher-resolution stream. The first two streams also estimate the normals in addition to the depth. The convolutional LSTM layers carry the state from the previous frame. There are residual connections within each encoder-decoder stream and also between the streams.

3 Network Architecture

The architecture in Fig. 1 integrates a typical convolutional encoder-decoder structure with convolutional LSTM layers (in brown) to analyze the image at various levels of abstraction, where for each level the state representation from the previous frame is carried over to the present frame. This combination with the previous state enforces the temporal consistency of the states and, consequently, also of the output. Like all common encoder-decoder networks, the architecture has residual connections to directly propagate high-resolution features from the encoder to the respective layer in the decoder. The spatial resolution of all convolutional filters is \(3\times 3\); the up-convolution filters have size \(4\times 4\), and for the LSTM we use \(5\times 5\) filters.

In addition to the multi-scale analysis due to the encoder-decoder architecture, the architecture includes a coarse-to-fine refinement strategy, which first produces the output depth map at a lower resolution (with a loss applied during training). The low-resolution result is successively refined by the next encoder-decoder stream of the multi-resolution architecture until the resolution of the input image is obtained at the output. This coarse-to-fine strategy efficiently implements the network stacking idea, which has been successful for optical flow estimation [13] and depth from two views [29]. We also added recurrent connections between the layers of the different streams. For the intermediate resolutions, the network also computes the surface normals (with a loss applied during training), which is helpful to learn the depth representation for the surfaces in the scene.
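For illustration, the following is a minimal sketch of this coarse-to-fine wiring in PyTorch-style Python. It is not the original implementation: the stream modules are placeholders, and their interface (a per-stream recurrent state and a depth prediction returned at the input resolution) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineDepthNet(nn.Module):
    """High-level sketch of the coarse-to-fine wiring of Fig. 1. The three
    `streams` are placeholders for the recurrent encoder-decoder streams;
    their internals, the handling of resolutions, and the state interface
    are assumptions for illustration. The normal outputs of the first two
    streams are omitted for brevity."""

    def __init__(self, streams):
        super().__init__()
        self.streams = nn.ModuleList(streams)  # ordered from coarse to fine

    def forward(self, image, states):
        prediction, predictions, new_states = None, [], []
        for k, stream in enumerate(self.streams):
            x = image
            if prediction is not None:
                # Feed the upsampled coarser depth prediction as an
                # additional input channel to the next stream.
                up = F.interpolate(prediction, size=image.shape[-2:],
                                   mode='bilinear', align_corners=False)
                x = torch.cat([image, up], dim=1)
            # Each stream carries its own convolutional LSTM state
            # from the previous frame (Sect. 3.1).
            prediction, state_k = stream(x, states[k])
            predictions.append(prediction)
            new_states.append(state_k)
        return predictions, new_states
```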

3.1 Convolutional LSTM with Leaky ReLU

Recurrent neural network architectures have been shown to leverage temporal data for tasks such as language processing [4] and video captioning [2]. We build upon the Long Short-Term Memory (LSTM) unit [12] to enable temporally consistent video depth prediction.

In convolutional LSTMs, the input tensor \(h^{l-1}_t\) is concatenated with the hidden state tensor \(h^{l}_{t-1}\) before the convolution operation is applied. The leaky ReLU has been shown to improve performance compared to the tanh activation function in many previous works. Thus, we also use it in the present work. However, our experiments showed that the convolutional LSTM with leaky ReLU is less stable than the LSTM with tanh activation and numerically explodes when processing longer sequences during testing; see Fig. 2. This stability problem is solved by adding layer normalization [1] on the cell state \(c_t^l\). The normalization layer also allows for faster convergence.

Fig. 2. Network stability at test time with different activation functions used in the LSTM unit. The leaky ReLU activation yields better results than the tanh activation, but the network becomes unstable over time. Adding layer normalization retains the good performance of the leaky ReLU while being as stable as the tanh activation.

Formally, the convolutional LSTM with leaky ReLU and layer normalization reads:

$$\begin{aligned}&\left( \begin{array}{c} i \\ f \\ o \\ g \end{array} \right) = \left( \begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \text {leakyReLU} \end{array} \right) \circ W^{l} * \left( \begin{array}{c} h_t^{l-1} \\ h_{t-1}^{l} \end{array} \right) , \end{aligned}$$
(1)
$$\begin{aligned}&c_t^l=f \cdot c_{t-1}^l +i\cdot g,\end{aligned}$$
(2)
$$\begin{aligned}&\hat{c}_t^l=\gamma \left( \frac{c_t^l-\mu (c_t^l)}{\sqrt{\text {var}(c_t^l)}}\right) +\beta ,\end{aligned}$$
(3)
$$\begin{aligned}&h_t^l=o \cdot \text {leakyReLU}(\hat{c}_t^l), \end{aligned}$$
(4)

where i, f, o, g are the input, forget, and output gates and the new input, respectively; \(c_t^l\) and \(c_{t-1}^l\) are the cell states for the current and previous time steps; \(\gamma \) and \(\beta \) are the learned parameters of the layer normalization [1]; and \(\mu \) and \(\text {var}\) are the mean and variance of the argument, computed over each single data sample.
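The following is a minimal sketch of such a cell in PyTorch-style Python. It follows Eqs. (1)–(4) and the \(5\times 5\) kernel size stated above; the class and variable names, the negative slope of the leaky ReLU, and the small epsilon inside the normalization are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNormConvLSTMCell(nn.Module):
    """Sketch of a convolutional LSTM cell with leaky ReLU and layer
    normalization on the cell state, following Eqs. (1)-(4)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=5):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gates (i, f, o, g) at once, Eq. (1).
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size, padding=padding)
        # gamma and beta of the layer normalization, Eq. (3).
        self.gamma = nn.Parameter(torch.ones(1, hidden_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))

    def forward(self, x, h_prev, c_prev):
        # Concatenate input tensor and previous hidden state, Eq. (1).
        gates = self.conv(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = F.leaky_relu(g, negative_slope=0.1)        # assumed slope
        # Cell state update, Eq. (2).
        c = f * c_prev + i * g
        # Layer normalization over each single data sample, Eq. (3);
        # the epsilon is added here only for numerical safety.
        mu = c.mean(dim=(1, 2, 3), keepdim=True)
        var = c.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        c_hat = self.gamma * (c - mu) / torch.sqrt(var + 1e-5) + self.beta
        # Hidden state, Eq. (4).
        h = o * F.leaky_relu(c_hat, negative_slope=0.1)
        return h, c
```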

4 Training Procedure

We train the recurrent network with a batch size of 2 and a sequence length of 7 by unrolling the network. For each frame the network generates a depth map; we apply a loss on each of the outputs.
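A minimal sketch of one unrolled training step is given below (PyTorch-style); the model and loss interfaces, including the `initial_states` helper, are hypothetical.

```python
def training_step(model, optimizer, frames, targets, total_loss):
    """Sketch of one unrolled training step: `frames` and `targets` are lists
    of 7 tensors, each with batch size 2. `model`, `total_loss`, and the
    `initial_states` helper are hypothetical interfaces."""
    optimizer.zero_grad()
    states = model.initial_states(batch_size=2)   # assumed helper
    loss = 0.0
    for frame, target in zip(frames, targets):
        # The recurrent states are carried over from frame to frame.
        outputs, states = model(frame, states)
        # A loss is applied to the output of every frame.
        loss = loss + total_loss(outputs, target)
    loss.backward()                               # backpropagation through time
    optimizer.step()
    return float(loss)
```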

4.1 Datasets

We train and evaluate our recurrent network on static and dynamic indoor sequences. For the static sequences we use the NYUv2 [24] and SUN3D [30] datasets. Both datasets feature indoor video sequences of offices, living rooms, etc. filmed with a structured light sensor. We use the raw depth from the sensor for NYUv2 and the TSDF fused depth provided by SUN3D. The NYUv2 dataset has 249 training video sequences and 215 test sequences. We use the SUN3D dataset split proposed by Ummenhofer et al. [29] which has 253 sequences for training and 16 for testing.

For dynamic scene experiments we use the Princeton Tracking Benchmark [25]. The dataset consists of indoor scenes with dynamic objects captured with the Kinect sensor. We use 96 sequences for training and four sequences for testing.

4.2 Initialization and Training Strategy

We initialize the network weights using Xavier initialization [8] with the modifications proposed by [10] for ReLU functions. We normalize the input image values to the range [−0.5, 0.5] and use inverse depth values \(\xi =1/z\) to parameterize the depth. Inverse depth emphasizes close objects, yielding more precise predictions for them, and allows us to represent points at infinity.

For training we use nearest-neighbour sampling to resize the ground truth depth maps to \(256\times 192\). On NYUv2 we first crop images to \(561\times 427\) before downsampling. The output depth has the same resolution as the input.
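A minimal preprocessing sketch under these settings (Python with NumPy and OpenCV); the center placement of the NYUv2 crop and the interpolation used for the color image are assumptions not stated above.

```python
import numpy as np
import cv2  # OpenCV, used here only for illustration

def preprocess(image, depth, crop_nyu=False):
    """Sketch of the preprocessing described above."""
    if crop_nyu:
        # Crop NYUv2 frames to 561x427 (assumed: a center crop).
        h, w = image.shape[:2]
        top, left = (h - 427) // 2, (w - 561) // 2
        image = image[top:top + 427, left:left + 561]
        depth = depth[top:top + 427, left:left + 561]
    # Normalize RGB values from [0, 255] to [-0.5, 0.5].
    image = image.astype(np.float32) / 255.0 - 0.5
    image = cv2.resize(image, (256, 192), interpolation=cv2.INTER_LINEAR)
    # Nearest-neighbour resize of the ground truth depth to 256x192.
    depth = cv2.resize(depth, (256, 192), interpolation=cv2.INTER_NEAREST)
    # Inverse depth parameterization; invalid (zero) depths stay invalid.
    valid = depth > 0
    inv_depth = np.zeros_like(depth, dtype=np.float32)
    inv_depth[valid] = 1.0 / depth[valid]
    return image, inv_depth, valid
```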

We use ADAM [15] with restarts. The restart technique was proposed for SGD optimization and allows for faster convergence compared to fixed learning rate schedules [20]. The starting learning rate for each restart is \(10^{-4}\), and it drops to \(10^{-6}\) at the end of each period. The first restart interval is 10000 iterations and increases by a factor of 1.5 at each restart.
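As an illustration, a possible implementation of this schedule is sketched below; the cosine decay within each period is an assumption borrowed from SGDR [20], since only the start and end learning rates and the period lengths are specified above.

```python
import math

def learning_rate(iteration, base_lr=1e-4, min_lr=1e-6,
                  first_period=10000, growth=1.5):
    """Sketch of the restart schedule: lr goes from base_lr to min_lr within
    each period; the first period has 10000 iterations and each subsequent
    period is 1.5 times longer. The cosine shape is an assumption."""
    period, start = first_period, 0
    while iteration >= start + period:
        start += period
        period = int(period * growth)
    progress = (iteration - start) / period  # position within the current period
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```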

To avoid overfitting to very long sequences in the training set, we iterate in random order over the set of sequences and sample from each sequence a random segment with 7 frames. This allows the network to see an equal number of training samples from each sequence. Further, we augment the segments by randomly skipping frames with a probability of 0.5.
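A minimal sketch of this sampling and frame-skipping augmentation follows; the exact scheme (how the start frame is chosen and how the segment length of 7 is guaranteed) is an assumption.

```python
import random

def sample_segment(sequence, length=7, skip_prob=0.5):
    """Sketch of random segment sampling with frame skipping."""
    # Choose a start frame so that at least `length` frames remain.
    start = random.randrange(max(1, len(sequence) - length + 1))
    frames, i = [], start
    while len(frames) < length and i < len(sequence):
        remaining, needed = len(sequence) - i, length - len(frames)
        # Skip a frame with probability 0.5 unless that would leave too few.
        if remaining <= needed or random.random() >= skip_prob:
            frames.append(sequence[i])
        i += 1
    return frames
```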

4.3 Loss Functions

On the depth output, we combine two loss functions. First, we use an L1 loss on the inverse depth

$$\begin{aligned} L_\text {depth}=\sum _{i,j}|\xi (i,j) - \hat{\xi }(i,j)|, \end{aligned}$$
(5)

where \(\hat{\xi }(i,j)\) is the ground truth inverse depth. Second, we compute the scale-invariant gradient loss [29] for the inverse depth

$$\begin{aligned} L_\text {grad}=\sum _{h\in \{1,2,4,8,16\}} \sum _{i,j} ||g_{h}[\xi ](i,j)-g_{h}[\hat{\xi }] (i,j)||_2, \end{aligned}$$
(6)

where \(g_{h}[\xi ](i,j)\) is the discrete scale invariant gradient:

$$\begin{aligned} \textstyle { g_h[f](i,j)=\left( \frac{f(i+h,j)-f(i,j)}{|f(i+h,j)|+|f(i,j)|},\frac{f(i,j+h)-f(i,j)}{|f(i,j+h)| + |f(i,j)|}\right) . } \end{aligned}$$
(7)

In (6) we sum the gradient differences over five discretization widths h to cover gradients with different slopes. The gradient loss significantly improves the smoothness of the depth values while preserving sharp depth edges.
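A sketch of Eqs. (6) and (7) in PyTorch-style Python is given below; the small epsilon and the cropping of the two gradient components to a common size are implementation details added here, not part of the formulation above.

```python
import torch

def scale_invariant_gradient(f, h, eps=1e-6):
    """Discrete scale-invariant gradient g_h[f] of Eq. (7) for tensors of
    shape (B, 1, H, W)."""
    dx = (f[..., :, h:] - f[..., :, :-h]) / (
        f[..., :, h:].abs() + f[..., :, :-h].abs() + eps)
    dy = (f[..., h:, :] - f[..., :-h, :]) / (
        f[..., h:, :].abs() + f[..., :-h, :].abs() + eps)
    # Crop both components to the same spatial size so they form a 2-vector.
    return dx[..., :-h, :], dy[..., :, :-h]

def gradient_loss(pred, gt, spacings=(1, 2, 4, 8, 16)):
    """Sketch of the scale-invariant gradient loss of Eq. (6) on inverse depth."""
    loss = 0.0
    for h in spacings:
        pdx, pdy = scale_invariant_gradient(pred, h)
        gdx, gdy = scale_invariant_gradient(gt, h)
        # L2 norm of the gradient difference, summed over all pixels.
        loss = loss + torch.sqrt((pdx - gdx) ** 2 + (pdy - gdy) ** 2 + 1e-12).sum()
    return loss
```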

The loss on the normals is the non-squared L2 norm

$$\begin{aligned} L_\text {normal}=\sum \nolimits _{i,j}||n(i,j)-\hat{n}(i,j)||_2, \end{aligned}$$
(8)

where \(n(i,j)\) is the normal predicted by the network and \(\hat{n}(i,j)\) is the ground truth normal, which we derive from the ground truth depth maps.

To balance the importance of the loss functions we use different weights: 300 for the L1 depth loss, 1500 for the scale-invariant gradient loss, and 100 for the loss on the normals. The weights were set empirically.

For the first 10000 iterations we set the weight for the gradient loss to zero, because the scale invariance of \(L_\text {grad}\) can cause instabilities directly after weight initialization. During training and evaluation, we do not consider pixels with invalid depth values.
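Putting the pieces together, a sketch of the weighted total loss is given below (reusing `gradient_loss` from the sketch above); the tensor shapes and the masking of invalid pixels inside each term are assumptions about details the text leaves open.

```python
import torch

def combined_loss(pred_inv_depth, gt_inv_depth, pred_normals, gt_normals,
                  valid_mask, iteration):
    """Sketch of the weighted training loss. Assumed shapes: inverse depth
    (B, 1, H, W), normals (B, 3, H, W), valid_mask (B, 1, H, W) boolean."""
    mask = valid_mask.float()
    # L1 loss on the inverse depth, Eq. (5), over valid pixels only.
    l_depth = ((pred_inv_depth - gt_inv_depth).abs() * mask).sum()
    # Scale-invariant gradient loss, Eq. (6); zero weight for the first
    # 10000 iterations as described above.
    if iteration >= 10000:
        l_grad = gradient_loss(pred_inv_depth * mask, gt_inv_depth * mask)
    else:
        l_grad = torch.zeros((), device=pred_inv_depth.device)
    # Non-squared L2 loss on the normals, Eq. (8).
    l_normal = torch.sqrt(((pred_normals - gt_normals) ** 2).sum(dim=1) + 1e-12)
    l_normal = (l_normal * mask.squeeze(1)).sum()
    # Empirical weights from the text.
    return 300.0 * l_depth + 1500.0 * l_grad + 100.0 * l_normal
```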

4.4 Error Metrics

To quantify the quality of the predicted depth maps we compute several common error metrics:

L1 inverse error: \(\; L1\text{-}inv (z,\hat{z})=\frac{1}{N}\sum _i \left| \frac{1}{z_i}-\frac{1}{\hat{z}_i}\right| \)

Root mean squared error (RMS): \(\;RMS(z,\hat{z})= \sqrt{ \frac{1}{N}\sum _i({z_i}-\hat{z}_i)^{2} }\)

Average \(\log _{10}\) error (log10): \(\;log10(z,\hat{z})= \frac{1}{N}\sum _i|\log _{10}{z_i}-\log _{10}\hat{z}_i|\)

Percentage of pixels below a ratio threshold \(\theta \): the fraction of pixels for which \(\;\delta (z_i,\hat{z}_i)=\max \left( \frac{z_i}{\hat{z}_i},\frac{\hat{z}_i}{z_i}\right) < \theta \).

Here \(z_i\) is the depth prediction, \(\hat{z_i}\) is the ground truth depth, and N is the number of valid pixels in a depth map.
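For completeness, a sketch of these metrics for a single depth map in Python/NumPy; the validity mask and the commonly used threshold values \(\theta \in \{1.25, 1.25^2, 1.25^3\}\) are assumptions.

```python
import numpy as np

def error_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Sketch of the error metrics above for one depth map; evaluation is
    restricted to valid ground-truth pixels and predictions are assumed
    to be strictly positive."""
    valid = gt > 0
    z, z_hat = pred[valid], gt[valid]
    l1_inv = np.mean(np.abs(1.0 / z - 1.0 / z_hat))
    rms = np.sqrt(np.mean((z - z_hat) ** 2))
    log10 = np.mean(np.abs(np.log10(z) - np.log10(z_hat)))
    # Ratio-threshold accuracy: fraction of pixels with delta below theta.
    delta = np.maximum(z / z_hat, z_hat / z)
    accuracies = [np.mean(delta < t) for t in thresholds]
    return l1_inv, rms, log10, accuracies
```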

5 Experiments

5.1 Choice of the Activation Function

To quantify the behaviour of the LSTM with different activation functions and normalization strategies, we first ran experiments with a simplified network architecture, which consists of only one encoder with 10 layers and one decoder with 13 layers. We also compared against the same network with the LSTM layers removed, i.e., a network that does not take information from previous frames into account. Results on the SUN3D dataset for a sequence length of 6 are shown in Table 1. The recurrent architecture improves over the single-frame baseline on all metrics, and the leaky ReLU activation always outperforms the tanh activation. Thus, for the following experiments, we always used the leaky ReLU activation with normalization.

Table 1. Comparison of the recurrent architecture to its non-recurrent baseline for two different activation functions in the LSTM unit on the SUN3D dataset. The recurrent architectures improve over the single-frame baseline on all metrics. The leaky ReLU activation unit always outperforms tanh activation.

5.2 Comparison to the State of the Art

In Table 2, we compare the full network architecture to the state of the art in depth estimation from a single image. We trained the network for the first six restarts on the SUN3D dataset and then fine-tuned it for the last restart on the NYUv2 training data. For the evaluation, we randomly sampled 50 sequences from the NYUv2 test set and evaluated on the first 50 frames of each sequence. We evaluated only in the regions with valid depth values.

Table 2. Comparison to the state of the art on 50 sequences from the NYUv2 dataset with a length of 50 frames each. The use of temporal consistency with LSTM yields state-of-the-art results on most metrics. The runtime performance of the methods (frames per second) was measured on an NVIDIA GeForce Titan X (Maxwell architecture).

The use of temporal consistency with LSTM yields state-of-the-art results in several metrics. Moreover, the version with temporal consistency outperforms the baseline without LSTM units on all metrics. This clearly shows the benefit of taking previous frames into account.

A qualitative comparison is shown in Fig. 3. Our results have sharper boundaries than the single-frame methods, and there is no flickering in the depth maps estimated over time, as can be seen in the supplemental video.

Fig. 3. Qualitative comparison of our LSTM-based network to the independent frame processing of Chakrabarti et al. [3] and Laina et al. [16]. The result of the last image in each 50-frame sequence is shown. The proposed architecture with multi-frame processing yields sharper edges and captures more details.

5.3 Temporal Consistency

We show that our LSTM network learns to predict temporally consistent depth maps. We validate this by comparing depth predictions of our LSTM-based architecture with state-of-the-art single-frame depth estimation networks. In Fig. 4 we show the depth trajectory of a point over a 50-frame sequence from the NYUv2 dataset. In Fig. 5 we further show a temporal consistency comparison based on the average depth change over all pixels.

Fig. 4. We track the depth of a single point over time. We use the KLT tracker [21, 27] to track the point in the image sequence and plot the depth over time. Our LSTM-based architecture is not only more accurate but also more temporally consistent and therefore suited for processing video streams.

Fig. 5. Comparison of the temporal consistency for the single-frame method of Liu et al. [17], our LSTM network, and the ground truth. The bars represent the average depth change of corresponding points between consecutive depth frames of a sequence of 10 frames. To compute point correspondences we use Farneback optical flow [7]. Since optical flow estimation introduces additional errors, we also show the ground truth results.

Temporal consistency is clearly advantageous in static scenes. In the case of dynamic scenes, temporal consistency, which effectively induces smoothing over time, could have negative effects. It is worth noting, though, that the proposed approach does not smooth the resulting depth map but the intermediate state representation, i.e., the network can learn to smooth along the point trajectories and to take motion boundaries and occlusion areas into account. To verify the performance of the recurrent architecture in dynamic scenes, we compare it to the single-frame baseline on the Princeton tracking benchmark, which comprises dynamic scenes. We evaluated on four sequences with 50 frames each.

The results in Table 3 show that there is still a small advantage for the LSTM-based architecture, yet it is smaller than in the static case. This indicates that the network cannot fully learn the effects mentioned above, but it at least alleviates most of the negative effects.

Table 3. Depth prediction on the dynamic scenes of the Princeton tracking benchmark. The results do not suffer from temporal consistency despite the motion of objects. On the contrary, the results with temporal consistency are even a little better.

Fig. 6. Dynamic scene from the Princeton tracking benchmark. The temporal consistency due to the LSTM helps to reconstruct the precise depth near the boundaries of a moving object.

Fig. 7. The result of RGB-D structure from motion [30] with depth from the neural networks of Eigen and Fergus [5], Laina et al. [16], and our recurrent network. For each reconstruction we use a sequence of 25 frames. We use Poisson surface reconstruction [14] to generate the meshes. Inconsistent depth estimates for Eigen and Laina lead to reconstruction artifacts in the surface mesh. The reconstructions from our depth predictions show fewer artifacts and more details.

A qualitative result is shown in Fig. 6. The single-frame baseline has problems with the shape of the moving object, while the recurrent network can exploit the additional information from previous frames. The effect is strongest when the object becomes occluded and is only partially visible in the current frame.

5.4 3D Reconstruction

We also compared the quality of the predicted depth maps in a full scene reconstruction context, where the depth maps were used as the depth channel in an RGB-D SLAM approach [30] to reconstruct a 3D scene from a video sequence. Figure 7 shows the 3D reconstructions.

The temporally consistent depth maps help improve the reconstructed 3D scene, since the variation of the same surface points over time is much reduced. Thus, the point cloud is less noisy, which leads to a better 3D reconstruction. Also, there are fewer severe misalignments in the scan, since the temporally consistent depth maps are easier to register for the SLAM method.

6 Conclusions

In this work, we have introduced the first depth estimation network that optimizes the temporal consistency of the estimated depth maps over multiple frames in a video. We have shown that the LSTM with leaky ReLU yields better results than the traditional convolutional LSTM with tanh activation. In this context, we have also shown the importance of layer normalization for the stability of the recurrent network. The experimental results with the proposed multi-frame processing consistently outperformed those with frame-independent processing in both static and dynamic scenes.