1 Introduction

Over the last decade, depth sensors have entered the mass market, improving substantially in package size, energy consumption and price. This has made depth data an interesting and important auxiliary input for computer vision tasks, for example in pose estimation [14, 35] or scene understanding [16]. However, current sensors are limited by physical and manufacturing constraints. Hence, depth outputs are affected by degradations due to noise, quantization and missing values, and typically have a low resolution.

To facilitate the use of depth data, recent methods focus on increasing the spatial resolution of the acquired depth maps. A common approach to this problem is to utilize a high-resolution intensity image as guidance [12, 25, 29]. These methods are motivated by the statistical co-occurrence of edges in intensity images and discontinuities in depth. In practical scenarios, however, a depth sensor is not always accompanied by an additional camera, and the depth map has to be projected onto the guidance image, which is itself problematic due to noisy depth measurements. Therefore, approaches that rely solely on the depth input for super-resolution are becoming popular [1, 13, 20].

In contrast to super-resolution methods for depth data, machine learning based methods for natural images [11, 33, 36, 37] are advancing rapidly and achieve impressive results on standard benchmarks. Those methods learn a mapping from a low-resolution input space to a plausible and visually pleasing high-resolution output space. Inference is performed independently for small, overlapping patches of the image, which are then averaged for the final output. This is not optimal for depth data, which is characterised by textureless, piece-wise affine regions separated by sharp depth discontinuities. In contrast, variational methods are especially suited for this task, because this prior information can be exploited in the model's regularization term. A prominent example is the total generalized variation (TGV) [3] that is, for example, utilized in [12].

In this work we propose a method that combines the advantages of data-driven methods and energy minimization models, coupling a deep convolutional network with a powerful variational model to compute an accurate high-resolution output from a single low-resolution depth map input. Deep networks have recently demonstrated impressive capabilities in single-image super-resolution [22]. We utilize a similar architecture for our network, but instead of only producing the refined depth map as output, we design the network to additionally predict the locations of the depth discontinuities in the high-resolution output space. Both outputs are then used as input to a variational model that refines the high-resolution estimate. The variational model uses an anisotropic TGV pairwise regularization that is weighted by the network output. To integrate the variational method into our network and learn the joint model end-to-end, we unroll all computation steps of the primal-dual optimization scheme [5] used for inference with layers of a deep network. Hence, we name our method ATGV-Net. Finally, we deal with the problem of obtaining accurate ground-truth data for training: the training of deep networks requires a large corpus of data, and we demonstrate that we can train our model entirely on synthetic depth data that we generate in large quantities, obtaining state-of-the-art results on four different benchmark datasets.

Our contributions can be summarized as follows: (i) We integrate a variational model with anisotropic TGV regularization into a deep network by unrolling the optimization steps of the primal-dual algorithm [5] and train the whole model end-to-end (see Sect. 3). (ii) We demonstrate that our joint model can be trained entirely on synthetic data for single depth map super-resolution (see Sect. 4.1). (iii) Finally, we show that our method improves upon state-of-the-art results on four different benchmark datasets (see Sects. 4.2, 4.3 and 4.4).

2 Related Work

Depth Super-Resolution. In general, work on super-resolution is roughly divided into approaches that use a series of aligned images to produce a high-resolution output, and single image super-resolution, i.e. approaches that use only one low-resolution image as input. In this discussion of related work we focus on the latter, as our method falls into this category.

Natural images often contain repetitive structures and therefore, a patch might be visible at different scales within the same image. Glasner et al. [15] exploit this observation in their seminal work. For each image patch they search for similar patches across various scales of the image and combine them into a high-resolution estimate. A similar idea is employed for depth data by Hornáček et al. [20], but instead of reasoning about 2D patches, they reason in terms of patches containing 3D points. The 3D points of a depth map patch can be translated and rotated with six degrees of freedom to find related patches within the same depth map. Aodha et al. [1] search for similar patches not within the same image, but in an ancillary database, and formulate a Markov Random Field (MRF) that enforces smooth transitions between the candidate high-resolution patches.

More recently, machine learning approaches have become popular for single image super-resolution. They achieve higher accuracy and are at the same time more efficient at test time, because they do not rely on a computationally intensive patch search. Sparse coding approaches [40, 42] learn dictionaries for the low- and high-resolution domains that are coupled via a common encoding. To increase the inference speed, Timofte et al. [36] replace the \(\ell _1\) norm in the sparse coding step with the \(\ell _2\) norm, which can be solved in closed form, and replace the single dictionary by many smaller sub-dictionaries to improve accuracy. In [33], Schulter et al. substitute the flat code-book of sparse coding methods with a random regression forest. A test patch traverses the trees of the forest and each leaf node stores regression coefficients to predict a high-resolution estimate. Deep learning based approaches have recently shown very good results for single image super-resolution, too. Dong et al. [11] train a convolutional network of three layers. The input to the network is the bilinearly upsampled low-resolution image and the network is trained with the Euclidean loss between the network output and the corresponding ground-truth high-resolution image. This idea was substantially improved by Kim et al. [22]. They train a deep network with up to 20 convolutional layers with filters of size \(3 \times 3\), thereby increasing the receptive field to \(41 \times 41\) pixels from the \(15 \times 15\) pixels of the network in [11]. Further, the network does not directly output the high-resolution estimate, but the residual with respect to the pre-processed input image, which aids the training of very deep networks [19].

These learning based methods have mainly been applied to color images, where a huge amount of training data can be easily obtained. In contrast, large datasets with dense, accurate depth maps have only very recently become available, e.g. [17]. Therefore, most methods for depth map super-resolution are not based on machine learning, but utilize a high-resolution intensity image as guidance. One of the first works in this direction is by Diebel and Thrun [9]. They apply an MRF to the upsampling task and weight their smoothness term according to the gradients of the guidance image. Yang et al. [41] propose an approach based on a bilateral filter that is iteratively applied to estimate a high-resolution output map. Park et al. [29] present a least-squares method that incorporates edge-aware weighting schemes in the regularization term of their formulation. A more recent approach by Ferstl et al. [12] utilizes a variational framework for image guided depth upsampling, where they also use the total generalized variation [3] as regularization term. One of the few machine learning based approaches for depth map super-resolution is by Kwon et al. [25]. They collect their own training data using KinectFusion [21] and employ sparse coding with an additional multi-scale approach and an advanced edge weighting term that emphasizes intensity edges corresponding to depth discontinuities. Ferstl et al. [13] use sparse coding with dictionaries trained on the 31 synthetic depth maps of [1] to predict the depth discontinuities in the high-resolution domain from the low-resolution depth data. Those edge estimates are then used in an anisotropic diffusion tensor of their regularization term.

Deep Network Integration of Energy Minimization Methods. Energy minimization methods, such as Markov Random Fields (MRFs) or variational methods, have a wide range of applications in computer vision. They consist of unary terms, for example the class likelihood of a pixel in semantic segmentation, or the depth value in depth super-resolution, and pairwise terms, which model the dependencies between neighbouring pixels. Recently, the integration of those models into deep networks has gained a lot of attention, as deep networks jointly trained with energy minimization methods achieve excellent results. For example, Tompson et al. [38] propose the joint training of a convolutional network and an MRF for human pose estimation. The MRF is realized by very large convolutional filters that model the pairwise interactions between joints and can be interpreted as one iteration of loopy belief propagation. In [8, 34] the authors show how to compute the derivative with respect to the mean field approximation [24] in MRFs. This allows end-to-end learning and improves results, for instance in semantic segmentation. Similarly, Zheng et al. [43] show that the computation steps of the mean field approximation can be modeled by operations of a convolutional network and unroll the iterations on top of their network.

While the latter approaches for semantic segmentation are designed for a discrete label space, the variational approach by Ranftl and Pock [31] has a continuous output space. They show that the gradient of a loss function can be back-propagated through the energy functional of a variational method by implicit differentiation, if the functional is sufficiently smooth. This approach has been extended to depth denoising and upsampling by Riegler et al. [32]. Recently, Ochs et al. [28] proposed a technique that allows back-propagation through non-smooth energy functionals using Bregman proximity functions [6], but did not demonstrate its use in combination with deep networks.

Our approach utilizes a variational method on top of a deep network, but instead of implicitly differentiating the energy functional as in [31, 32], we unroll every step of an exact optimization scheme [5], in the spirit of [10]. This has two major advantages: first, we can incorporate stronger pairwise regularization terms, and second, the optimization becomes more robust, allowing the successful training of deeper networks. This is similar to [43], but instead of the mean field approximation, we unroll the steps of the primal-dual algorithm by Chambolle and Pock [5], which converges to the globally optimal solution of the convex energy functional. For parametrizing the variational method we use a 10-layer deep network of \(3 \times 3\) convolutions, and train on the residual similarly to [22]. Additionally, we train the network to predict the depth discontinuities in the high-resolution output space. This output is used to weight the pairwise regularization term of the variational part. Finally, we demonstrate that we can train a deep network for this task by rendering synthetic depth maps in large quantities with a ray-caster running on the GPU.

3 ATGV-Net

In this section we describe our method, which takes a single low-resolution, possibly noisy depth map as input and computes a high-resolution output. We first introduce the notation used throughout this work, then detail our variational model and how we integrate it on top of a deep network, and finally describe the network itself.

In the remainder of this work we denote the low-resolution depth map input as \(s_k^\text {(lr)} \in \mathbb {R}^{M \times N}\). Further, for training we assume that we have for each input sample an accurate, high-resolution ground-truth depth map \(t_k\in \mathbb {R}^{\rho M \times \rho N}\), where \(\rho > 1\) is the given upsampling factor. The only preprocessing step in our method is a bilinear upsampling of the low-resolution input depth map \(s_k^\text {(lr)}\) to the size of the ground-truth target depth map. We denote this mid-level representation of the input as \(s_k\in \mathbb {R}^{\rho M \times \rho N}\).

Given a training set \(\left\{ (s_k, t_k) \right\} _{k=1}^{K}\) of \(K\) training pairs we follow [31, 32] and formulate the training task as the following bi-level optimization problem:

$$\begin{aligned} \min _{w} \;&\sum _{k=1}^{K} L(u^*(f(w, s_k)), t_k)&\text {(HL)}\\ \text {s.t.}\;&u^*(f(w, s_k)) = \mathop {\mathrm {arg\,min}}_u \, E(u; f(w, s_k))&\text {(LL)} \end{aligned}$$

This optimization problem has an intuitive interpretation: in the higher-level problem (HL) we want to find weights w such that the minimizer \(u^*\) of the energy functional E in the lower-level problem (LL), which is parameterized by a learnable function f, achieves a low loss L over all training samples. We provide more details on the energy functional and on the parametrization in Sects. 3.1 and 3.2, respectively. On the loss we only impose the restriction that its gradient with respect to \(u^*\) can be computed. For the remainder of this work we use the Euclidean loss:

$$\begin{aligned} L(u^*(f(w, s_k)), t_k) = \left||u^*(f(w, s_k)) - t_k\right||_2^2 . \end{aligned}$$
(1)

The authors of [31, 32] have shown that the bi-level optimization problem can be solved by implicit differentiation if certain assumptions on the energy functional E hold: E has to be strongly convex, twice differentiable with respect to u, and once differentiable with respect to f. Further, the gradient of f with respect to \(w\) has to be computable. The last constraint is satisfied by construction, since the parametrization f is realized by a deep network. The first constraints, however, drastically limit the choice of energy functionals, and therefore the authors of [31, 32] had to design smooth approximations. In the following we show that these constraints can be eliminated by unrolling the optimization steps of the lower-level problem (LL) on top of a deep network, similar to [43].

Fig. 1. Our model consists of a deep convolutional network with \(L = 10\) layers (blue rectangles) that predicts a first high-resolution depth map and the depth discontinuities. The output of the network is then fed to an unrolled primal-dual optimization algorithm (red rectangles), realized by operations of a deep network, that further refines the result. This enables us to train the joint model end-to-end. Best viewed magnified in the electronic version.

3.1 Unrolling the Optimization

For the energy functional we have the requirement that it should refine the initial high-resolution depth estimate. Therefore, we use a \({{\mathrm{TGV}}}_2\)-\(\ell _2\) variational model [3] that favors the piecewise affine surfaces apparent in depth maps. In addition, we incorporate an anisotropic diffusion tensor [30, 39] into the regularization and name our model ATGV-Net. Optimizing this energy functional in conjunction with a guidance intensity image already provides good results for depth super-resolution [13]. In the following we demonstrate how we can significantly improve the model by parametrizing the energy functional with a deep network and learning it end-to-end by unrolling the optimization procedure.

In general, our energy functional consists of a pairwise regularization term R and an \(\ell _2\) data term:

$$\begin{aligned} E(u; f(w, s_k)) = R(u, h(w_h, s_k)) + \frac{e^{w_\lambda }}{2} \left||u - g(w_g, s_k)\right||_2^2. \end{aligned}$$
(2)

The functional is parameterized by a function \(f(w, s_k) = [h(w_h, s_k), w_\lambda , g(w_g, s_k)]^T\) that has learnable weights \(w\) and takes the mid-resolution depth map \(s_k\) as input. The functions h and g are realized as a single deep network and described in Sect. 3.2. The parameter \(w_\lambda \) controls the trade-off between data and regularization term and is also learned. We take the exponential of \(w_\lambda \) to ensure convexity of the energy functional. For the pairwise regularization term we utilize the total generalized variation (TGV) [3] of second order that favors piecewise affine solutions and is therefore ideal for depth maps:

$$\begin{aligned} R(u, h(w_h, s_k)) = \min _v \alpha _1 \left||T(h(w_h, s_k)) (\nabla _u u - v)\right||_1 + \alpha _0 \left||\nabla _v v\right||_1, \end{aligned}$$
(3)

where \(\alpha _0\) and \(\alpha _1\) are user defined parameters. In the regularization term, an anisotropic diffusion tensor T enforces a low degree of smoothness across depth discontinuities and, conversely, a high degree of smoothness in homogeneous regions. This anisotropic diffusion tensor is based on the Nagel-Enkelmann operator [27]:

$$\begin{aligned} T(h(w_h, s_k)) = \exp (-\beta \left||h(w_h, s_k)\right||_2^\gamma ) n n^T + n_\perp n_\perp ^T, \end{aligned}$$
(4)

with \(\beta \) and \(\gamma \) being adjustable parameters weighting the magnitude and sharpness of the tensor. The unit normal n, derived from the gradient estimate h, is given by

$$\begin{aligned} n = \frac{h(w_h, s_k)}{\left||h(w_h, s_k)\right||_2} \,,\quad n_\perp \cdot n = 0. \end{aligned}$$
(5)
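To make the action of the tensor concrete, the following is a small NumPy sketch (our illustration, not the authors' implementation) that applies \(T(h)\) to a per-pixel 2-vector field z via the identity \(Tz = e^{-\beta \left||h\right||_2^\gamma } (n^T z)\, n + (n_\perp ^T z)\, n_\perp \); the epsilon guarding the normalization, the choice of \(n_\perp \) as the 90-degree rotation of n, and the default parameter values (taken from Sect. 4) are assumptions of the sketch.

```python
import numpy as np

def apply_diffusion_tensor(h, z, beta=9.0, gamma=0.85, eps=1e-8):
    """Apply T(h) = exp(-beta * ||h||^gamma) n n^T + n_perp n_perp^T
    to a per-pixel 2-vector field z. h, z: arrays of shape (2, H, W)."""
    mag = np.sqrt(h[0] ** 2 + h[1] ** 2)          # ||h||_2 per pixel
    n = h / (mag + eps)                           # unit normal, Eq. (5)
    n_perp = np.stack([-n[1], n[0]])              # 90-degree rotation of n
    w = np.exp(-beta * mag ** gamma)              # edge-dependent weight
    # project z onto n and n_perp, down-weight only the normal component
    z_n = n[0] * z[0] + n[1] * z[1]
    z_p = n_perp[0] * z[0] + n_perp[1] * z[1]
    return w * z_n * n + z_p * n_perp
```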

To optimize this energy functional we choose the first-order primal-dual algorithm by Chambolle and Pock [5], as it guarantees fast convergence. To apply the optimization algorithm, we first reformulate Eq. (2) as a saddle-point problem with dual variables p and q:

$$\begin{aligned}&\min _{u, v} \max _{p, q} \alpha _1 \left\langle T(h(w_h, s_k)) (\nabla _u u - v), p \right\rangle + \alpha _0 \left\langle \nabla _v v, q \right\rangle + \frac{e^{w_\lambda }}{2} \left||u - g(w_g, s_k)\right||_2^2 \end{aligned}$$
(6)
$$\begin{aligned}&\text {s.t.}\, p \in \left\{ p \in \mathbb {R}^{2 \times \rho M \times \rho N} \mid \left||p_{:, i, j}\right||_2 \le 1 \right\} , q \in \left\{ q \in \mathbb {R}^{4 \times \rho M \times \rho N} \mid \left||q_{:, i, j}\right||_2 \le 1 \right\} , \end{aligned}$$
(7)

where \(\nabla _u\) and \(\nabla _v\) denote operators in the discrete setting that compute the forward differences of u and v. A single iteration of the optimization procedure to obtain \(u^*\) is then given by:

$$\begin{aligned} p^{n+1}&= {{\mathrm{proj}}}(p^n + \sigma _p \alpha _1 (T(h(w_h, s_k)) (\nabla _u \bar{u}^n - \bar{v}^n))) \end{aligned}$$
(8)
$$\begin{aligned} q^{n+1}&= {{\mathrm{proj}}}(q^n + \sigma _q \alpha _0 \nabla _v \bar{v}^n) \end{aligned}$$
(9)
$$\begin{aligned} u^{n+1}&= \frac{u^n + \tau _u (\alpha _1 \nabla _u^T T(h(w_h, s_k)) p^{n+1} + e^{w_\lambda } g(w_g, s_k))}{1 + \tau _u e^{w_\lambda }} \end{aligned}$$
(10)
$$\begin{aligned} v^{n+1}&= v^n + \tau _v (\alpha _0 \nabla _v^T q^{n+1} + \alpha _1 T(h(w_h, s_k)) p^{n+1}) \end{aligned}$$
(11)
$$\begin{aligned} \bar{u}^{n+1}&= u^{n+1} + \theta (u^{n+1} - {u}^n) \end{aligned}$$
(12)
$$\begin{aligned} \bar{v}^{n+1}&= v^{n+1} + \theta (v^{n+1} - {v}^n) \,, \end{aligned}$$
(13)

with \(u^0 = g(w_g, s_k)\), \(v^0 = p^0 = q^0 = 0\), step sizes \(\sigma _p, \sigma _q, \tau _u, \tau _v > 0\), \(\theta \in [0, 1]\), and where \({{\mathrm{proj}}}(p) = \tfrac{p}{\max (1, \left||p\right||_2)}\) is the point-wise reprojection onto the unit ball.

The key observations are: (i) The single computation steps in this optimization algorithm can be realized by operations of a deep network, i.e. individual network layers, and (ii) given a fixed number of iterations, the algorithm can be unrolled like a recurrent neural network, similar to [43]. This allows us to use the back-propagation algorithm to train the optimization procedure, i.e. all hyper-parameters, jointly with the parametrization, i.e. the deep network. See Fig. 1 for a visualization of the concept. In the following we detail how the individual computation steps are realised within our model. We provide a graphical representation of a single iteration of the optimization procedure in terms of deep network operations in the supplemental material.

Dual Update. The gradient ascent of the dual variables in Eqs. (8) and (9) consists of scalar multiplication, point-wise addition and multiplication, the gradient operators \(\nabla _u, \nabla _v\), and the projection. The scalar multiplication and the point-wise operations are trivial operations and are implemented in most deep learning frameworks. The \(\nabla \)-operator is basically a convolution with two filters, \(\nabla _x = [-1, 1]\) and \(\nabla _y = [-1, 1]^T\). Therefore, it can be implemented with a standard convolutional layer that has fixed filter coefficients. Additionally, we have to ensure a reflecting padding of the layer input, i.e. Neumann boundary conditions. Finally, the \({{\mathrm{proj}}}\)-operator is a composition of a point-wise division, a \(\max \)-operator and the \(\ell _2\) norm. We implemented the \(\max \)-operator as shifted \({{\mathrm{ReLU}}}\), and the \(\ell _2\) norm as custom layer.
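As an illustration, these two building blocks could be realized with the following PyTorch-style sketch (our own, not the authors' code); we realize the Neumann boundary condition with replicate padding, and the tensor shapes and fixed-filter convolutions are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

def nabla(u):
    """Forward differences via convolutions with fixed filters.
    u: (N, 1, H, W) -> (N, 2, H, W) with channels (d/dx, d/dy)."""
    kx = u.new_tensor([[[[-1., 1.]]]])        # nabla_x = [-1, 1]
    ky = u.new_tensor([[[[-1.], [1.]]]])      # nabla_y = [-1, 1]^T
    # replicate padding on the right/bottom so the difference vanishes at the
    # boundary (Neumann boundary condition)
    dx = F.conv2d(F.pad(u, (0, 1, 0, 0), mode='replicate'), kx)
    dy = F.conv2d(F.pad(u, (0, 0, 0, 1), mode='replicate'), ky)
    return torch.cat([dx, dy], dim=1)

def proj(p):
    """Point-wise reprojection p / max(1, ||p||_2), norm over the channel dim."""
    norm = torch.sqrt((p ** 2).sum(dim=1, keepdim=True))
    return p / torch.clamp(norm, min=1.0)
```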

Primal Update. The gradient descent of the primal variables in Eqs. (10) and (11) consists of similar operations as the dual update, and therefore, can be implemented with the same building blocks. Additional operators are \(\nabla _u^T, \nabla _v^T\). These operators are defined as \(\nabla ^T p = \nabla _x p_x + \nabla _y p_y\). From this definition we can see that this operation can again be implemented with a convolutional layer that has fixed filter coefficients. However, we have to ensure a negative symmetric padding of the layer input, i.e. Dirichlet boundary conditions.
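A corresponding sketch of the \(\nabla ^T\)-operator, reusing the imports and filters from the previous sketch, could look as follows; here plain zero padding stands in for the boundary handling (the negative symmetric padding described above would require a custom padding layer), which is an assumption of this illustration:

```python
def nabla_T(p):
    """Transposed operator nabla^T p = nabla_x p_x + nabla_y p_y.
    p: (N, 2, H, W) -> (N, 1, H, W)."""
    kx = p.new_tensor([[[[-1., 1.]]]])
    ky = p.new_tensor([[[[-1.], [1.]]]])
    # zero padding on the left/top as a simple stand-in for the Dirichlet
    # boundary condition of the transposed operator
    px = F.pad(p[:, 0:1], (1, 0, 0, 0))
    py = F.pad(p[:, 1:2], (0, 0, 1, 0))
    return F.conv2d(px, kx) + F.conv2d(py, ky)
```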

Over-Relaxation. The over-relaxation step of the primal variables in Eqs. (12) and (13) can be simplified to a weighted sum of two terms, i.e. \(\bar{u} = (1 + \theta ) u^{n+1} - \theta u^n\) and \(\bar{v} = (1 + \theta ) v^{n+1} - \theta v^n\).
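Putting the building blocks together, one unrolled pass of Eqs. (8)-(13) can be sketched as below. This reuses the nabla, nabla_T and proj helpers from above and a torch analogue of the earlier diffusion-tensor sketch; the step sizes, the parameter defaults (taken from Sect. 4) and the fixed iteration count are assumptions rather than the authors' exact settings:

```python
def apply_T(h, z, beta=9.0, gamma=0.85, eps=1e-8):
    """Torch analogue of the diffusion tensor application, Eq. (4)."""
    mag = torch.sqrt((h ** 2).sum(dim=1, keepdim=True))
    n = h / (mag + eps)
    n_perp = torch.cat([-n[:, 1:2], n[:, 0:1]], dim=1)
    z_n = (n * z).sum(dim=1, keepdim=True)
    z_p = (n_perp * z).sum(dim=1, keepdim=True)
    return torch.exp(-beta * mag ** gamma) * z_n * n + z_p * n_perp

def nabla_v(v):   # forward differences per channel of v: (N,2,H,W) -> (N,4,H,W)
    return torch.cat([nabla(v[:, 0:1]), nabla(v[:, 1:2])], dim=1)

def nabla_v_T(q):  # corresponding transposed operator: (N,4,H,W) -> (N,2,H,W)
    return torch.cat([nabla_T(q[:, 0:2]), nabla_T(q[:, 2:4])], dim=1)

def atgv_refine(g, h, w_lambda, n_iters=10, alpha0=1.2, alpha1=17.0,
                sigma_p=0.1, sigma_q=0.1, tau_u=0.1, tau_v=0.1, theta=1.0):
    """Unrolled primal-dual refinement; every operation is differentiable, so
    gradients flow back to the network outputs g, h and to w_lambda."""
    lam = torch.exp(w_lambda)
    u = u_bar = g                                  # u^0 = g(w_g, s_k)
    v = v_bar = torch.zeros_like(nabla(g))         # v^0 = 0
    p = torch.zeros_like(v)                        # p^0 = 0
    q = torch.zeros_like(nabla_v(v))               # q^0 = 0
    for _ in range(n_iters):
        # dual ascent, Eqs. (8)-(9)
        p = proj(p + sigma_p * alpha1 * apply_T(h, nabla(u_bar) - v_bar))
        q = proj(q + sigma_q * alpha0 * nabla_v(v_bar))
        # primal descent, Eqs. (10)-(11)
        u_new = (u + tau_u * (alpha1 * nabla_T(apply_T(h, p)) + lam * g)) \
                / (1 + tau_u * lam)
        v_new = v + tau_v * (alpha0 * nabla_v_T(q) + alpha1 * apply_T(h, p))
        # over-relaxation, Eqs. (12)-(13)
        u_bar = u_new + theta * (u_new - u)
        v_bar = v_new + theta * (v_new - v)
        u, v = u_new, v_new
    return u
```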

Fig. 2. Overview of our deep network architecture. Our network consists of 10 convolutional layers with \(3 \times 3\) filters and 64 feature maps in the hidden layers (blue rectangles). The input to the network (green rectangle) is the mid-resolution depth map and the output is (i) the residual that, added to the mid-resolution input, yields the high-resolution estimate \(g(w_g, s_k)\), and (ii) the estimate of the depth discontinuities in the high-resolution output, \(h(w_h, s_k)\) (red rectangles). Best viewed magnified in the electronic version.

3.2 Parametrization

Having described the variational model and how to integrate it on top of a deep network, we now detail the parametrization functions \(h(w_h, s_k)\) and \(g(w_g, s_k)\). Inspired by the recent success in single image super-resolution for color images [22], we implement \(g(w_g, s_k)\) as a deep convolutional neural network with 10 convolutional layers. Each convolutional filter has a size of \(3 \times 3\) and each hidden layer of the network has 64 feature maps. As \(g(w_g, s_k)\) is used in the data term of our energy functional, it should provide a good initial estimate of the high-resolution depth map. However, the output of this network is not the high-resolution depth map itself, but the residual \(g_r(w_g, s_k)\), such that \(g(w_g, s_k) = g_r(w_g, s_k) + s_k\). Learning the residual instead of the full output aids the training procedure of the network [22], and has been applied before in other super-resolution methods [33, 36, 37].

The parameterization function \(h(w_h, s_k)\) is used for weighting the pairwise regularization term. As argued before, the regularization should be weak near depth discontinuities and strong in smooth areas. Therefore, we implement \(h(w_h, s_k)\) as an additional network output of size \(2 \times \rho M \times \rho N\) and train it to estimate the gradient of the high-resolution target, \(\nabla t_k\). This has two benefits: first, we obtain more accurate estimates of the depth discontinuities than we would get from the gradient of the high-resolution estimate \(g(w_g, s_k)\). Second, the joint training of both objectives in a single deep network improves the performance of both tasks, because the weights \(w_h\) and \(w_g\) share the majority of parameters and only the parameters of the last layer, the output, differ. A graphical depiction of our deep network parametrization is shown in Fig. 2.
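A compact PyTorch sketch of this parametrization, consistent with Fig. 2 but not the authors' released code, could look as follows; zero-padded \(3 \times 3\) convolutions, ReLU activations after every hidden layer, and a single shared output layer with three channels (one for the depth residual, two for the gradient estimate) are assumptions of the sketch:

```python
import torch.nn as nn

class ParamNet(nn.Module):
    """10 conv layers, 3x3 filters, 64 feature maps in the hidden layers."""
    def __init__(self, depth=10, features=64):
        super().__init__()
        layers = [nn.Conv2d(1, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, 3, 3, padding=1))  # shared output layer
        self.body = nn.Sequential(*layers)

    def forward(self, s):
        out = self.body(s)
        g = s + out[:, 0:1]    # residual learning: g(w_g, s) = s + g_r(w_g, s)
        h = out[:, 1:3]        # estimate of the target gradient, nabla t_k
        return g, h
```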

3.3 Training

In the previous sections we described our model. In this section we detail how we train it, given a large set of training samples \(\left\{ (s_k, t_k) \right\} _{k=1}^K\). The training procedure is two-fold: in a first step we initialize the deep convolutional network, i.e. the functions g and h. To this end, we train the network by mini-batch gradient descent with momentum on the following loss function:

$$\begin{aligned} L_p(\left\{ (s_k, t_k) \right\} _{k=1}^K) = \frac{1}{K} \sum _{k=1}^K\left||g(w_g, s_k) - t_k\right||_2^2 + \left||h(w_h, s_k) - \nabla t_k\right||_2^2. \end{aligned}$$
(14)

In the following evaluations we set the learning rate to 0.001 and the momentum parameter to 0.9 for the initialization of the network. With this setting we train the network for 30 epochs on non-overlapping patches of size \(32 \times 32\) pixels.
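A sketch of this initialization loss, our reading of Eq. (14) using the nabla helper from Sect. 3.1 for the target gradients, is given below; the per-batch reduction is an assumption:

```python
import torch
import torch.nn.functional as F

def init_loss(net, s, t):
    """L_p of Eq. (14) on a mini-batch: Euclidean loss on the depth estimate
    plus Euclidean loss on the predicted target gradient."""
    g, h = net(s)
    return F.mse_loss(g, t, reduction='sum') / s.size(0) \
         + F.mse_loss(h, nabla(t), reduction='sum') / s.size(0)

# mini-batch SGD with momentum, settings as reported above:
# optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
```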

In the second step of the training procedure we add the unrolled primal-dual optimization algorithm introduced in Sect. 3.1 on top of the network. Then, we train the joint model end-to-end on the Euclidean loss stated in Eq. (1) with mini-batch gradient descent. We set the learning rate to 0.001 and the momentum parameter to 0.9 and train the whole model for 5 epochs on non-overlapping patches of size \(128 \times 128\) pixels. In contrast to the method of implicit differentiation [31, 32], our method remains robust with a high learning rate and, as a consequence, converges in fewer training iterations. Further, it enables us to optimize the parameter \(w_\lambda \), as well as all hyper-parameters of the optimization procedure.

4 Evaluation

In this section we present an exhaustive experimental evaluation of the proposed ATGV-Net. First, we show how we generate a huge amount of training data with accurate ground-truth needed to train the deep network. Then, we demonstrate evaluation results on four standard benchmark datasets for depth map super-resolution: Following [1, 13, 20], we evaluate our method on the noise-free Middlebury disparity maps Teddy, Cones, Tsukuba and Venus. Additionally, we show results for the Laserscan dataset as proposed in [1]. In a second evaluation we compare our results on the noisy Middlebury 2007 dataset as proposed in [29] and finally, we demonstrate the real-world applicability of our method on the challenging ToFMark dataset [12].

We set the initial parameters of our model to \(\alpha _1=17\), \(\alpha _0=1.2\) for the regularization term, \(\beta =9\), \(\gamma =0.85\) for the anisotropic diffusion tensor, and \(w_\lambda =0.01\) for all experiments. Further, we fix the number of iterations of the primal-dual algorithm to 10.

Fig. 3. Examples of our generated depth maps. (a) visualizes the high-resolution ground-truth data. By resampling those depth maps with a scale factor \(\rho \) and adding depth-dependent noise we create the low-resolution input. (b) shows the mid-resolution input, which is the bilinearly upsampled low-resolution data. Best viewed magnified in the electronic version.

4.1 Training Data

One challenge in training very deep networks is the need for a huge amount of training data. In [1, 13] the authors use a small set of only 31 synthetically rendered depth maps for training, and in [32] the authors train and test their method on the synthetic New Tsukuba dataset [26]. Only very recently have larger datasets with accurate depth maps been released [17], or been added to existing benchmarks [4]. In our method we also make use of synthetically rendered data, but produce it in much larger quantities.

For this purpose we implemented a ray-caster [2] that runs on the GPU and enables us to generate thousands of high-quality synthetic depth maps in a few minutes. For each image we randomly place between 24 and 42 rectangular cuboids and up to 3 spheres in a predefined volume. Further, we randomly scale and rotate each solid to obtain a virtually unlimited number of possible constellations. Then, we place a virtual camera at the origin of the coordinate system and cast a ray for each pixel of the camera image. For each ray we compute the distance between the image plane and the closest surface it hits, or, if it does not hit any surface, we return a maximum distance value for the background. In Fig. 3 we illustrate two random examples of the more than 40,000 depth maps that we have generated with this method.
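To illustrate the principle of such a renderer, the following NumPy sketch casts one ray per pixel of a pinhole camera at the origin and intersects it with a single sphere; it is a toy example of the kind of scene described above, and the camera parameters, background value and function names are our own assumptions, not details of the actual GPU ray-caster:

```python
import numpy as np

def sphere_depth(width, height, focal, center, radius, background=10.0):
    """Per-pixel depth of a single sphere seen by a pinhole camera at the
    origin looking along +z; pixels missing the sphere get the background."""
    xs = (np.arange(width) - width / 2.0 + 0.5) / focal
    ys = (np.arange(height) - height / 2.0 + 0.5) / focal
    x, y = np.meshgrid(xs, ys)
    d = np.stack([x, y, np.ones_like(x)], axis=-1)      # ray directions
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    c = np.asarray(center, dtype=np.float64)
    # solve ||t d - c||^2 = r^2 for the smallest positive t
    b = d @ c                                           # d . c per pixel
    disc = b ** 2 - (c @ c - radius ** 2)
    t = b - np.sqrt(np.maximum(disc, 0.0))
    hit = (disc >= 0.0) & (t > 0.0)
    depth = np.full((height, width), background)
    depth[hit] = (t * d[..., 2])[hit]                   # z-depth of the hit point
    return depth

# e.g. depth = sphere_depth(320, 240, focal=300.0, center=(0.0, 0.0, 5.0), radius=1.0)
```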

Given these generated depth maps as noise-free ground-truth, we create the low-resolution depth maps \(s_k^\text {(lr)} =\, \downarrow _\rho t_k\) for the network training by resampling the generated ground-truth depth maps \(t_k\) by the scale factor \(\rho \) used in the evaluation. Depending on the dataset, we additionally add depth-dependent noise \(\eta (s_k^\text {(lr)})\) to the low-resolution depth map. Finally, we upsample these low-resolution, possibly noisy depth maps with bilinear interpolation to obtain our mid-level representation \(s_k=\, \uparrow _\rho \!(s_k^\text {(lr)} + \eta (s_k^\text {(lr)}))\).
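This degradation pipeline can be summarized with the following PyTorch sketch; the use of bilinear resampling for the downsampling step and the tensor layout are assumptions, and the depth-dependent noise function is plugged in as described in Sect. 4.3:

```python
import torch.nn.functional as F

def make_training_pair(t, rho, noise=None):
    """Create the mid-resolution input s from a clean ground-truth depth map t
    of shape (B, 1, rho*M, rho*N): downsample, optionally add noise, upsample."""
    h_lr, w_lr = t.shape[-2] // rho, t.shape[-1] // rho
    s_lr = F.interpolate(t, size=(h_lr, w_lr), mode='bilinear',
                         align_corners=False)          # resample by factor rho
    if noise is not None:
        s_lr = s_lr + noise(s_lr)                      # eta(s_lr)
    s = F.interpolate(s_lr, size=tuple(t.shape[-2:]), mode='bilinear',
                      align_corners=False)             # bilinear upsampling
    return s, t
```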

Table 1. Results on the noise-free Middlebury and Laserscan data. We report the error as root mean squared error (RMSE) in pixel disparity for the Middlebury data and in mm for the Laserscan data, respectively. We highlight the best result in boldface and the second best in italic.

4.2 Clean Middlebury and Laserscan

In this first experiment we evaluate the performance of our proposed method on the images Teddy, Cones, Tsukuba and Venus of the Middlebury dataset, as in [1, 13, 20]. The disparity is interpreted as depth and we test upsampling factors of \(\times 2\) and \(\times 4\). Additionally, we evaluate on the Laserscan dataset images Scan21, Scan30 and Scan42 with an upsampling factor of \(\times 4\), as in [1, 13]. We compare our results to simple upsampling methods, such as nearest neighbor and bicubic upsampling, as well as to state-of-the-art depth upsampling methods that rely on an additional guidance image as input [9, 12]. Further, we show the results of recent sparse coding based approaches for single image super-resolution [36, 42], two approaches based on a Markov Random Field [1, 20] and a recent variational approach that uses sparse coding to estimate edge priors [13]. To demonstrate the effect of our variational model on top of the deep network, we show the results of the network alone (CNN only), the results with the variational model added on top but without joint training (CNN + ATGV-L2), and the results after end-to-end training (ATGV-Net).

The results in terms of the root mean squared error (RMSE) are summarized in Table 1. We can clearly see that the deep network alone already achieves a significant performance improvement over the other methods on both datasets and both upsampling factors. Interestingly, we obtain even better results than the methods [9, 12] that utilize an additional guidance image for the upsampling. This is especially pronounced in test samples with structures that are well simulated in the training data, such as Venus. Further, the variational model on top of the network slightly increases the performance, and training the whole model end-to-end gives the overall best results. One exception is the Tsukuba sample, where the results get slightly worse after end-to-end training. An explanation might be that fine, elongated structures, e.g. near the lamp in Tsukuba, are not well represented in the training data. In the qualitative results, see Fig. 4, we can further observe that the deep network with 10 layers already achieves very good results with sharper depth discontinuities compared to other methods. However, the improvement of the variational model on top of the deep network is hardly visible. This becomes more apparent in the next experiment.

Fig. 4. Qualitative results for the noise-free Middlebury image Tsukuba, \(\rho = 4\). (a) depicts the ground-truth and the input data. (b) and (c) show the results of state-of-the-art methods. (d) and (e) present the results of the deep network only and our proposed model trained end-to-end. Best viewed magnified in the electronic version.

4.3 Noisy Middlebury

Table 2. Results on noisy Middlebury data. We report the error as RMSE in pixel disparity and highlight the best result in boldface and the second best in italic.

In this experiment we evaluate our method on the Middlebury disparity maps Art, Books and Moebius with added depth-dependent Gaussian noise that simulates the acquisition process of a Time-of-Flight sensor, as proposed by Park et al. [29]. Accordingly, we add depth-dependent Gaussian noise of the form \(\eta (x) = \mathcal {N}(0, \sigma s_k^\text {(lr)}(x)^{-1})\), with \(\sigma = 651\), to our low-resolution synthetic training data \(s_k^\text {(lr)}\). Exemplar training images are depicted in Fig. 3. We report quantitative results in Table 2 and visualize qualitative results in Fig. 5.
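A possible implementation of this noise model, reading the second argument of the normal distribution as its variance (an assumption about the notation), fits directly into the training-pair sketch of Sect. 4.1:

```python
import torch

def tof_noise(s_lr, sigma=651.0, eps=1e-3):
    """Depth-dependent Gaussian noise eta(x) with variance sigma / s_lr(x)."""
    std = torch.sqrt(sigma / s_lr.clamp(min=eps))
    return torch.randn_like(s_lr) * std

# e.g. s, t = make_training_pair(t, rho=4, noise=tof_noise)
```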

We again compare our method to simple upsampling methods, such as nearest neighbor and bilinear interpolation, as well as to approaches that utilize an additional intensity image as guidance. Those methods include the Markov Random Field based approach in [9], the bilateral filtering with cost volume in [41], the guided image filter in [18], the noise-aware bilateral filter in [7], the non-local means filter in [29] and the variational model in [12]. To evaluate the influence of the variational model on top of the deep network, we report the results of the network alone (CNN only), the results with the variational model on top of the network but without joint training (CNN + ATGV-L2), and the results after end-to-end training (ATGV-Net).

From the quantitative results in Table 2 we observe that the CNN alone already performs better than state-of-the-art methods that utilize an additional guidance input for most images and upsampling factors. Further, the variational model on top of the deep network slightly improves the results, but end-to-end training of the whole model yields a significant improvement. This improvement of ATGV-Net over the network alone is also apparent in the qualitative results (Fig. 5). We observe less noise in homogeneous areas, especially in the background, in the ATGV-Net estimates compared to the CNN only estimates. The results of [12] also look very sharp, but contain errors near depth discontinuities and between fine structures. In contrast, our method preserves those finer structures. We refer to the supplemental material for additional qualitative results, as well as quantitative results in terms of the mean absolute error (MAE).

4.4 ToFMark

In our final experiment we evaluate our method on the challenging ToFMark dataset [12]. This dataset consists of time-of-flight (ToF) depth maps of three different scenes. For each scene an accurate high-resolution structured-light scan is available as ground-truth. The ToF depth maps have a resolution of \(120 \times 160\) pixels and the target resolution, given by the guidance intensity image (which we do not use in our method), is \(610 \times 810\) pixels. This corresponds to an upsampling factor of approximately \(\rho = 5\). As the target high-resolution depth map is given in the camera coordinate system of the structured-light scanner, we prepare our training data accordingly. We project our high-resolution synthetic training depth maps into the ToF coordinate system using the provided projection matrix. We add depth-dependent noise to the low-resolution depth maps and back-project the remaining points to the target camera coordinate system. This yields a very sparse depth map that we subsequently inpaint with bilinear interpolation to obtain our final mid-resolution training inputs.

We compare our results to simple nearest neighbour and bilinear interpolation, and to three state-of-the-art depth map super-resolution methods that utilize an additional guidance image as input. The quantitative results are shown in Table 3 as RMSE in mm. Please see the supplemental material for qualitative results. Even on this difficult dataset we are at least on par with state-of-the-art methods that utilize an additional intensity image as guidance input.

Fig. 5. Qualitative results for the noisy Middlebury image Moebius, \(\rho = 4\). (a) depicts the ground-truth and the input data. (b) and (c) show the results of state-of-the-art methods. (d) and (e) present the results of the deep network only and our proposed model trained end-to-end. Best viewed magnified in the electronic version.

Table 3. Results on real Time-of-Flight data from the ToFMark benchmark dataset. We report the error as RMSE in mm and highlight the best result in boldface and the second best in italic.

5 Conclusion

We presented a combination of a deep convolutional network with a variational model for single depth map super-resolution. We designed the convolutional network to compute the high-resolution depth map, as well as the depth discontinuities. The network output is utilized in our variational model to further refine the result. By unrolling the optimization procedure of the variational model, we were able to optimize the joint model end-to-end, which led to improved accuracy. Further, we demonstrated the feasibility of training our method on a massive amount of synthetically generated depth data, obtaining state-of-the-art results on four different benchmarks. Our model is especially useful if the low-resolution depth map contains noise, which is the case for most consumer depth sensors. In future work we plan to extend our model to depth data that contain larger areas of missing pixels, e.g. from structured-light sensors. This is straightforward by setting \(w_\lambda = 0\) in areas where depth measurements are missing.