1 Introduction

The use of depth information of a scene is essential in many applications such as autonomous navigation, 3D reconstruction, human-computer interaction and virtual reality. The introduction of low-cost depth cameras facilitates the use of depth information in our daily life. However, the resolution of the depth maps provided by a low-cost depth camera is generally very limited. To facilitate the use of depth data, we often need to address an upsampling problem in which the corresponding high-resolution (HR) depth map is recovered from a given low-resolution (LR) depth map.

Depth map super-resolution is a non-trivial task. Specifically, fine structures in the HR image are either lost or severely distorted (depending on the scale factor used) in the LR image because they cannot be fully represented by the limited spatial resolution. A brute-force upsampling of the LR image simply causes structures that are supposed to have sharp boundaries to become blurred in the upsampled image. Ambiguity in super-resolving the severely distorted fine structures often exists, especially for the case of single-image upsampling. Figure 1(c–d) demonstrates the upsampling ambiguity problem.

To address the aforementioned problem, a corresponding intensity image is often used to guide the upsampling process [1–7] or enhance the low-quality depth maps [8–10]. This is due to the fact that a correspondence between an intensity edge and a depth edge can most likely be established. Since the intensity image is at a higher resolution, its intensity discontinuities can be used to locate the associated depth discontinuities at a higher resolution. Although there could be exceptions in which an intensity edge does not correspond to a depth edge or vice versa, this correspondence assumption has been used widely in the literature.

Fig. 1. Ambiguity in upsampling depth map. (a) Color image. (b) Ground truth. (c) (Enlarged) LR depth map downsampled by a factor of 8. Results for upsampling: (d) SRCNN [11], (e) Our solution without ambiguity problem.

Fig. 2. Over-texture transfer in depth map refinement and upsampling using intensity guidance. (a) Color image. (b) Ground truth. (c) Refinement of (b) using (a) by Guided Filtering [8] (\(r = 4, \epsilon = 0.01^{2}\)). Results of using (a) to guide the \(2\times \) upsampling of (b): (d) Ferstl et al. [4], (e) Our solution.

One would also encounter issues in exploiting the intensity guidance. Specifically, suppose we have a perfectly registered pair of depth map D and intensity image Y possessing the same resolution. It is not straightforward to use Y to guide the refinement of D or the upsampling of LR D. The variation of depth structures in D may not be consistent with that of the intensity structures in Y as they are different in nature. Using image-guided filtering, features in intensity images are often over-transferred to the depth image at the boundaries between textured and homogeneous regions. Figure 2(c–d) illustrates two examples of the over-texture transfer problem. Our proposed method, which complements D with only consistent structures from Y, can avoid this problem (Fig. 2(e)).

In this paper, we present a novel end-to-end upsampling network, a Multi-Scale Guided convolutional network (MSG-Net), which learns HR features in the intensity branch and complements the LR depth structures in the depth branch to overcome the aforementioned problems. MSG-Net is appealing in that it allows the network to learn rich hierarchical features at different levels. This in turn enables the network to adapt better to the upsampling of both fine- and large-scale structures. At each level, the upsampling of LR depth features is closely guided by the associated HR intensity features possessing the same resolution. The integrated multi-scale guidance progressively resolves ambiguity in depth map upsampling. We further present a high-frequency training approach to reduce training time and facilitate the fusion of depth and intensity features. Note that, unlike existing super-resolution networks [11, 12] that require pre-upsampling of the input image by a conventional method such as bicubic interpolation outside the network, our approach learns upsampling kernels inside the network to fully explore the upsampling ability of a CNN. We show that such a multi-scale upsampling method is a more effective way to upscale LR images, while being capable of exploiting the guidance from HR intensity features seamlessly.

Contributions: (1) We propose a new framework to address the problem of depth map upsampling by complementing a LR depth map with the corresponding HR intensity image using a convolutional neural network in a multi-scale guidance architecture (MSG-Net). To the best of our knowledge, no prior study has proposed this idea for CNNs. (2) With the introduction of the multi-scale upsampling architecture, our compact single-image upsampling network (MS-Net), in which no guidance from the HR intensity image is present, already outperforms most of the state-of-the-art methods requiring guidance from the HR intensity image. (3) We discuss detailed steps to enable both MSG-Net and MS-Net to perform image-wise upsampling and end-to-end training.

2 Related Work

There is a variety of methods to perform image super resolution in the literature. Here, we categorize them into four groups:

Local methods are based on filtering. Yang et al. used the joint bilateral filter [1] to weight the degree of smoothing in each depth patch by considering the color similarity between the center pixel and its neighborhood [13]. Liu et al. designed the upsampling weights using geodesic distances [14]. With the use of image segmentation, Lu et al. developed a smoothing method to reconstruct depth structures within each segment [6].

Global methods formulate depth upsampling as an optimization problem where a large cost is given to a pixel in the depth map if neighboring depth pixels have similar color in the associated intensity image but different depth values. Diebel et al. proposed a Markov Random Field (MRF) formulation, which consists of a data term from the LR depth map and a smoothness term from the corresponding HR intensity image for depth upsampling [15]. Park et al. utilized nonlocal means filtering in which intensity features act as weights in depth regularization [2]. Ferstl et al. used an anisotropic diffusion tensor to regularize depth upsampling [4]. Yang et al. developed an adaptive color-guided auto regression model for depth recovery [5]. Aodha et al. focused on single-image upsampling, formulated as an MRF labeling problem [16].

Dictionary methods exploit the relationship between paired LR and HR depth patches through sparse coding. Yang et al. sought the coefficients of this representation to generate the HR output [17]. Timofte et al. improved the sparse-coding method by introducing anchored neighborhood regression [18]. Ferstl et al. proposed to learn a dictionary of edge priors for an anisotropic guidance [19]. Li et al. proposed a joint examples-based upsampling method [20]. Kwon et al. formulated an upscaling problem which consists of scale-dependent dictionaries and TV regularization [7].

CNN-based methods differ from dictionary-based approaches in that CNNs do not explicitly learn dictionaries. Motivated by convolutional dictionaries [21], Osendorfer et al. presented a convolutional sparse coding method for super-resolving images [22]. Wang et al. developed a cascade of sparse coding based networks (CSCN) [12] that are constructed by using modules from the network for the learned iterative shrinkage and thresholding algorithm (LISTA) [21]. However, their decoder uses sparse codes to infer each HR patch separately. All the recovered patches must then be put back to the corresponding positions in the HR image. Dong et al. proposed an end-to-end super-resolution convolutional neural network (SRCNN) to achieve image restoration [11].

Compared with the above methods, our CNNs exhibit several advantages. We do not explicitly formulate an optimization problem as the global methods do [2, 4, 5, 15] or design a fixed filter as the local methods do [6, 13, 14] because a CNN can be trained to address the upsampling problem. In contrast to the dictionary methods [7, 19], our networks are self-regularized. No extra regularization on the upsampled image is necessary outside the network. In distinction to other single-image super resolution CNNs [11, 12, 22], our networks do not use a single fixed (non-trainable) upsampling operator. More importantly, our MSG-Net is specifically designed for image-guided depth upsampling. Rich hierarchical features in the HR intensity image are learned to guide the upsampling of the LR depth map progressively in multiple levels towards the desired HR depth map. The multi-scale fusion architecture in turn enables MSG-Net to achieve high-quality upsampling performance especially at large upscaling factors.

Our work is related to the multi-scale CNNs for semantic segmentation (FCN) [23], inferring images of chairs [24], optical flow generation (FlowNet) [25] and holistically-nested edge detection (HED) [26]. Our network architecture differs from theirs significantly. An upsampling network is used in [24]. A downsampling network is used in HED. A downsampling sub-network followed by an upsampling sub-network is used in FlowNet and FCN. We use an upsampling (depth) branch in parallel with a downsampling (intensity) branch. This network architecture has not been studied yet. In common with [23–25], we use multiple backwards convolutions for upsampling, but we do not use feed-forwarding or unpooling. None of the above networks except HED uses deep supervision.

3 Intensity-Guided Depth Map Upsampling

Suppose we have a LR depth map \(D_{l}\) which is down-sampled from its HR counterpart \(D_{h}\). Additionally, a corresponding HR intensity image \(Y_{h}\) of the same scene is available. Our goal is to recover \(D_{h}\) using \(D_{l}\) and \(Y_{h}\).

We first present some insights about the upsampling architecture. These motivate the design of our proposed upsampling CNNs.

Spectral Decomposition. We have observed that a simple upsampling operator such as bicubic interpolation performs very well in smooth regions, but sharpness is lost along edges. Unlike SRCNN [11] and CSCN [12], we do not enlarge \(D_{l}\) using a fixed upsampling operator and then refine the enlarged \(D_{l}\) afterwards. To achieve optimal upsampling, we believe that different spectral components of \(D_{l}\) need to be upsampled using different strategies because a single upsampling operator is unlikely to be suitable for upsampling of all kinds of structures.

Multi-scale Upsampling. Multi-scale representation has played an important role in the success of addressing low-level problems like motion-depth fusion [27], optical flow generation [23] and depth map recovery [7]. Different structures in an image have different scales. A multi-scale upsampling CNN that allows the use of scale-dependent upsampling kernels can greatly improve the quality of the recovered HR image especially at large upscaling factors.

3.1 Formulation

We design MSG-Net to upsample a LR image \(D_{l}\) not in a single level but progressively in multiple levels to a desired HR image \(\widetilde{D_{h}}\) with multi-scale guidance from the corresponding HR intensity image \(Y_{h}\). We upsample \(D_{l}\) in m levels for the upscaling factor \(2^{m}\). Figure 3 shows an overview of the network architecture. It consists of five stages, namely feature extraction (each for Y- and D-branches), downsampling, upsampling, fusion and reconstruction. We will discuss the details of each stage in this section.

Fig. 3. The architecture of MSG-Net. For the ease of representation, only an upsampling CNN with upscaling factor 8 is presented. There are three multi-scale upsampling levels. Each level consists of an upsampling and a fusion stage.

Overview. It is not possible to determine the absolute depth value of a pixel from an intensity patch alone as it is an ill-posed problem. Flat intensity patches (regardless of what intensity values they possess) do not contribute much improvement in depth super resolution. Therefore, we complement depth features with the associated intensity features in the high-frequency domain. In other words, we perform an early spectral decomposition of \(D_{l}\): \(D_{l} = l(D_{l}) + h(D_{l})\). Using the high-frequency (h) components of both Y and D images as the inputs gives room for the network to focus on structured features for joint upsampling and filtering. This in turn improves the upsampling performance greatly. We have also observed a reduction in convergence time when the network is trained in the high-frequency domain. We obtain the high-frequency components of \(D_{l}\), \(D_{h}\), and \(Y_{h}\) by subtracting low-frequency components obtained with a low-pass filter \(\mathbf{W}_{l}\) as follows:

$$\begin{aligned} h(D_{l})&= D_{l} - \mathbf{W}_{l} *D_{l}, \end{aligned}$$
(1.1)
$$\begin{aligned} h(D_{h})&= D_{h} - (\mathbf{W}_{l} *D_{l})^{\uparrow D_{h}}, \end{aligned}$$
(1.2)
$$\begin{aligned} h(Y_{h})&= Y_{h} - \mathbf{W}_{l} *Y_{h}, \end{aligned}$$
(1.3)

where \((I_{l})^{\uparrow D_{h}}\) performs a bicubic upsampling on \(I_{l}\) to the same resolution as \(D_{h}\).
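The decomposition in (1.1) and (1.3) can be illustrated with a minimal sketch. It assumes that \(\mathbf{W}_{l}\) is a \(3 \times 3\) averaging filter and that image borders are handled by replicating edge pixels; both are assumptions made for illustration, and the function names `box_blur3` and `high_frequency` are hypothetical. Eq. (1.2) additionally requires a bicubic upsampling step, which is omitted here.

```python
def box_blur3(img):
    """3x3 mean filter; borders are replicated (an assumption, not from the paper)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index to the image
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index to the image
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def high_frequency(img):
    """h(I) = I - W_l * I, as in Eqs. (1.1) and (1.3)."""
    low = box_blur3(img)
    return [[img[y][x] - low[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]


# A vertical step edge: the response is non-zero only next to the edge.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
hf = high_frequency(step)
```

Flat regions map to zero in this representation, so only structured content such as the step edge remains as input.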

Suppose the upscaling factor is \(s = 2^{m}\), then there are M layers (including m upsampling levels) in the main branch and 2m layers in the Y branch. MSG-Net can be expressed as follows:

$$\begin{aligned} F^{Y}_{1}&= \sigma (\mathbf{W}^{Y}_{c(1)}*h(Y_{h}) + \mathbf{b}^{Y}_{1}), \text { (feature extraction)} \end{aligned}$$
(2.1)
$$\begin{aligned} F^{Y}_{j}&= \sigma (\mathbf{W}^{Y}_{c(j)}*F^{Y}_{j-1}+ \mathbf{b}^{Y}_{j}), \text { (post-feature extraction)} \end{aligned}$$
(2.2)
$$\begin{aligned} F^{Y}_{2j'}&= maxpool(F^{Y}_{2j'-1}), \text { (downsampling)} \end{aligned}$$
(2.3)
$$\begin{aligned} F_{1}&= \sigma (\mathbf{W}_{c(1)}*h(D_{l}) + \mathbf{b}_{1}), \text { (feature extraction)} \end{aligned}$$
(2.4)
$$\begin{aligned} F_{k}&= \sigma \left( \mathbf{W}_{d(k)}\star F_{k-1} + \mathbf{b}_{k}\right) , \text { (upsampling)} \end{aligned}$$
(2.5)
$$\begin{aligned} F_{k+1}&= \sigma \left( \mathbf{W}_{c(k+1)}*\left( F^{Y}_{2(m+1-k/3)}, F_{k}\right) + \mathbf{b}_{k+1}\right) , \text { (fusion)} \end{aligned}$$
(2.6)
$$\begin{aligned} F_{k+2+k'}&= \sigma \left( \mathbf{W}_{c(k+2+k')}*F_{k+1+k'} + \mathbf{b}_{k+2+k'}\right) , k' \in \{0, 1\}\text { (post-fusion)} \end{aligned}$$
(2.7)
$$\begin{aligned} F_{M}&= h(\widetilde{D_{h}}) = \mathbf{W}_{c(M)}*F_{M-1} + \mathbf{b}_{M}, \text { (reconstruction)} \end{aligned}$$
(2.8)
$$\begin{aligned} \widetilde{D_{h}}&= h(\widetilde{D_{h}}) +(\mathbf{W}_{l} *D_{l})^{\uparrow \widetilde{D_{h}}}, \text { (post-reconstruction)} \end{aligned}$$
(2.9)

where \(j = \{2, 3, 5, \ldots , 2m-1\}\), \(j' = \{4, 6, \ldots , 2m\}\), \(k = \{2, 5, \ldots , 3m-1\}\) and \(M = 3(m + 1)\). The operators \(*\) and \(\star \) represent convolution and backwards convolution respectively. Vectors (or blobs) having superscript Y in (2) belong to the HR intensity (Y) branch of MSG-Net. \(\mathbf{W}_{c(i)/d(i)}\) is a kernel (subscripts c and d stand for convolution and deconvolution respectively) of size \(n_{i-1} \times f_{i} \times f_{i} \times n_{i}\) (\(n_{i-1}\) and \(n_{i}\) are the numbers of feature maps in the \((i-1)^{th}\) and \(i^{th}\) layers, respectively) and \(\mathbf{b}_{i}\) is an \(n_{i}\)-dimensional bias vector (it is a scalar in the top layer). Each layer is followed by an activation function for non-linear mapping except the top layer. We use the parametric rectified linear unit (PReLU) [28] as the activation function \((\sigma )\) due to its generalization and improvement in model fitting, where \(\sigma (y) = \max (0, y) + a\min (0, y)\) and a is a learnable slope coefficient for negative y.
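The activation \(\sigma \) can be written directly from its definition. The following is a minimal scalar sketch; the slope value 0.25 used in the example is only an illustrative choice, since in the network a is learned.

```python
def prelu(y, a):
    """PReLU: sigma(y) = max(0, y) + a * min(0, y)."""
    return max(0.0, y) + a * min(0.0, y)


positive = prelu(2.0, 0.25)   # positive inputs pass through unchanged
negative = prelu(-2.0, 0.25)  # negative inputs are scaled by the slope a
```

Setting a to zero recovers the ordinary ReLU, which is why PReLU can be seen as a generalization of it.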

Denote F as our overall network architecture for MSG-Net and \({\varTheta } = \left\{ \mathbf{W}, \mathbf{b}, \mathbf{a}\right\} \) as the network parameters controlling the forward process, we train our network by minimizing the mean squared error (MSE) for N training samples as follows:

$$\begin{aligned} L(\varTheta ) = \frac{1}{N}\sum \nolimits _{i=1}^N ||F(h(Y_{h(i)}), h(D_{l(i)}); \varTheta ) - h(D_{h(i)})||^{2}. \end{aligned}$$
(3)

The loss is minimized using stochastic gradient descent.
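As a small illustration of the objective in (3), the sketch below computes the mean over N samples of the squared L2 norm of the residual between a predicted and a target high-frequency map. The network output is replaced by plain nested lists, and the name `mse_loss` is hypothetical.

```python
def mse_loss(predictions, targets):
    """Eq. (3): average over samples of the squared L2 norm of (prediction - target)."""
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        for pred_row, tgt_row in zip(pred, tgt):
            for p, t in zip(pred_row, tgt_row):
                total += (p - t) ** 2
    return total / len(predictions)


# Two tiny 1x2 "images": squared residual norms are 4 and 25.
preds = [[[1.0, 2.0]], [[0.0, 0.0]]]
truth = [[[1.0, 0.0]], [[3.0, 4.0]]]
loss = mse_loss(preds, truth)
```

Note that the squared norm is summed over pixels within a sample and averaged only over samples, following the form of (3).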

Feature Extraction. MSG-Net first decomposes a LR high-frequency depth map \(h(D_{l})\) and the associated HR high-frequency image \(h(Y_{h})\) into different spectral components (sub-bands) at the bottom layer and the first two layers of the D- and Y-branches respectively. This helps the network to learn scale-dependent and spectral-dependent upsampling operators afterwards.

Multi-scale Upsampling. We perform upsampling in m levels. Backwards convolution (or so-called deconvolution) (deconv) in the \(i^{th}\) layer is used to upsample the sub-bands \(F_{i-1} = \{f_{(i-1, j)}, j = 1, \ldots , n_{i-1}\}\) in the \((i-1)^{th}\) layer. Each deconv layer has a set of trainable kernels \(\mathbf{W}_{d(i)} = \{\mathbf{w}_{d(i,j)}, j = 1, \ldots , n_{i}\}\) such that \(\mathbf{w}_{d(i,j)} = \{w_{d(i,j,k)}, k = 1, \ldots , n_{i-1}\}\) and \(w_{d(i,j,k)}\) is a \(f_{i} \times f_{i}\) filter. \(\mathbf{Deconv}\) recovers the \(j^{th}\) HR sub-band in the \(i^{th}\) layer by utilizing the dependency across all LR sub-bands in the \((i-1)^{th}\) layer as follows:

$$\begin{aligned} f_{(i, j)} = \sum \nolimits _{k=1}^{n_{i-1}}w_{d(i,j,k)} \star f_{(i-1,k)} + b_{(i,j)}. \end{aligned}$$
(4)

More specifically, each element in a HR sub-band is constructed by element-wise summation of a corresponding set of enlarged blocks of pixels across all the LR sub-bands in the previous layer. Suppose a stride s is used, each enlarged block of pixels is centered in a 2D regular grid with length s.
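The scatter-and-sum view of backwards convolution described above can be sketched as follows. `deconv2d` places one enlarged block per input pixel on a grid with spacing equal to the stride, and `deconv_subband` sums the results over all LR sub-bands as in (4). The function names and the `center_crop` helper are hypothetical, and cropping by a fixed border is an assumption made for illustration rather than the cropping rule used in the paper.

```python
def deconv2d(feature, kernel, stride):
    """Single-channel backwards convolution (uncropped) with a square kernel."""
    h, w = len(feature), len(feature[0])
    f = len(kernel)
    out_h = (h - 1) * stride + f
    out_w = (w - 1) * stride + f
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(h):
        for x in range(w):
            value = feature[y][x]
            # Each input pixel contributes an f x f block anchored on a grid of spacing `stride`.
            for ky in range(f):
                for kx in range(f):
                    out[y * stride + ky][x * stride + kx] += value * kernel[ky][kx]
    return out


def deconv_subband(lr_subbands, kernels, bias, stride):
    """Eq. (4): one HR sub-band as the sum over all LR sub-bands plus a bias."""
    acc = None
    for fmap, ker in zip(lr_subbands, kernels):
        up = deconv2d(fmap, ker, stride)
        if acc is None:
            acc = up
        else:
            acc = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(acc, up)]
    return [[v + bias for v in row] for row in acc]


def center_crop(fmap, border):
    """Remove `border` pixels from every side (illustrative cropping rule)."""
    return [row[border:len(row) - border] for row in fmap[border:len(fmap) - border]]


ones3 = [[1.0] * 3 for _ in range(3)]
row = deconv2d([[1.0, 1.0]], ones3, 2)[0]  # neighbouring blocks overlap in one column
```

With a \(5 \times 5\) kernel and stride 2, a \(20 \times 20\) input gives a \(43 \times 43\) output before cropping, and removing 2 pixels per side in this sketch leaves \(39 \times 39\).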

Fischer et al. [25] and Long et al. [23] proposed to feed-forward and concatenate feature maps from lower layers. MSG-Net uses a more effective design. We directly enlarge feature maps which originate from the previous layer without feed-forwarding. Unlike the “unpooling + convolution” (uconv) layer introduced by Dosovitskiy et al. [24], our upsampling uses backwards convolution in which it diffuses a set of feature maps to another set of larger feature maps. The diffusion is governed by the learned deconv filters but not simply filling zeros. More importantly, uconvs are used in their networks to facilitate the transformation from a high-level representation generated by multiple fully-connected (FC) layers to two images but not to upsample a given LR image.

To balance computational efficiency and upsampling accuracy, we set \(f_{i}\) for \(\mathbf{W}_{d(i)}\) to be \(2s + 1\). Having such a kernel size ensures that all the inter-pixels between the demultiplexed pixels in each feature map are completely covered by the deconv filter \(\mathbf{W}_{d}\). We observed that \(\mathbf{W}_{d}\) with a size larger than \((2s + 1) \times (2s+1)\) does not bring significant improvement.

Downsampling. The associated HR intensity image \(Y_{h}\) possesses the same resolution as the HR depth map \(D_{h}\). In our design, \(D_{l}\) is progressively upsampled by a factor of 2 in a multi-scale manner. In order to match the size of the feature maps for D and Y, we progressively downsample the feature maps extracted from \(h(Y_{h})\) at the reverse pace by a convolution followed by a \(3 \times 3\) maximum pooling with stride = 2. Downsampling of feature maps in the Y-branch can also be achieved by using a \(3 \times 3\) convolution with stride = 2, but the resulting CNN performs slightly worse than the one using pooling.
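The pooling step used for downsampling in the Y-branch can be sketched as below. The sketch assumes that only windows lying entirely inside the feature map are used (no padding), which is an assumption about border handling, and the name `maxpool3x3_s2` is hypothetical.

```python
def maxpool3x3_s2(fmap):
    """3x3 maximum pooling with stride 2 over windows fully inside the map."""
    h, w = len(fmap), len(fmap[0])
    out_h = (h - 3) // 2 + 1
    out_w = (w - 3) // 2 + 1
    return [[max(fmap[2 * y + dy][2 * x + dx]
                 for dy in range(3) for dx in range(3))
             for x in range(out_w)]
            for y in range(out_h)]


# A 5x5 ramp image: each output is the largest value in its 3x3 window.
ramp = [[float(5 * y + x) for x in range(5)] for y in range(5)]
pooled = maxpool3x3_s2(ramp)
```

Under this assumption a feature map with a side of 63 pixels is reduced to 31 pixels, roughly halving the resolution at each level.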

Fusion. The upsampled feature maps \(F_{k}\) are complemented with the corresponding feature maps \(F^{Y}_{2(m+1-k/3)}\) in the Y-branch possessing the same resolution. The fusion kernel \(\mathbf{W}_{c(k+1)}\) in (2.6) constructs a new set of sub-bands by fusing the local features in the vicinity defined by \(\mathbf{W}_{c(k+1)}\) across all the sub-bands of \(F_{k}\) and \(F^{Y}_{2(m+1-k/3)}\). As intensity features in \(Y_{h}\) may not be consistent with depth structures in \(D_{h}\), a post-fusion layer is introduced to learn a better coupling. An extra post-fusion layer is included for an enhanced fusion before reconstruction.
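To make the layer indexing concrete, the sketch below enumerates the depth-branch layers and the Y-branch layer paired with each fusion layer. It rests on two interpretive assumptions that are not stated explicitly in the text: the quotient k/3 in the Y-branch index is rounded up, and the extra post-fusion layer (\(k' = 1\) in (2.7)) is present only at the last level, which is the reading under which the layer count equals \(M = 3(m+1)\). The function name `msg_net_schedule` is hypothetical.

```python
import math


def msg_net_schedule(m):
    """Return (layer index, role, fused Y-branch layer index or None) for the D-branch."""
    total_layers = 3 * (m + 1)  # M in the text
    layers = [(1, "feature extraction", None)]
    for level in range(1, m + 1):
        k = 3 * level - 1  # k in {2, 5, ..., 3m - 1}
        # Assumption: k / 3 is rounded up so that the Y-branch index is an integer.
        y_index = 2 * (m + 1 - math.ceil(k / 3))
        layers.append((k, "upsampling", None))
        layers.append((k + 1, "fusion", y_index))
        layers.append((k + 2, "post-fusion", None))
    # Assumption: the extra post-fusion layer appears only before reconstruction.
    layers.append((total_layers - 1, "extra post-fusion", None))
    layers.append((total_layers, "reconstruction", None))
    return layers


schedule = msg_net_schedule(3)  # the 8x network shown in Fig. 3
fused_with = [y for (_, role, y) in schedule if role == "fusion"]
```

Under these assumptions, the first (coarsest) fusion uses the most downsampled Y-branch features \(F^{Y}_{2m}\) and the last fusion uses the full-resolution features \(F^{Y}_{2}\).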

Reconstruction. The enlarged feature maps from the previous upsampling levels are generally “dense” in nature. Due to spectral decomposition, the energy (i.e. intensity) of each pixel in an image is distributed across different spectral components. The reconstruction layer combines \(n_{M-1}\) upsampled sub-bands and recovers a HR image. Finally, we convert the recovered HR \(h(\widetilde{D_{h}})\) from the high-frequency domain back to an ordinary HR depth map \(\widetilde{D_{h}}\) by the post-reconstruction step in (2.9). This is achieved by using the upsampled low-frequency image \((\mathbf{W}_{l} *D_{l})^{\uparrow \widetilde{D_{h}}}\), as in (1.2), as the missing low-frequency component for \(\widetilde{D_{h}}\).

Fig. 4. The network architecture of MS-Net for single-image super resolution. For the ease of representation, only an 8\(\times \) upsampling CNN is presented.

3.2 A Special Case: Single-Image Upsampling

Removing the (intensity) guidance branch and fusion stages of MSG-Net reduces it to a compact multi-scale network (MS-Net) for super-resolving images, at the cost of some upsampling accuracy. Figure 4 illustrates its network architecture. MS-Net is used for single-image super resolution. It consists of three stages, namely feature extraction, multi-scale upsampling and reconstruction. For an upscaling factor \(s = 2^{m}\), there are only \((m + 2)\) layers. MS-Net can be expressed as follows:

$$\begin{aligned} F_{1}&= \sigma (\mathbf{W}_{c(1)}*h(D_{l}) + \mathbf{b}_{1}), \text { (feature extraction)} \end{aligned}$$
(5.1)
$$\begin{aligned} F_{i}&= \sigma (\mathbf{W}_{d(i)}\star F_{i-1} + \mathbf{b}_{i}), i = 2, ..., M-1, \text { (upsampling)} \end{aligned}$$
(5.2)
$$\begin{aligned} F_{M}&= h(\widetilde{D_{h}}) = \mathbf{W}_{c(M)}*F_{M-1} + b_{M}, \text { (reconstruction)} \end{aligned}$$
(5.3)
$$\begin{aligned} \widetilde{D_{h}}&= h(\widetilde{D_{h}}) +(\mathbf{W}_{l} *D_{l})^{\uparrow \widetilde{D_{h}}}. \text { (post-reconstruction)} \end{aligned}$$
(5.4)

Denote F as our overall network architecture for MS-Net and \({\varTheta } = \left\{ \mathbf{W}, \mathbf{b}, \mathbf{a}\right\} \) as the network parameters controlling the forward process, we train our network by minimizing the mean squared error (MSE) for N training samples as follows:

$$\begin{aligned} L(\varTheta ) = \frac{1}{N}\sum \nolimits _{i=1}^N ||F\left( h\left( D_{l(i)}\right) ; \varTheta \right) - h\left( D_{h(i)}\right) ||^{2}. \end{aligned}$$
(6)

The loss is also minimized using stochastic gradient descent.

SS-Net vs MS-Net: Comparing the number of deconv parameters in the network using a single large-stride deconv layer (SS-Net) with that in a multi-scale small-stride deconv network (MS-Net), the number of deconv parameters for the latter one is indeed lower. Suppose all deconv layers in MS-Net have \(s = 2\), then there are only \(25\sum _{i=2}^{m+1}{n_{i-1}n_{i}}\) kernel parameters. If they all have the same number of feature maps i.e. \(n_{1} = n_{2} = ... = n\), then there are \(25\,mn^{2}\) kernel parameters. For SS-Net, there are \((2^{m+1}+1)^2n^{2}\) kernel parameters.
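The two parameter counts can be compared numerically with a short sketch. It assumes, as in the text, that every layer has the same number of feature maps n and that the deconv kernel size is \(2s + 1\) for stride s; the function names are hypothetical.

```python
def ms_net_deconv_params(m, n):
    """m deconv layers with stride 2, i.e. 5x5 kernels: 25 * m * n^2 parameters."""
    return 25 * m * n * n


def ss_net_deconv_params(m, n):
    """One deconv layer with stride 2^m, i.e. a (2^(m+1) + 1)^2 kernel."""
    kernel_size = 2 * (2 ** m) + 1
    return kernel_size * kernel_size * n * n


# 8x upsampling (m = 3) with n = 32 feature maps per layer.
ms_count = ms_net_deconv_params(3, 32)
ss_count = ss_net_deconv_params(3, 32)
```

For m = 3 and n = 32 this gives 76,800 deconv kernel parameters for MS-Net against 295,936 for SS-Net, and the gap widens quickly with m because the single-scale kernel area grows exponentially while the multi-scale count grows linearly.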

4 Experiments

4.1 Training Details

We collected 58 RGBD images from MPI Sintel depth dataset [29], and 34 RGBD images (6, 10 and 18 images are from 2001, 2006 and 2014 datasets respectively) from Middlebury dataset [30–32]. We used 82 images for training and 10 images for validation. We augmented the training data by a 90\(^{\circ }\)-rotation. The training and testing RGBD data were normalized to the range [0, 1].

Instead of using large-size images for training, sub-images were generated from them by dividing each image into a regular grid of small overlapping patches. This training approach does not reduce the performance of the CNN but it leads to a reduction in training time [23]. We performed a regular sampling on the raw images with stride = {22, 21, 20, 24} for the scale \(= \{2, 4, 8, 16\}\) respectively. We excluded patches without depth information due to occlusion. There were roughly 190,000 training sub-images. To synthesize LR depth samples \(\{D_{l}\}\), we first filtered each full-resolution sub-image by a 2D Gaussian kernel and then downsampled it by the given scaling factor. The LR/HR patches \(\{D_{l}\}\)/\(\{D_{h}\}\) (and \(\{Y_{h}\}\)) were prepared to have sizes \(20^{2}/39^{2}\), \(16^{2}/63^{2}\), \(12^{2}/95^{2}\), \(8^{2}/127^{2}\) for the upscaling factors 2, 4, 8, 16, respectively. We prefer not to use a set of large-size sub-images for training upsampling networks with large upscaling factors (e.g. \(8 \times , 16 \times \)). In our experience, using them does not improve the training accuracy significantly. Moreover, it increases the computation time and memory burden for training.
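The regular-grid sampling of overlapping sub-images can be sketched as follows. The helper names are hypothetical, and marking missing depth with the value 0.0 is an assumption made only for this illustration; the paper does not specify how occluded pixels are flagged.

```python
def patch_origins(height, width, patch, stride):
    """Top-left corners of all patches on a regular grid that fit inside the image."""
    return [(y, x)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]


def extract_patches(image, patch, stride, is_valid=None):
    """Cut square sub-images on the grid, optionally keeping only valid ones."""
    h, w = len(image), len(image[0])
    patches = []
    for y, x in patch_origins(h, w, patch, stride):
        sub = [row[x:x + patch] for row in image[y:y + patch]]
        if is_valid is None or is_valid(sub):
            patches.append(sub)
    return patches


def no_missing_depth(sub, missing=0.0):
    """Assumption: pixels without depth are stored as `missing`."""
    return all(v != missing for row in sub for v in row)


# 39x39 patches with stride 22 (the 2x setting) on a 100x100 image overlap by 17 pixels.
origins = patch_origins(100, 100, 39, 22)
```

Because the stride is smaller than the patch size, neighbouring sub-images share pixels, which is how a modest number of source images yields a large training set.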

It is possible to train MS-Net (but not MSG-Net) without padding, as in SRCNN [11], to reduce memory usage and training time. We have to pad zeros for convolution layers in MSG-Net so that the dimension of the feature maps in the intensity branch can match that in the depth branch. We need to crop the resulting feature maps after performing backwards convolution so that the reconstructed HR depth map \(\widetilde{D_{h}}\) is close to the desired resolution. For consistency, we trained all our CNNs except SRCNN and its variant with a padding scheme.

We built our networks on top of the Caffe CNN implementation [33]. CNNs were trained with smaller base learning rates for large upscaling factors. Base learning rates varied from 3e−3 to 6e−5 for MSG-Net and 4e−3 to 4e−4 for MS-Net. We chose momentum to be 0.9. Unlike SRCNN [11], we used a stepwise decrease (5 steps with learning rate multiplier \(\gamma = 0.8)\) as the learning policy because we found that a lower learning rate in the later part of the training process can reduce fluctuation in the convergence curve. We trained each MS-Net and MSG-Net for 5e+5 iterations. We set the network parameters: \(\mathbf{W}_{l} = \frac{1}{9}I_{3}\), \(f^{Y}_{1} = 7, n^{Y}_{1} = 49, n_{1} = 64\) and (\(f_{i} = 5\), \(n_{i} = 32)\) for other layers. We initialized all the filter weights and bias values as in PReLU networks [28].
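A stepwise learning-rate policy of this kind can be sketched as below, following the usual step rule in which the rate is multiplied by \(\gamma \) once every fixed number of iterations. The step size of 1e5 iterations is an assumption obtained by dividing the 5e5 training iterations into five equal stages; the actual step size is not stated in the text.

```python
def step_learning_rate(base_lr, iteration, step_size, gamma=0.8):
    """Step policy: base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)


BASE_LR = 3e-3       # largest base rate quoted for MSG-Net
STEP_SIZE = 100000   # assumption: 5e5 iterations split into five equal stages

lr_start = step_learning_rate(BASE_LR, 0, STEP_SIZE)
lr_end = step_learning_rate(BASE_LR, 499999, STEP_SIZE)
```

Under this assumed step size the rate in the final stage is \(0.8^{4} \approx 0.41\) times the base rate.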

We trained a specific network for each upscaling factor \(s \in \{2, 4, 8, 16\}\). We adopted the following pre-training and fine-tuning scheme for MSG-Net: (1) we pre-trained the Y- and D-branches for a 2\(\times \) MSG-Net separately, (2) we transferred the first two layers of them (D-branch: {conv1, deconv2} and Y-branch: {conv1Y, conv2Y}) to a plain 2\(\times \) MSG-Net and then fine-tuned it. For training MSG-Net with other upsampling factors (\(2^{m}, m > 1\)), we transferred all the layers except the last four layers in the D-branch from the network trained with upsampling factor \(2^{m-1}\) to a plain network and then fine-tuned it. We trained SRCNNs for different upscaling factors using the same strategy as recommended by the authors [11]. We also modified SRCNN by replacing the ReLU activation functions with PReLU. We name this variant SRCNN2.

Fig. 5. Visualization of the bottom-layer kernels for five CNNs trained for 8\(\times \) upsampling. Their kernel sizes are: \(9 \times 9\) for SRCNN and SRCNN2, \(5 \times 5\) for MS-Net, \(7 \times 7\) (Top: Y-branch), \(5 \times 5\) (Bottom: D-branch) for MSG-Net.

4.2 Analysis of the Learned Kernels

The bottom-layer filters of SRCNN trained for depth map upsampling are different from those trained for image super resolution [11]. As shown in Fig. 5a, we can recognize some flattened edge-like and Laplacian filters. The filters near the right of the second row are completely flat (so-called “dead” filters). Figure 5b visualizes the filters of the trained SRCNN2. In comparison to SRCNN, SRCNN2 has sharper edge-like filters and fewer “dead” filters.

We trained MS-Net in two approaches: using ordinary and high-frequency (i.e. with early spectral decomposition) domains. As shown in Fig. 5c and d, we can recognize simple gradient operators such as horizontal, vertical and diagonal filters for both of the cases. When MS-Net is trained in ordinary domain, it first decomposes the components of LR depth map into a complete spectrum and performs spectral upsampling subsequently. By training MS-Net in high-frequency domain, all the bottom-layer kernels become high-pass filters. Similar patterned filters (bottom of Fig. 5e) are present in the first layer of the D-branch of MSG-Net as well. For the Y-branch, the learned filters (top of Fig. 5e) contain both textured and low-varying filters.

Table 1. Quantitative comparison (in RMSE) on dataset A.
Table 2. Quantitative comparison (in RMSE) on dataset B.

4.3 Results

We provide both quantitative and qualitative comparisons of our image-guided upsampling CNN (MSG-Net) and single-image upsampling CNN (MS-Net) with the state-of-the-art methods. We report upsampling performance in terms of root mean squared error (RMSE). We evaluate our methods on the hole-filled Middlebury RGBD datasets. We denote them as A [4], B [5] and C [19]. The RMSE values in Tables 1, 2 and 3 for the compared methods are computed using the upsampled depth maps provided by Ferstl et al. [4], Yang et al. [5] and Ferstl et al. [19] respectively, except the evaluations for Kiechle et al. [3] and Wang et al. [12] (code packages provided by the authors), Lu et al. [6] (upsampled depth maps provided by the authors) and SRCNN(2) (trained by ourselves). The best RMSE for each evaluation is in bold, whereas the second best one is underlined. Since the ground truths are quantized to 8-bit, we convert all recovered HR depth maps to the same data type in order to have a fair evaluation. Following [6, 7, 19], we performed evaluation on dataset C only up to 8\(\times \) due to the low resolution (\(<450 \times 375\)) of the ground truths.
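The evaluation protocol of quantizing a recovered depth map to 8-bit before computing RMSE can be sketched as follows. The mapping from the normalized range [0, 1] to integers in [0, 255] by scaling and rounding is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def quantize_8bit(depth):
    """Map values in [0, 1] to integers in [0, 255], clipping out-of-range values."""
    return [[min(255, max(0, int(round(v * 255.0)))) for v in row] for row in depth]


def rmse(pred, truth):
    """Root mean squared error over all pixels of two equally sized maps."""
    total = 0.0
    count = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += (p - t) ** 2
            count += 1
    return math.sqrt(total / count)


truth_8bit = [[0, 255]]
recovered = [[0.0, 0.98]]  # normalized network output
error = rmse(quantize_8bit(recovered), truth_8bit)
```

Quantizing first means that sub-level differences smaller than one 8-bit step do not affect the reported error, which keeps the comparison consistent with the 8-bit ground truths.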

As shown in the three tables, our single-image upsampling CNN (MS-Net) achieves state-of-the-art performance. SRCNN2 performs better than the original SRCNN due to the use of PReLU as the activation function. Although MS-Net and SRCNN(2) are both designed for single-image super resolution, MS-Net outperforms SRCNN(2). This is because MS-Net performs image upsampling rather than image refinement as SRCNN(2) does. MS-Net (and also MSG-Net) is trained to learn different upsampling operators for different spectral components of the LR depth map. They are not constrained to a fixed non-trainable upsampling operator. The upsampling performance is further improved when MSG-Net upsamples the LR depth map with the guidance from the HR intensity image of the same scene. This in turn allows MSG-Net to outperform MS-Net. Figure 6 shows 8\(\times \) upsampled depth maps for different methods. It is observed that HR depth boundaries reconstructed by MSG-Net are sharper than those of the compared methods. The evaluations suggest that multi-scale guidance has played an important role in the success of depth map super resolution in MSG-Net.

Table 3. Quantitative comparison (in RMSE) on dataset C.
Fig. 6. Upsampled depth maps for dataset A. (a) Color image and ground-truth depth patches. Upsampled results from (b) Ferstl et al. [4], (c) Kiechle et al. [3], (d) SRCNN [11], and (e) MSG-Net.

Table 4. RMSE for different variants of MSG-Net with upscaling factor 8.
Table 5. Computation time (sec).
Fig. 7. Convergence curves.

The Role of Guidance. We evaluate several variants of MSG-Net at upscaling factor 8: (1) MS(woG)-Net (without Y-branch), (2) MSG(2, 4)-Nets (intensity guidance only applied at deconv(2, 4) respectively) and (3) MSG-Net(ord) (trained in ordinary domain). As summarized in Table 4, MSG-Net outperforms the others. Compared with the partially guided variants MSG(2, 4)-Nets, MS(woG)-Net loses some upsampling performance due to the absence of the guidance branch.

The Role of Multi-scale Upsampling. We consider the single-scale variant of MSG-Net: SSG-Net (deconv4 uses stride = 8, conv3Y - pool4Y in Y-branch and deconv2 - conv3_2 in D-branch are removed). As shown in Table 4, SSG-Net performs worse than MSG-Net. This suggests that a multi-scale architecture is necessary in guided upsampling.

Training in Frequency-Domain. As presented in Table 4 and Fig. 7, MSG-Net not only performs better than its ordinary-domain trained counterpart MSG-Net(ord) in upsampling accuracy but also converges faster. The difference in the speed of convergence is more obvious between MS-Net and MS-Net(ord). This supports our motivation in Sect. 3.1 that using the high-frequency domain can facilitate depth-intensity fusion and reduce training time.

Timings. We summarize the computation time for upscaling different LR depth maps of Art to their full resolution (\(1376 \times 1088\)) using MS-Net and MSG-Net in Table 5. Upsampling was performed in MATLAB with a TITAN X GPU.

5 Conclusion

We have presented a new framework to address the problem of depth map upsampling by using a multi-scale guided convolutional neural network (MSG-Net). A LR depth map is progressively upsampled with the guidance of the associated HR intensity image. Using such a design, MSG-Net achieves state-of-the-art performance for super-resolving depth maps. We have also studied a special case of it for multi-scale single-image super resolution (MS-Net) without guidance. Although sacrificing some upsampling performance, MS-Net in turn has a compact network architecture and it still achieves good performance.