
1 Introduction

Building 3D models of real scenes has been a longstanding goal of computer vision. While impressive results can be achieved with multi-view and video-based approaches [1–4], the progress of depth sensors and their decreasing prices make them an attractive alternative, able to capture 3D in a single shot [5]. Unfortunately, even the best depth sensors still provide imperfect measurements. In particular, these measurements are often sparse and contain large holes due to various factors, such as reflective surfaces or portions of the scene that are too distant.

Overcoming these limitations has therefore recently become a popular research topic. For instance, depth super-resolution [6–11] tackles the sparseness issue and attempts to densify the observed depth data. Typically, however, existing methods assume that the measurements are regularly spaced, and are thus ill-suited to handle large holes. By contrast, depth completion or inpainting [12, 13] are designed to handle irregular measurements and fill holes in the input depth maps by leveraging RGB image information, or fusing multiple depth measurements [14]. These methods, however, simply complete the observed data. As a consequence, they are ill-suited to build a model of a scene, where one is not interested in modeling the foreground objects. To address this problem, one should truly hallucinate the depth behind the observed foreground objects.

Little work has been done to tackle the task of depth hallucination from a noisy depth map and its corresponding RGB image [12, 13, 15, 16], and existing methods typically work under additional assumptions. For example, [12, 13] rely on a user-defined foreground mask to hallucinate the background depth. The method in [15] relies on a layered depth model that simply assumes each layer to be a smoothly varying surface, thus considering neither semantics nor image information. While [16] exploits image and semantics, it relies on CAD models to represent the foreground objects. Furthermore, both [15] and [16] were designed for the indoor scenario, and are thus ill-suited to handle complex outdoor scenes.

By contrast, in this paper, we introduce a fully automatic approach to performing depth completion and hallucination for general (outdoor) scenes in a single shot. To this end, we develop a two-layer scene model accounting for the visible information and the hidden one. In each layer, we jointly estimate the depth and the semantics of the scene. Not only does this let us leverage depth to detect the foreground objects, but it also allows us to exploit the dependencies between depth and semantics to improve completion and hallucination. As evidenced by Fig. 1, our approach lets us accurately fill the large holes in the input depth maps, segment the different kinds of objects observed in the scene, and hallucinate the depth and semantics behind the foreground objects.

Specifically, we rely on the assumptions that depth is piecewise planar, semantics piecewise constant, and that the discontinuities of both modalities should largely coincide. We show that these assumptions can be formalized with a single Mumford-Shah functional. We then formulate the task of jointly completing and hallucinating depth and semantics as a discrete-continuous optimization problem whose variables encode a foreground-background mask and two layers of depth and semantics information: one for the data that is visible in the image/depth map and one for the data that is hidden behind the foreground. Following an alternating optimization strategy, we show that each type of variable admits an elegant solution: the discrete ones can be computed via simple thresholding, and the continuous ones via a primal-dual algorithm implemented on the GPU. Altogether, this provides us with an effective framework to build scene models from a single noisy depth map and its corresponding RGB image despite the presence of undesirable foreground objects.

We demonstrate the effectiveness of our approach on two datasets, i.e., KITTI [17] and Stixel [18]. Our experiments evidence that our method can produce accurate models of complex outdoor scenes without requiring any manual intervention. This, we believe, constitutes a significant step towards making 3D scene modeling in real, dynamic environments practical.

Fig. 1.

Our approach. Given an input RGB image and a noisy, incomplete depth map, we complete and hallucinate depth and semantics to produce a complete scene model. First row: input RGB image, incomplete depth measurements, estimated semantics; second row: completed depth for the visible layer, hallucinated depth and semantics for the hidden layer. (Color figure online)

2 Related Work

With access to depth sensors becoming easier every day, increasingly many methods rely on depth as input for various applications, such as autonomous driving [17], augmented reality [19] and personal robotics [20]. Unfortunately, depth sensors are not perfect; they typically produce relatively sparse measurements with large holes.

Depth super-resolution attempts to overcome the sparseness issue by generating a high-resolution depth map from a low-resolution one. This is typically achieved via Markov Random Fields [6, 7, 12], bilateral filtering [21], layered representations [22], patch-based approaches [10, 11], or depth transfer [8, 9]. These approaches, however, inherently assume to have access to regularly-spaced depth measurements, and thus cannot handle large holes in depth maps.

By contrast, depth completion techniques have been designed to work with irregular measurements and to fill in large holes. In this context, Liu et al. [23] combine a modified fast marching method with guided filtering to inpaint Kinect depth maps. In [24], image segmentation is exploited to complete range data. Herrera et al. [25] propose an MRF with a second-order prior to inpaint piecewise planar depth maps. In [26], depth completion is formulated within a total variation framework where image cues guide the completion process. A different approach to depth completion consists of treating the depth map as an intensity image and relying on standard image inpainting algorithms, such as [27, 28]. All the above-mentioned methods focus on depth completion from a single view and aim at completing the visible scene information only. By contrast, some approaches have proposed to exploit multiple views [14, 29] and can thus handle the fact that parts of the scene are hidden in some views, albeit not in all of them. Similarly, great progress has been made in building complete scene models by fusing multiple noisy depth maps [30–32]. These methods, however, assume access to multiple input depth images.

Little work has been done on the problem of building a complete scene model in a single shot despite the presence of occluding objects. Guo and Hoiem [33] focus on semantic labeling of unseen surfaces without depth information. In the context of stereo matching, Bleyer et al. [34] introduce a method that hallucinates depth in the regions that are occluded in one view, but not in both. In [12, 13], while the goal is indeed to replace the depth of foreground objects with that of the background, the methods assume that a perfect, user-defined foreground mask is given. As a consequence, these approaches truly perform depth completion, albeit without the knowledge of the RGB intensity behind the foreground mask. By contrast, [15, 16] work without any manual input. However, in both cases, the methods were designed for the indoor scenario, and are thus ill-suited to model complex outdoor scenes, which are typically much more challenging.

In this paper, we introduce a fully-automatic approach to jointly completing and hallucinating depth and semantics. A key component of our approach is the use of a Mumford-Shah functional [35], which defines a non-convex energy function that encourages piecewise constant solutions. Strekalovskiy and Cremers [36] develop a real-time primal-dual algorithm for minimizing the Mumford-Shah functional with a single variable, which we use and extend in this paper. Furthermore, our work relies on the piecewise planar world assumption [37]. Despite its simplicity, this assumption has been widely adopted in modeling outdoor man-made scenes [38, 39]. Our work also relates to 3D scene understanding, where joint semantics and depth prediction has been explored, e.g., [40]. However, to the best of our knowledge, existing methods do not recover hidden surfaces.

3 Our Approach

Given partial depth measurements and a corresponding intensity image, our goal is to produce a complete scene model with background depth and semantics at every pixel, including those that are hidden by foreground objects. To this end, we need to simultaneously perform depth completion, reason about semantics, and hallucinate the background scene behind the foreground objects.

To achieve this, we introduce a two-layer scene representation modeling the visible information and the hidden one. Each layer consists of two modalities: depth and semantics. The resulting model is encoded by a discrete-continuous optimization problem. In Sect. 4, we develop an optimization procedure to minimize the corresponding energy, thus allowing us to jointly complete and hallucinate depth and semantics.

3.1 A Visible Layer for Semantics-Aware Depth Completion

We first focus on modeling the scene that is visible in the input data. We assume that the underlying scene is piecewise planar and the corresponding semantic label map piecewise constant. Furthermore, we rely on the intuition that the depth discontinuities are often aligned with the boundaries of semantic classes, which lets us exploit the semantics to further regularize depth completion.

Let I be an input image of size \({m\times n}\) and \(\mathbf {x}\in \varOmega \) denote a pixel location on the two-dimensional image plane \(\varOmega \). We associate each pixel with two variables encoding depth value and semantic label, respectively. The semantic label \(\mathbf {s}^v(\mathbf {x})\in \mathbb {R}^L\) is represented as an L-dimensional vector for L classes. As for depth, in this work, we make use of a disparity-based representation. The motivation behind this is the following: Let \(y^v(\mathbf {x})\in \mathbb {R}\) be the disparity value at pixel \(\mathbf {x}\). This disparity value can be equivalently encoded by plane parameters \(\mathbf {u}^v(\mathbf {x})\in \mathbb {R}^3\), since we can write \(y^v(\mathbf {x})=\mathbf {p}(\mathbf {x})^T{\mathbf {u}}^v(\mathbf {x})\), where \(\mathbf {p}(\mathbf {x})= (\mathbf {x}^T,1)^T\) is the homogeneous coordinate representation of \(\mathbf {x}\). Then, our piecewise planar assumption of the depth map, which is equivalent to a piecewise planar assumption of the disparity map, can be encoded by a piecewise constant assumption on the plane parameters. This therefore allows us to define a unified Mumford-Shah functional on \(\mathbf {u}^v\) and \(\mathbf {s}^v\), which simultaneously encodes our two initial assumptions.
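
For concreteness, the following NumPy sketch illustrates this change of variables, i.e., how a dense disparity map is recovered from per-pixel plane parameters via \(y^v(\mathbf {x})=\mathbf {p}(\mathbf {x})^T\mathbf {u}^v(\mathbf {x})\). It is an illustration only, not our GPU implementation, and the (column, row) pixel-coordinate convention is an assumption.

```python
import numpy as np

def disparity_from_planes(u):
    """Recover the disparity map y(x) = p(x)^T u(x) from per-pixel plane
    parameters u of shape (H, W, 3), with p(x) = (x, y, 1) in homogeneous
    pixel coordinates (column, row, 1)."""
    H, W, _ = u.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    p = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    return np.sum(p * u, axis=-1)

# A fronto-parallel surface corresponds to constant plane parameters (0, 0, c),
# i.e., constant disparity c; slanted surfaces have non-zero first two components.
u = np.zeros((4, 5, 3)); u[..., 2] = 10.0
print(disparity_from_planes(u))   # 4x5 map filled with 10.0
```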

The Mumford-Shah functional [35] was originally introduced to compute a piecewise smooth approximation of observed data. In our context, let us denote by \(\{y^o(\mathbf {x})\}_{\mathbf {x}\in \varOmega }\) the incomplete disparity measurements, with disparity observation mask \(\{d(\mathbf {x})\}_{\mathbf {x}\in \varOmega }\), where \(d(\mathbf {x}) = 1\) if the disparity measurement at pixel location \(\mathbf {x}\) is valid, and 0 otherwise. Furthermore, let \(\mathbf {s}^o(\mathbf {x})\) be a noisy label probability distribution at pixel \(\mathbf {x}\), obtained by any image-based semantic labeling method. Our goal is therefore for the visible layer to fit the observed data and, thanks to our change of variables, for both \(\mathbf {u}^v\) and \(\mathbf {s}^v\) to be piecewise constant with aligned discontinuities. This can be expressed by a coupled Mumford-Shah functional of the form

$$\begin{aligned} {E_v(\mathbf{u}^v,\mathbf{s}^v)}&= {E_d(\mathbf {u}^v, \mathbf {s}^v) + E_{r,v}(\mathbf{{u}}^v,\mathbf{{s}}^v),} \end{aligned}$$
(1)

where \(E_d(\mathbf{{u}}^v, \mathbf{{s}}^v)\) is the data fidelity term, and \(E_{r,v}(\mathbf{{u}}^v,\mathbf{{s}}^v)\) denotes the regularization term that jointly encodes the piecewise constant and aligned discontinuities assumptions. We now describe these two energy terms in detail.

Data term. The data term encourages the disparity and semantic label predictions to be consistent with the incomplete disparity measurements and the noisy semantic label probabilities. This can be expressed as

$$\begin{aligned} {E_d(\mathbf{{u}}^v, \mathbf{{s}}^v)}&= {\sum \limits _{\mathbf{{x}} \in \varOmega } d \cdot (\mathbf{{p}}^T \mathbf{{u}}^v - y^o)^2 + \eta _d \sum \limits _{\mathbf{{x}} \in \varOmega } \Vert \mathbf{{s}}^v - \mathbf{{s}}^o \Vert ^2\;,} \end{aligned}$$
(2)

where \(\eta _d\) is a weight that balances the influence of depth and semantics.
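
As an illustration, the data term of Eq. (2) can be evaluated as follows; this is a minimal NumPy sketch with assumed array shapes, not our actual implementation.

```python
import numpy as np

def data_term(u_v, s_v, y_obs, d_mask, s_obs, p, eta_d):
    """Eq. (2): masked quadratic disparity fidelity plus semantic fidelity.
    u_v: (H, W, 3) plane parameters; s_v, s_obs: (H, W, L) label scores;
    y_obs, d_mask: (H, W) observed disparity and validity mask;
    p: (H, W, 3) homogeneous pixel coordinates; eta_d: scalar weight."""
    y_pred = np.sum(p * u_v, axis=-1)                      # p^T u^v at every pixel
    depth_fidelity = np.sum(d_mask * (y_pred - y_obs) ** 2)
    semantic_fidelity = eta_d * np.sum((s_v - s_obs) ** 2)
    return depth_fidelity + semantic_fidelity
```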

Regularization term. The regularization term encourages both \(\mathbf {u}^v\) and \(\mathbf {s}^v\) to be piecewise constant while having their discontinuities aligned. Following the Mumford-Shah formalism, we express this as

$$\begin{aligned} {E_{r,v}(\mathbf{{u}}^v,\mathbf{{s}}^v)}&= {\eta _{rv}\sum \limits _{\mathbf{{x}} \in \varOmega } \min (\alpha _1 \Vert \mathbf{{K}}\mathbf{{u}}^v\Vert ^2 + \Vert \mathbf{{K}}\mathbf{{s}}^v\Vert ^2, \lambda _1)\;,} \end{aligned}$$
(3)

where \(\eta _{rv}\) and \(\alpha _1\) are parameters controlling the strength of the smoothness and of the coupling between the two modalities, and \(\lambda _1\) is the truncation parameter. Here, we further rely on the oriented gradient operator \(\mathbf{{K}}\) of [26], which computes an image-adaptive gradient for each channel of \(\mathbf {u}^v\) and \(\mathbf {s}^v\). More specifically, the oriented gradient operator \(\mathbf{{K}}\) at location \(\mathbf {x}\) is defined by \(T_I(\mathbf {x})\nabla \), where \(T_I\) is an image-based anisotropic diffusion tensor. This tensor is defined as

$$\begin{aligned} {T_I} = {\exp (-\beta |\nabla I|^\gamma )\,\mathbf{n}\mathbf{n}^T + \mathbf{n}^\bot \mathbf{n}^{\bot T},} \end{aligned}$$
(4)

where \(\mathbf{n} = \frac{\nabla I}{|\nabla I|}\) and \(\mathbf{n}^\bot \) is the normal vector to the image gradient. Note that \(T_I\) is a symmetric matrix, and hence \(\mathbf{{K}} = T_I(\mathbf {x})\nabla \) is a linear operator.
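
A possible per-pixel computation of \(T_I\) and of the oriented gradient \(\mathbf{{K}}f = T_I\nabla f\) is sketched below using finite-difference gradients; the parameter values and the small \(\epsilon \) guarding against division by zero are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def diffusion_tensor(img, beta=10.0, gamma=0.8, eps=1e-8):
    """Anisotropic diffusion tensor of Eq. (4) for a grayscale image img (H, W).
    Returns T of shape (H, W, 2, 2): smoothing across strong edges is attenuated,
    smoothing along edges (direction n_perp) is kept."""
    gy, gx = np.gradient(img)
    grad = np.stack([gx, gy], axis=-1)
    mag = np.linalg.norm(grad, axis=-1, keepdims=True)
    n = grad / (mag + eps)                               # unit vector along the image gradient
    n_perp = np.stack([-n[..., 1], n[..., 0]], axis=-1)  # orthogonal direction
    w = np.exp(-beta * mag[..., 0] ** gamma)             # edge-dependent attenuation
    return (w[..., None, None] * n[..., :, None] * n[..., None, :]
            + n_perp[..., :, None] * n_perp[..., None, :])

def oriented_gradient(f, T):
    """Apply K = T_I * grad to a scalar field f (one channel of u^v or s^v)."""
    fy, fx = np.gradient(f)
    g = np.stack([fx, fy], axis=-1)
    return np.einsum('hwij,hwj->hwi', T, g)
```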

3.2 Adding a Hidden Layer for Depth and Semantics Hallucination

Recall that our goal is to produce a complete scene model from incomplete depth measurements. While the functional introduced in the previous section can complete the missing depth it still only represents the visible information. As such, it is unable to infer the scene depth and semantics behind the foreground objects. To address this limitation, we incorporate a hidden layer that focuses on modeling and hallucinating the depth and semantics of the background scene.

Formally, we split the semantic class set \(\mathcal {L}\) into two subsets, one for the foreground classes \(\mathcal {L}_f\) and the other for the background ones \(\mathcal {L}_b\). At each pixel location \(\mathbf {x}\), we introduce two additional variables, \(\mathbf {u}^h(\mathbf {x})\in \mathbb {R}^3\) and \(\mathbf {s}^h(\mathbf {x})\in \mathbb {R}^{L}\), which encode the (potentially occluded) disparity value and semantic label of the hidden scene layer at \(\mathbf {x}\). Furthermore, we define a binary variable \(m(\mathbf {x})\) indicating the foreground class mask (i.e., where the hidden layer is invisible). In other words, for the pixels where \(m(\mathbf {x})=1\), there are neither disparity measurements nor semantic predictions for the hidden layer variables \(\mathbf {u}^h(\mathbf {x})\) and \(\mathbf {s}^h(\mathbf {x})\). Note that this binary variable is not strictly necessary, since this information can be extracted from the semantics variables. However, as will be discussed in Sect. 4, introducing it makes the resulting problem easier to optimize.

To hallucinate the depth and semantics of the hidden scene layer, we rely on the following assumptions/constraints: in the parts of the image that correspond to foreground, (1) the hidden layer should be jointly piecewise constant in \(\mathbf{{u}}^h\) and \(\mathbf{{s}}^h\), and (2) given training data, the hidden layer variables should follow the data statistics; (3) in the parts of the image that correspond to background, the visible and hidden layers should agree; (4) the mask and the visible semantics should be coherent. Below, we formalize these assumptions by defining a corresponding set of energy terms and linear constraints.

(1) Piecewise constancy. Similarly to the visible layer, we define a regularization term \(E_{r,h}(\mathbf{{u}}^h,\mathbf{{s}}^h, m)\) that encourages \(\mathbf {u}^h\) and \(\mathbf {s}^h\) to be piecewise constant and have aligned discontinuities. Here, however, we only enforce this term on the foreground regions, i.e., where \(m(\mathbf {x})=1\). This can be expressed as

$$\begin{aligned} {E_{r,h}(\mathbf{{u}}^h,\mathbf{{s}}^h, m)} = {\eta _{rh}\sum \limits _{\mathbf{{x}}} m \cdot \min (\alpha _2 \Vert \nabla \mathbf{{u}}^h\Vert ^2 + \Vert \nabla \mathbf{{s}}^h\Vert ^2, \lambda _2)\;,} \end{aligned}$$
(5)

where \(\eta _{rh}\) and \(\alpha _2\) are parameters controlling the strength of the smoothness and of the coupling between the two modalities, and \(\lambda _2\) is the truncation parameter. As there are no image cues for the hidden layer in the foreground regions, we use the standard gradient to penalize the discontinuities.

(2) Training data statistics. Given training data, we compute an average disparity map for each background class \(k\in \mathcal {L}_b\), denoted by \(\{y_k^s(\mathbf {x})\}_{\mathbf {x}\in \varOmega }\). We refer the reader to Sect. 5 for the details of this process. We then encourage the disparity and semantics of the hidden layer to be consistent with this statistics, which can be expressed as

$$\begin{aligned} {E_s(\mathbf{{u}}^h,\mathbf{{s}}^h,m)} = {\eta _{s}\sum \limits _{\mathbf{{x}}}m\cdot \sum \limits _{k\in \mathcal {L}_b}\mathbf {s}^h_k(\mathbf{{p}}^T \mathbf{{u}}^h - y_k^s)^2\;,} \end{aligned}$$
(6)

where \(\eta _s\) is a weight defining the influence of this term.

(3) Agreement between the two layers. These constraints can be directly expressed as

$$\begin{aligned} {\mathbf{{u}}^h(\mathbf{{x}})} = {\mathbf{{u}}^v(\mathbf{{x}}), \quad \mathbf{{s}}^h(\mathbf{{x}}) = \mathbf{{s}}^v(\mathbf{{x}}), \quad \forall \mathbf{{x}} \;| \;m(\mathbf{{x}}) = 0\;.} \end{aligned}$$
(7)

(4) Coherent mask and visible semantics. We encourage the mask and the visible semantics to agree by penalizing the discrepancy between the total probability mass of foreground classes predicted by \(\mathbf {s}^v\) and the mask variable at every pixel. This can be written as

$$\begin{aligned} {E_c(\mathbf{{s}}^v, m)} = {\eta _{c}\sum \limits _{\mathbf{{x}}} \big (\sum \limits _{k\in \mathcal {L}_f}\mathbf{{s}}_k^v - m + b\big )^2\;,} \end{aligned}$$
(8)

where \(\eta _c\) is a weighting parameter and b is a bias for the foreground mask.

Altogether, our two-layer approach to completing and hallucinating depth and semantics can be expressed as the discrete-continuous optimization problem

$$\begin{aligned} {\min \limits _{\mathbf{{u}}^v, \mathbf{{s}}^v, \mathbf{{u}}^h, \mathbf{{s}}^h, m}}&{E_d + E_{r,v} + E_{r,h} + E_s + E_c}\\ \mathrm{s.t.}\qquad&{ \mathbf{{u}}^h(\mathbf{{x}}) = \mathbf{{u}}^v(\mathbf{{x}}) ,\;\mathbf{{s}}^h(\mathbf{{x}}) = \mathbf{{s}}^v(\mathbf{{x}}) \quad \;\forall \mathbf{{x}} \;| \;m(\mathbf{{x}}) = 0} \nonumber \\&{\sum \limits _k \mathbf{{s}}^v_k(\mathbf{{x}}) = 1, \; \mathbf{{s}}_j^v(\mathbf{{x}})\ge 0,\; \sum \limits _k \mathbf{{s}}^h_k(\mathbf{{x}}) = 1, \; \mathbf{{s}}_j^h(\mathbf{{x}})\ge 0, \quad \forall \mathbf{{x}}, \; j }\nonumber \\&{ m(\mathbf{{x}}) \in \{0,1\}, \;\quad \forall \mathbf{{x}}\;}\nonumber \end{aligned}$$
(9)

where \(E_d\), \(E_{r,v}\), \(E_{r,h}\), \(E_s\), \(E_c\) are defined in Eqs. (2), (3), (5), (6) and (8), respectively. The first two constraints come from Eq. (7), the third and fourth ones encode the simplex domain of the probability distributions, and the fifth one the binary nature of the foreground mask m.

4 Optimizing Our Two-Layer Model

The optimization problem encoding our two-layer problem, defined in Eq. (9), is challenging to solve, since it has a large number of coupled discrete and continuous variables. Fortunately, given the disparity and semantics, optimizing the mask is straightforward; the optimal mask value at each pixel can be computed in a closed form. Furthermore, when the mask variables are given, the energy functional decomposes into two subproblems: one for the visible layer, and one for the hidden one. These subproblems correspond to multi-modal versions of the Mumford-Shah functional. An efficient first-order primal-dual algorithm was introduced by [36] to tackle the single-modality case. We show that this algorithm can be extended to address the multi-modal scenario.

We therefore adopt an alternating procedure to minimize Eq. (9). This procedure consists of three steps repeated iteratively. In the first and second step, we optimize w.r.t. the visible and hidden layer, respectively, and, in the third step, we update the mask variables. Since our procedure decreases the energy functional in every cycle, it converges to a local minimum. Below, we first review the first-order primal-dual algorithm of [36] for solving the Mumford-Shah functional and then discuss the solution to each step of our minimization strategy.
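
The overall alternating scheme can be summarized as follows; this is a high-level sketch in which the three sub-solvers are hypothetical callables standing for the primal-dual subproblems of Sects. 4.1 and 4.2 and the projection/mask update of Sect. 4.3.

```python
def optimize_two_layer_model(vars0, solve_visible, solve_hidden,
                             project_and_update_mask, n_outer=20):
    """Alternating minimization of Eq. (9). The three callables are placeholders
    for the sub-solvers described in Sects. 4.1-4.3; each step does not increase
    the energy, so the procedure converges to a local minimum."""
    u_v, s_v, u_h, s_h, m = vars0
    for _ in range(n_outer):
        u_v, s_v = solve_visible(u_v, s_v, m)            # visible layer; hidden layer and mask fixed
        u_h, s_h = solve_hidden(u_h, s_h, u_v, s_v, m)   # hidden layer; visible layer and mask fixed
        u_v, s_v, u_h, s_h, m = project_and_update_mask(u_v, s_v, u_h, s_h)  # Sect. 4.3
    return u_v, s_v, u_h, s_h, m
```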

Primal-Dual Algorithm for the Mumford-Shah Functional. The primal-dual algorithm in [36] aims to solve a non-convex optimization problem of form

$$\begin{aligned} {\min \limits _{\mathbf{{y}}} D(\mathbf{{y}})+R(\mathbf{{A}} \mathbf{{y}})\;}, \end{aligned}$$
(10)

where \(D(\cdot )\) usually denotes a data fidelity term, \(R(\cdot )\) is the regularization term encouraging piecewise smoothness in the Mumford-Shah functional, and \(\mathbf{{A}}\) is a linear operator, which can be the gradient operator \(\nabla \) or the oriented gradient operator \(\mathbf{{K}}\) that additionally encodes image gradient information.

The primal-dual formulation introduces a dual variable \(\mathbf{q}\) and solves the equivalent saddle-point problem

$$\begin{aligned} {\min \limits _{\mathbf{y}}\max \limits _{\mathbf{q}}\;\;D(\mathbf{{y}})+ \langle \mathbf{q}, \mathbf{{A}}\mathbf{y}\rangle - R^\star (\mathbf{q}),} \end{aligned}$$
(11)

where \(R^*\) is the conjugate of the regularization term. Following the fast Mumford-Shah method of [36], the primal-dual update equations can be written as

$$\begin{aligned} {\mathbf{q}^{n+1}}&= {prox_{\sigma _n, R^\star }(\mathbf{q}^n + \sigma _n \mathbf{{A}} \mathbf{{\bar{y}}}^n),}\quad {\mathbf{y}^{n+1}} = {prox_{\tau _n, D}(\mathbf{y}^n - \tau _n \mathbf{{A}}^{T} \mathbf{q}^{n+1}),\;\;} \end{aligned}$$
(12)
$$\begin{aligned} {\theta _n}&= {\frac{1}{\sqrt{1+4\tau _n}},\;\tau _{n+1} = \theta _n\tau _n,\;\sigma _{n+1} = \frac{\sigma _n}{\theta _n},}\end{aligned}$$
(13)
$$\begin{aligned} {\mathbf{{\bar{y}}}^{n+1}}&= {\mathbf{y}^{n+1}+\theta _n(\mathbf{y}^{n+1} - \mathbf{y}^n),} \end{aligned}$$
(14)

where \(prox_{\cdot , \cdot }(\cdot )\) denotes the proximal operator. The convergence [41] of this primal-dual procedure for a convex problem depends on the parameter values \(\tau \) and \(\sigma \), which must satisfy \(\tau \sigma \Vert \mathbf{{A}}\Vert ^2\le 1\). For non-convex functionals, [36] shows that the algorithm generates a bounded solution and converges empirically.
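
For reference, the iteration of Eqs. (12)–(14) can be sketched as follows for a generic problem of the form (10); the linear operator, its adjoint, and the two proximal maps (whose closed forms are given in the supplementary material) are passed in as callables, and the initial step sizes are illustrative assumptions.

```python
import numpy as np

def primal_dual_mumford_shah(y0, A, At, prox_D, prox_Rstar,
                             tau0=0.25, sigma0=0.5, n_iter=500):
    """Accelerated primal-dual iteration of Eqs. (12)-(14).
    A / At apply the linear operator and its adjoint; prox_D(v, tau) and
    prox_Rstar(q, sigma) are the proximal maps of the data term and of the
    conjugate regularizer."""
    y, y_bar = y0.copy(), y0.copy()
    q = np.zeros_like(A(y0))
    tau, sigma = tau0, sigma0
    for _ in range(n_iter):
        q = prox_Rstar(q + sigma * A(y_bar), sigma)   # dual ascent, Eq. (12)
        y_new = prox_D(y - tau * At(q), tau)          # primal descent, Eq. (12)
        theta = 1.0 / np.sqrt(1.0 + 4.0 * tau)        # step-size update, Eq. (13)
        tau, sigma = theta * tau, sigma / theta
        y_bar = y_new + theta * (y_new - y)           # over-relaxation, Eq. (14)
        y = y_new
    return y
```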

Our approach uses a similar primal-dual scheme to optimize the subproblems corresponding to the visible and hidden layers. These subproblems have a specific functional form for D and R. Moreover, they rely on two modalities, \(\mathbf {u}\) and \(\mathbf {s}\). Below, we develop our algorithms for the visible and hidden layers, respectively. We only provide the formulation of D and R as in Eq. (10) and refer the reader to the supplementary material for the details of the proximal operators.

4.1 Optimization w.r.t. the Visible Layer \(\mathbf {s}^v\), \(\mathbf {u}^v\)

In this step, we fix the variables of the hidden layer, \(\mathbf{{u}}^h\) and \(\mathbf{{s}}^h\), as well as the foreground mask m, and optimize the subproblem defined on the visible layer. We also relax the consistency constraints of Eq. (9) at this step; they will be enforced after optimizing w.r.t. the visible and hidden layers. The resulting subproblem can thus be written as

$$\begin{aligned} {\min \limits _{\mathbf{{u}}^v, \mathbf{{s}}^v}} \qquad&{E_d(\mathbf{u}^v,\mathbf{s}^v) + E_{r,v}(\mathbf{u}^v,\mathbf{s}^v) + E_c(s^v,m).} \end{aligned}$$
(15)

Note that the subproblem objective can be written in the standard Mumford-Shah functional form when it is optimized w.r.t. either \(\mathbf {u}^v\) or \(\mathbf {s}^v\). Therefore, to optimize this subproblem with the primal-dual algorithm, we further divide the task into two steps.

Optimizing \(\mathbf {u}^v\) with fixed \(\mathbf {s}^v\). By fixing the semantic variable \(\mathbf {s}^v\), we can write the objective in Eq. (15) in the standard Mumford-Shah form, with

$$\begin{aligned} {D_{\mathbf {u}^v}(\mathbf {u}^v)}&= {\sum \limits _{\mathbf{{x}}}\Vert d(\mathbf{{p}}^T \mathbf{{u}}^v- y^o)\Vert ^2\;,} \end{aligned}$$
(16)
$$\begin{aligned} {R_{\mathbf {u}^v}(\mathbf{{K}}\mathbf {u}^v)}&= {\eta _{rv}\sum \limits _{\mathbf{{x}} \in \varOmega } \min (\alpha _1 \Vert \mathbf{{K}}\mathbf{{u}}^v\Vert ^2 + e_{uv}, \lambda _1)\;,} \end{aligned}$$
(17)

where \(e_{uv}=\Vert \mathbf{{K}} \mathbf {s}^v\Vert ^2\). Here, \(\Vert \mathbf{{K}}\mathbf{{u}}\Vert ^2:=\sum _{j}\Vert \mathbf{{K}} \mathbf {u}_j\Vert ^2\) denotes the squared Euclidean norm summed over the channels, where \(\mathbf {u}_j\) is the j-th channel of the multi-channel variable \(\mathbf {u}\).

Optimizing \(\mathbf {s}^v\) with fixed \(\mathbf {u}^v\). We then fix the disparity variable \(\mathbf {u}^v\), and write the objective in Eq. (15) in the standard form, which yields

$$\begin{aligned} {D_{\mathbf {s}^v}(\mathbf {s}^v)}&= {\sum \limits _{\mathbf{{x}}}\eta _d\Vert (\mathbf{{s}}^v- \mathbf{{s}}^o)\Vert ^2+\eta _{c}\sum \limits _{\mathbf{{x}}} (\mathbf{{f}}^T \mathbf{{s}}^v - m + b)^2 \;,}\end{aligned}$$
(18)
$$\begin{aligned} {R_{\mathbf {s}^v}(\mathbf{{K}}\mathbf {s}^v)}&= {\eta _{rv}\sum \limits _{\mathbf{{x}} \in \varOmega } \min (\alpha _1 e_{sv} + \Vert \mathbf{{K}}\mathbf{{s}}^v\Vert ^2, \lambda _1) \;,} \end{aligned}$$
(19)

where \(e_{sv}=\Vert \mathbf{{K}} \mathbf {u}^v\Vert ^2\), and \(\mathbf{{f}}\) is a binary vector with 1s in the positions corresponding to the foreground classes and 0s everywhere else.

4.2 Optimization w.r.t. the Hidden Layer \(\mathbf {s}^h\), \(\mathbf {u}^h\)

Let us now fix the disparity and semantics of the visible layer, \(\mathbf {u}^v\) and \(\mathbf {s}^v\), as well as the foreground mask m, and optimize the functional w.r.t. the hidden layer variables \(\mathbf {u}^h\), \(\mathbf {s}^h\). We consider the following equivalent subproblem

$$\begin{aligned} {\min \limits _{\mathbf{{u}}^h, \mathbf{{s}}^h}} \qquad&{E_s(\mathbf{u}^h,\mathbf{s}^h,m) + E_{r,h}(\mathbf{u}^h,\mathbf{s}^h,m) + E_p(\mathbf {u}^h,\mathbf {s}^h)} \end{aligned}$$
(20)

where \(E_p(\cdot )\) is a regularization term of the following form:

$$\begin{aligned} {E_p(\mathbf {u}^h,\mathbf {s}^h) = \gamma _{uh} \sum \limits _{\mathbf{{x}}} (1-m)(\mathbf{{p}}^T \mathbf{{u}}^h - \mathbf{{p}}^T \mathbf{{u}}^v)^2 + \gamma _{sh} \sum \limits _{\mathbf{{x}}} (1-m)(\mathbf{{s}}^h - \mathbf{{s}}^v)^2\;.} \end{aligned}$$
(21)

Here, \(\gamma _{uh}\) and \(\gamma _{sh}\) are large weights (typically 1000); in essence, we use a soft version of the consistency constraints to regularize the problem, which empirically yields a more stable optimization step. As for the visible layer, we divide the optimization of this subproblem into two steps.

Optimizing \(\mathbf {u}^h\) with fixed \(\mathbf {s}^h\). Fixing the semantic variable \(\mathbf {s}^h\), and writing the objective in Eq. (20) in the standard form yields

$$\begin{aligned}&{D_{\mathbf {u}^h}(\mathbf {u}^h) =\gamma _{uh} \sum \limits _{\mathbf{{x}}} (1-m)(\mathbf{{p}}^T \mathbf{{u}}^h - \mathbf{{p}}^T \mathbf{{u}}^v)^2 +m\;\eta _{s}\sum \limits _{j}s_j^h(\mathbf{{p}}^T \mathbf{{u}}^h - y_j^s)^2\;,} \end{aligned}$$
(22)
$$\begin{aligned}&{R_{\mathbf {u}^h}(\nabla \mathbf {u}^h) = \eta _{rh} m\min (\alpha _2 \Vert \nabla \mathbf{{u}}^h\Vert ^2 + e_{uh},\lambda _2)\;,} \end{aligned}$$
(23)

where \(e_{uh}=\Vert \nabla \mathbf{{s}}^h\Vert ^2\).

Optimizing \(\mathbf {s}^h\) with fixed \(\mathbf {u}^h\). We then fix the disparity variable \(\mathbf {u}^h\), and write the objective in Eq. (20) in the standard form, which yields

$$\begin{aligned} {D_{\mathbf {s}^h}(\mathbf {s}^h)}&={\gamma _{sh} \sum \limits _{\mathbf{{x}}} (1-m)(\mathbf{{s}}^h - \mathbf{{s}}^v)^2 + m\;\eta _{s}\sum \limits _{j}s_j^h(\mathbf{{p}}^T \mathbf{{u}}^h - y_j^s)^2 \;,} \end{aligned}$$
(24)
$$\begin{aligned} {R_{\mathbf {s}^h}(\nabla \mathbf {s}^h)}&= {\eta _{rh}m\sum \limits _{\mathbf{{x}} \in \varOmega } \min (\alpha _2 e_{sh} + \Vert \nabla \mathbf{{s}}^h\Vert ^2, \lambda _2)\;,} \end{aligned}$$
(25)

where \(e_{sh}=\Vert \nabla \mathbf {u}^h\Vert ^2\).

4.3 Adding Constraints and Updating the Foreground Mask m

After computing the visible and hidden variables without the constraints, we now project them onto the constraint set defined in Eq. (9). The projection onto the consistency constraints of Eq. (7) is computed as \(\mathbf {s}^v=\mathbf {s}^h=\frac{\mathbf {s}^v+\mathbf {s}^h}{2}\) and \(\mathbf {u}^v=\mathbf {u}^h=\frac{\mathbf {u}^v+\mathbf {u}^h}{2}\) at every pixel where \(m(\mathbf {x})=0\). We then project the semantics \(\mathbf {s}^v\) and \(\mathbf {s}^h\) onto the probability simplex.
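
A possible implementation of this projection step is sketched below; the simplex projection uses the standard sorting-based algorithm, which is an implementation choice of ours rather than a detail specified above.

```python
import numpy as np

def project_simplex(s):
    """Euclidean projection of each pixel's label vector onto the probability
    simplex (sorting-based algorithm). s: (..., L)."""
    u = np.sort(s, axis=-1)[..., ::-1]
    css = np.cumsum(u, axis=-1) - 1.0
    idx = np.arange(1, s.shape[-1] + 1)
    rho = np.sum(u - css / idx > 0, axis=-1, keepdims=True)   # number of active components
    theta = np.take_along_axis(css, rho - 1, axis=-1) / rho
    return np.maximum(s - theta, 0.0)

def project_constraints(u_v, s_v, u_h, s_h, m):
    """Average the two layers wherever m = 0 (Eq. (7)), then project the
    semantics onto the simplex. m: (H, W); u_*: (H, W, 3); s_*: (H, W, L)."""
    bg = (m == 0)[..., None]
    u_avg, s_avg = 0.5 * (u_v + u_h), 0.5 * (s_v + s_h)
    u_v, u_h = np.where(bg, u_avg, u_v), np.where(bg, u_avg, u_h)
    s_v, s_h = np.where(bg, s_avg, s_v), np.where(bg, s_avg, s_h)
    return u_v, project_simplex(s_v), u_h, project_simplex(s_h)
```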

Given the semantic and disparity variables in the visible and hidden layers, the foreground mask variables are decoupled into a set of independent variables for each location \(\mathbf {x}\). The problem can then be re-written as

$$\begin{aligned} {\min \limits _{m}\sum \limits _{\mathbf {x}} w(\mathbf {x})m(\mathbf {x}) \;, \;\; \text {s.t.}\;\; m(\mathbf {x})\in \{0,1\},} \end{aligned}$$
(26)

where the weight \(w(\mathbf {x})\) is given by

$$\begin{aligned} {w(\mathbf {x})=\eta _{rh}\cdot \min (\alpha _2 \Vert \nabla \mathbf{{u}}^h\Vert ^2 + \Vert \nabla \mathbf{{s}}^h\Vert ^2, \lambda _2) + \eta _{s}\sum \limits _{j}s_j^h(\mathbf{{p}}^T \mathbf{{u}}^h - y_j^s)^2 + \eta _{c}\big (1-2(\mathbf{{f}}^T \mathbf{{s}}^v + b)\big ).} \end{aligned}$$
(27)

Ultimately, \(m(\mathbf {x}) = 1\) if \(w(\mathbf {x})<0\), and 0 otherwise.
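
This per-pixel update can be implemented directly; the sketch below uses finite-difference gradients and assumed array shapes, not our actual GPU implementation.

```python
import numpy as np

def grad_sq(f):
    """Squared norm of the spatial gradient of a multi-channel field f: (H, W, C)."""
    gy, gx = np.gradient(f, axis=(0, 1))
    return np.sum(gx ** 2 + gy ** 2, axis=-1)

def update_mask(u_h, s_h, s_v, p, y_stats, fg_idx, bg_idx,
                eta_rh, eta_s, eta_c, alpha2, lam2, b):
    """Closed-form mask update of Eqs. (26)-(27).
    p: (H, W, 3) homogeneous coordinates; y_stats: (H, W, |L_b|) class-wise
    disparity statistics, ordered like bg_idx; fg_idx / bg_idx index the
    foreground / background channels of the semantic variables."""
    smooth = eta_rh * np.minimum(alpha2 * grad_sq(u_h) + grad_sq(s_h), lam2)
    y_h = np.sum(p * u_h, axis=-1, keepdims=True)                        # p^T u^h
    stats = eta_s * np.sum(s_h[..., bg_idx] * (y_h - y_stats) ** 2, axis=-1)
    coherence = eta_c * (1.0 - 2.0 * (np.sum(s_v[..., fg_idx], axis=-1) + b))
    w = smooth + stats + coherence                                       # Eq. (27)
    return (w < 0).astype(np.uint8)                                      # Eq. (26)
```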

5 Experiments

To demonstrate the effectiveness of our approach, we evaluated our method on two publicly available outdoor datasets: KITTI [17] and Stixel [18]. Below, we discuss our results on both datasets.

5.1 Experimental Setup

Initialization. We used SLIC [42] to produce an over-segmentation of the image, and fit a plane to each superpixel using the corresponding sparse depth observations. The resulting plane parameters are used to initialize \(\mathbf{{u}}^v\) at every pixel of the superpixel. For superpixels lying in large holes, where no observations are available, we initialized the plane parameters to zero.
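
A minimal sketch of this plane-fitting initialization, assuming a precomputed superpixel label map (e.g., from SLIC) and using a least-squares fit:

```python
import numpy as np

def init_plane_params(labels, y_obs, d_mask):
    """Least-squares plane fit per superpixel to initialize u^v.
    labels: (H, W) superpixel ids; y_obs: (H, W) sparse disparities;
    d_mask: (H, W) validity mask. Superpixels without enough valid
    observations keep zero plane parameters."""
    H, W = labels.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    P = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)  # homogeneous coords
    u = np.zeros((H, W, 3))
    for sp in np.unique(labels):
        sel = (labels == sp) & (d_mask > 0)
        if sel.sum() >= 3:                        # at least 3 points to fit a plane
            params, *_ = np.linalg.lstsq(P[sel], y_obs[sel], rcond=None)
            u[labels == sp] = params
    return u
```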

We adopted the FCN-32s model [43], followed by smoothing via a fully-connected CRF [44], to initialize \(\mathbf{{s}}^v\) and the foreground mask m, as well as to provide the observations \(\mathbf{{s}}^o\). We initialized \(\mathbf{{u}}^h\) and \(\mathbf{{s}}^h\) from \(\mathbf{{u}}^v\) and \(\mathbf{{s}}^v\), setting the foreground regions to 0.

Ground-truth for the hidden layer. To the best of our knowledge, no ground-truth is available for the hidden layer variables. In order to provide a quantitative evaluation, we generated ground truth in two different ways: (1) Manual annotation. We first annotated the hidden semantic labels, based on which we then filled in the hidden depth using the planes fitted to the superpixels around the true foreground mask. (2) Image and depth composition. We overlaid an object from one image (foreground image) on a background image of an unoccluded scene. Since the camera intrinsics are roughly the same for both images, the depth map remains consistent after inserting the object at the same location as in the foreground image.

Co-occurrence statistics. To obtain the class-dependent disparity statistics \(\{y^s_k\}\) in Eq. (6), we follow the intuition that semantics are often highly correlated with image location, which was exploited, for example, in [45] for depth prediction. To this end, we adopt a superpixel-based approach: for each superpixel j in the test image, we collect the plane parameters of the corresponding pixels in all the training images. For each class k, we then cluster these plane parameters and keep the center of the largest cluster. We finally generate \(y^s_k\) as the disparity obtained from the plane parameters of this center.
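
The following sketch illustrates this statistic for a single test superpixel and a single background class; the choice of k-means as the clustering algorithm is an assumption made for illustration, as the specific clustering method is not essential.

```python
import numpy as np
from sklearn.cluster import KMeans  # clustering choice is an assumption

def class_plane_statistic(train_planes, n_clusters=5):
    """Cluster the plane parameters (N, 3) gathered for one test superpixel and
    one background class k over the training images, and return the center of
    the largest cluster. The per-pixel statistic y_k^s is then p(x)^T u for
    every pixel x of the superpixel."""
    if len(train_planes) == 0:
        return None
    k = min(n_clusters, len(train_planes))
    km = KMeans(n_clusters=k, n_init=10).fit(np.asarray(train_planes))
    largest = np.argmax(np.bincount(km.labels_))
    return km.cluster_centers_[largest]
```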

Baselines. Note that our scene model consists of two layers. For the visible layer, depth estimation translates to the usual depth completion problem. We therefore compare the results of our visible layer with those of the classical method of [28] and of the more recent technique of [26].

For the hidden layer, since no other method has tackled the outdoor scenario in a fully-automatic manner, we rely on the following two-stage strategy. We first generate a foreground mask using the state-of-the-art FCN-32s semantic labeling model [43], followed by smoothing with a fully-connected CRF [44]. Let us denote by Fg-Mask this foreground mask and by Bg-Mask the remaining image pixels. In Bg-Mask, the appearance is known, and thus the same depth completion methods as before can be employed. In Fg-Mask, however, no appearance information about the background is available. We therefore apply the technique of [27] to inpaint this area, which, to the best of our knowledge, remains the most mature method for depth completion without intensity information. This yields two baselines, which we refer to as Baseline-1 (semantic segmentation followed by [28] + [27]) and Baseline-2 (semantic segmentation followed by [26] + [27]). To compare the different algorithms, we make use of the following metrics: (1) visible-rmse, the root-mean-square error (RMSE) over the entire depth map; (2) hidden-rmse, the RMSE over the depth hallucinated underneath the ground-truth foreground mask.
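
Both metrics reduce to a (possibly masked) RMSE; a small illustrative sketch with assumed array shapes:

```python
import numpy as np

def rmse(pred, gt, mask=None):
    """Root-mean-square error between predicted and ground-truth disparity maps,
    optionally restricted to a boolean mask."""
    if mask is None:
        mask = np.ones_like(gt, dtype=bool)
    diff = pred[mask] - gt[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# visible-rmse: rmse(disp_estimate, disp_gt)
# hidden-rmse:  rmse(disp_hidden_estimate, disp_hidden_gt, mask=fg_mask_gt)
```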

Fig. 2.

Qualitative results on the KITTI dataset. For the disparity maps, red denotes large values and blue small ones. From top to bottom: RGB image, ground-truth visible disparity map, sparse observations with large holes, our completed disparity map, two baselines for the visible layer, ground truth disparity for the hidden layer, our disparity for the hidden layer, and two baselines for the hidden layer. Note that our method can remove the foreground as well as accurately fill in the background disparity behind the foreground objects. Compared to the baselines, our approach better completes the disparity for both the visible and hidden layers. (Color figure online)

5.2 Results on KITTI

As a first dataset, we utilized three subsets of the KITTI data annotated with semantic labels and/or disparity maps, provided by (i) Ladický et al. [46], i.e., 60 aligned images with dense disparity maps and accurate semantic labels; (ii) Xu et al. [47], i.e., 107 images with accurate semantic labels; and (iii) Ros et al. [48], i.e., 146 images with accurate semantic labels. Note that only Ladický et al. [46] provide ground-truth disparity maps. However, this subset is constrained in terms of the scene types it depicts, i.e., mostly residential areas. To make our evaluation more meaningful, we therefore only used 40 images of the first subset as test images, complemented by 14 images from the other subsets. To obtain the ground-truth disparity maps for these 14 images, we employed the MC-CNN-acrt stereo matching algorithm [49], which ranks at the top of the KITTI stereo challenge. To avoid biasing our conclusions with these different types of ground-truth, we report results on the entire set, \(test-54\), and on the two subsets, \(sub-40\) and \(sub-14\), respectively. We also partitioned the data into Manhattan (MH: 35 images) and Non-Manhattan (NMH: 19 images) scenes, so as to further evaluate our method on two different scene structures. The remaining images from the three subsets were split into 200 for training and 59 for validation. For semantics, we mapped the different label annotations to 9 classes and fine-tuned the FCN-32s of [43] to these 9 classes using the training data. We then defined car and pedestrian as the foreground classes.

In Table 1, we compare the results of our approach with the baselines for both the visible and hidden layers using the manually annotated ground truth. Note that we outperform the baselines in most cases. In particular, our approach yields a large improvement in the hidden regions of the image. This evidences that our two-layer model is well-suited for the task of hallucinating depth, and thus constitutes a significant step towards building scene models despite the presence of occluding foreground objects. The fact that our model also yields more accurate depth estimates in the visible regions than state-of-the-art depth completion methods further suggests that it effectively leverages the visible information. Additionally, we created a test set of 14 images using the composition strategy described in the previous section, which gives us access to the ground-truth hidden depth. Note that these 14 images were chosen to respect the scene type ratio of the original test data. The resulting hidden-rmse of our method is 7.72, which outperforms Baseline-1 (9.76) and Baseline-2 (10.94). Figure 2 provides a qualitative comparison of our results with the ground truth and the baselines.

In Table 2, we show the results of our semantic labeling estimates for the hidden regions. Here, since no baseline is available for this task, we only report the results of our approach. These results show that, while hallucinating small classes, such as fence and poles, remains challenging, our model yields good accuracy on the more common and larger classes. Note that effectively handling the small classes in outdoor semantic labeling is known to be difficult even when leveraging visible information. Finally, we observed that the semantic labeling accuracy in the visible layer did not significantly change compared to our initialization. In particular, we obtained \(88.51\,\%\) per-pixel accuracy and \(67.28\,\%\) average per-class accuracy. Figure 3 provides qualitative semantic segmentation results on the KITTI dataset.

To further illustrate the effect of our approach on the visible semantics, we initialized our algorithm with the results of FCN-32s only. The per-pixel and per-class accuracies of FCN-32s were \(87.86\,\%\) and \(69.98\,\%\), respectively. Our method improved the per-pixel accuracy to \(88.5\,\%\) and left the per-class one virtually unchanged (\(69.81\,\%\)). This also resulted in an improved visible-rmse of 5.01.

Table 1. Depth estimation. Quantitative comparison with several baselines for the visible and hidden depth, respectively.
Table 2. Estimating hidden semantics. Per-class and overall accuracy of our approach.
Fig. 3.

Qualitative results for semantic segmentation on the KITTI dataset. From top to bottom: RGB image, ground truth results and our results, ground truth disparity for the hidden layer, our disparity for the hidden layer, Baseline 1 and Baseline 2, ground truth semantics for the hidden layer, and our estimated semantics for the hidden layer. (Color figure online)

5.3 Results on Stixel

As a second experiment, we employed the Stixel dataset, which contains 500 images with corresponding noisy depth (disparity) maps and semantics, partitioned into 300 training images and 200 test images. Note that the disparity provided in this dataset was computed using a semi-global matching algorithm. Since ground-truth disparity is only partially available for this dataset, it is not possible to generate ground-truth disparity for the foreground mask as before. We therefore only provide a qualitative comparison of our approach with the baselines. The dataset contains 5 semantic classes, among which we define car and pedestrian as the foreground classes. The qualitative results on this dataset are shown in the Supplementary Material (Fig. 4). Note that, again, our approach produces more accurate disparity maps.

6 Conclusion

We have introduced a fully-automatic approach to jointly completing and hallucinating depth and semantics from an incomplete depth map and an RGB image. To this end, we have developed a two-layer model, encoding both the visible information and the information hidden behind the foreground objects. Furthermore, we have designed an effective strategy to optimize our two-layer model. Our experiments have evidenced that our approach can accurately fill the large holes in the input depth map, produce a semantic segmentation of the observed scene, and hallucinate the depth and semantics behind the foreground objects. In the future, we plan to extend our method to accumulate the information observed in a video sequence of a dynamic scene.