
1 Introduction

Semantic segmentation and depth estimation from single monocular images are two challenging tasks in computer vision, due to the lack of reliable scene cues, large variations in scene types, cluttered backgrounds, pose changes and object occlusions. Recently, driven by deep learning techniques, research on both tasks has made great progress and has started to benefit potential applications such as scene understanding [1], robotics [2], autonomous driving [3] and simultaneous localization and mapping (SLAM) systems [4]. Despite the successes of deep learning (especially CNNs) on monocular depth estimation [5,6,7,8,9] and semantic segmentation [10,11,12,13], most of these methods focus on learning robust regression yet scarcely consider the interaction between the two tasks. In fact, the two tasks share some common characteristics that can be exploited to benefit each other. For example, both the semantic segmentation and the depth of a scene reveal the layout and object shapes/boundaries. Recent work [14] also indicated that leveraging depth information from RGB-D data may facilitate semantic segmentation. Therefore, joint learning of both tasks should be considered so that they can reciprocally promote each other.

Fig. 1.

Illustration of our main idea. The two tasks (i.e., depth estimation and semantic segmentation) are progressively refined to form a task-alternate state sequence. At time slice t, we denote the task states as \(D_t\) and \(S_t\) respectively. Previous task-related experiences and information of the other task are adaptively propagated into the next new state (\(D_{t}\)) via a designed task-interactive module called the Task-Attentional Module (TAM). The evolution-alternate process of the dual tasks is finally framed into the proposed task-recursive learning.

Existing joint learning of two tasks falls into the category of multi-task learning, which has been extensively studied over the past few decades [15]. It involves many cross tasks, such as detection and classification [16, 17], depth estimation and image decomposition [18], image segmentation and classification [19], and also depth estimation and semantic segmentation [20,21,22]. However, such existing joint learning methods mainly involve shallow task-level interaction. For example, a shared deep network is used to extract common features for both tasks and bifurcates from a high-level layer to perform the two tasks individually [16,17,18,19, 21, 22]. As such, these methods involve little interaction because the tasks remain relatively independent. However, it is well known that the human learning system benefits from an iterative/looping interactive process between different tasks [23]. To take a simple commonsense case, alternately reading and writing can promptly improve human capability in both aspects. Therefore, we ask whether task-alternate learning (such as alternating segmentation and depth estimation) can go deeper with the breakthrough of deep learning.

To address this problem, in this paper we propose a novel joint Task-Recursive Learning (TRL) framework that closely loops semantic segmentation and depth estimation on indoor scenes. The interactions between both tasks are serialized as a newly-created time axis, as shown in Fig. 1. Along the time dimension, the two tasks \(\{D, S\}\) mutually collaborate to boost the performance of each other. In each interaction, the historical experiences of previous states (i.e., features of the previous time steps of the two tasks) are selectively propagated and help to estimate the new state, as depicted by the arc and horizontal black arrows. To properly propagate the information stream, we design a Task-Attentional Module (TAM) to correlate the two tasks, where the useful common information related to the current task is enhanced while task-irrelevant information is suppressed. Thus the learning process of the two tasks can be easily modularized into a sequence network, called the task-recursive learning network in this paper. Besides, considering the difficulty of high-resolution pixel-level prediction, we perform the recursive task learning on a sequence of coarse-to-fine scales, which progressively refines the details of the estimation results. Extensive experiments demonstrate that our proposed task-recursive learning benefits both tasks. In summary, the contributions of this paper are threefold:

  • Propose a novel joint Task-Recursive Learning (TRL) framework for semantic segmentation and depth estimation. By serializing the problem as a task-alternate time sequence, TRL can progressively refine and mutually boost the two tasks through proper propagation of the information stream.

  • Design a Task-Attentional Module (TAM) to encapsulate the interaction of the two tasks, which can thus be used in conventional networks as a general layer or module.

  • Validate the effectiveness of the deeply task-alternate mechanism, and achieve new state-of-the-art results for the dual tasks of depth estimation and semantic segmentation on the NYU Depth V2 and SUN RGBD datasets.

2 Related Work

Depth Estimation: Many works have been proposed for monocular depth estimation. Eigen et al. [5, 24] proposed a multi-stage CNN to resolve monocular depth prediction. Liu et al. [25] and Li et al. [26] utilized CRF models to capture local image texture and guide the network learning process. Recently, Laina et al. [7] proposed a fully convolutional network with up-projection for efficient upsampling. Xu et al. [6] employed multi-scale continuous CRFs as a deep sequential network. In contrast to these methods, our approach focuses on dual-task learning and attempts to utilize segmentation cues to promote depth prediction.

Semantic Segmentation: Most methods [10, 11, 27,28,29] conduct semantic segmentation from a single RGB image. As large RGB-D datasets were released, some approaches [30, 31] attempted to fuse depth information for better segmentation. Recently, Cheng et al. [32] computed affinity matrices from RGB images and HHA depth images to better upsample important locations. Different from these RGB-D based methods, our method does not directly use ground-truth depth but rather the estimated depth for semantic segmentation, and thus essentially falls into the category of RGB image segmentation.

Multi-task Learning: The generic multi-task learning problem [15] has been studied for a long time, and numerous methods have been developed in different research areas such as representation learning [33,34,35], transfer learning [36, 37] and computer vision [16, 17, 19, 38,39,40]. Here the most related works are the multi-task learning methods in computer vision. For example, the works in [21, 22] utilized CNNs with hierarchical CRFs and multiple decoders to obtain depth estimation and semantic segmentation. In [19], a cross-stitch unit was proposed to better interact two tasks. The recently proposed UberNet [40] attempted to give a solution for various tasks on diverse datasets with limited memory. Different from these previous works, our proposed TRL treats multi-task learning as a deep manner of task interaction. Specifically, depth estimation and semantic segmentation are mutually boosted and refined in a general recursive architecture.

3 Approach

3.1 Motivation

Here we focus on the interactive learning problem of two tasks, depth estimation and semantic segmentation from a monocular RGB image. Our motivation comes from two aspects: (i) human learning benefits from an iterative/looping interactive process between tasks [23]; (ii) such a pair of tasks are complementary to some extent besides sharing some common information. Therefore, our aim is to make the task-level alternate interaction go deeper, so as to let the two tasks mutually boost each other. The main idea is illustrated in Fig. 1. We define the task-alternate learning process as a series of state transformations along the time axis. Formally, we denote the states of the depth estimation and semantic segmentation tasks at time step p as \(\text {D}^{p}\) and \(\text {S}^{p}\) respectively, and the corresponding responses as \({f}_{D}^{p}\) and \({f}_{S}^{p}\). Denoting the previously obtained experiences as \(\mathcal {F}^{p-1:p-k}_{D} = \{{f}_{D}^{p-1},{f}_{D}^{p-2},\dots ,{f}_{D}^{p-k}\}\) and \(\mathcal {F}^{p-1:p-k}_{S} = \{{f}_{S}^{p-1},{f}_{S}^{p-2},\dots ,{f}_{S}^{p-k}\}\), we formulate the dual-task learning at time step p as

$$\begin{aligned} \left\{ \begin{aligned} \text {D}^{p}&= \varPhi _D^{p}(\mathcal {T}(\mathcal {F}^{p-1:p-k}_{D},~\mathcal {F}^{p-1:p-k}_{S}), \varTheta _D^p) \\ \text {S}^{p}&= \varPhi _S^{p}(\mathcal {T}(\mathcal {F}^{p:p-k+1}_{D},~\mathcal {F}^{p-1:p-k}_{S}),\varTheta _S^p) \end{aligned} \right. , \end{aligned}$$
(1)

where \(\mathcal {T}\) is the interactive function (designed as the task-attentional module below), and \(\varPhi _D^{p}\) and \(\varPhi _S^{p}\) are transformation functions that predict the next state with the parameters \(\varTheta _D^p\) and \(\varTheta _S^p\) to be learnt. At time step p, the depth estimation \(\text {D}^{p}\) is conditioned on the previous k-order experiences \(\mathcal {F}^{p-1:p-k}_{D}\) and \(\mathcal {F}^{p-1:p-k}_{S}\), while the segmentation \(\text {S}^{p}\) depends on \(\mathcal {F}^{p:p-k+1}_{D}\) and \(\mathcal {F}^{p-1:p-k}_{S}\). In this way, the historical experiences from both tasks are propagated along the time sequence by the TAM. That means the dual-task interactions go deeper along the sequence of states. As a general idea, the framework can be adapted to other dual-task applications and even multi-task learning; we give the formulation for multi-task learning in the supplemental material. In this paper we simply set \(k=1\) in Eq. 1, i.e., a short-term dependency.
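To make the recursion concrete, the following minimal Python sketch instantiates Eq. 1 for the short-term case \(k=1\). The function names `Phi_D`, `Phi_S` and `T` are placeholders of ours for the learned depth branch, segmentation branch and TAM interaction, not names from the released implementation.

```python
# A minimal sketch (not the released implementation) of Eq. (1) with k = 1.
# Phi_D, Phi_S and T stand for the learned depth branch, segmentation branch
# and TAM interaction, respectively.
def task_recursive_step(f_D_prev, f_S_prev, Phi_D, Phi_S, T):
    # Depth state at step p uses the previous features of *both* tasks.
    f_D = Phi_D(T(f_D_prev, f_S_prev))
    # Segmentation state at step p uses the just-updated depth features
    # together with the previous segmentation features.
    f_S = Phi_S(T(f_D, f_S_prev))
    return f_D, f_S
```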

Fig. 2.

The overview of our Task-Recursive Learning (TRL) network. The TRL network is an encoder-decoder architecture composed of a series of residual blocks, upsampling blocks and Task-Attentional Modules. The input RGB image is first fed into a ResNet to encode multi-level features, and then these features are fed into the task-recursive decoding process to estimate depth and semantic segmentation. In the decoder, the two tasks are alternately processed by adaptively evolving previous experiences of both tasks (i.e., the previous features of depth and segmentation), so as to boost and benefit each other during the learning process. To estimate the current task state, the previous features of the two tasks are fed into a TAM to enhance the common information. To better refine the predicted details, we progressively execute the two tasks in a coarse-to-fine scale space.

3.2 Network Architecture

Overview: The entire network architecture is shown in Fig. 2. We use a standard ResNet [41] to encode the input image. The gray cubes from Res-2 to Res-5 are multi-scale response maps extracted from the ResNet. The following decoding process is designed to solve the dual tasks based on the task-recursive idea. The decoder is composed of upsampling blocks, task-attentional modules and residual blocks. The upsampling blocks upscale the convolutional features to the scales required for pixel-level prediction; their detailed architecture is introduced in the following subsection. For pixel-level prediction, we introduce residual blocks (blue cubes) to decode the previous features; they mirror the corresponding blocks in the encoder but have only two bottlenecks per residual block. Res-d1, Res-d3, Res-d5 and Res-d7 focus on depth estimation, while the remaining ones focus on semantic segmentation. The TAM is designed to perform the interaction between the two tasks. During the interaction, the previous information is selectively enhanced to adapt to the current task. For example, the TAM before Res-d5 receives inputs from two sources: features upsampled from Res-d4 carrying segmentation information, and features upsampled from Res-d3 carrying depth information. During the interaction, the information from the two inputs is selectively enhanced and propagated to the next task. As the number of interactions increases, the results of the two tasks are progressively refined in a mutual-boosting scheme. Another important strategy is to adopt a coarse-to-fine process to progressively reconstruct details and produce fine-grained high-resolution predictions. Concretely, we concatenate the different-scale features of the encoder to the corresponding residual block, as indicated by the green arrows. The upsampling block and the task-attentional module are described in the following subsections; a structural sketch of the decoding loop is given below.
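The sketch below lays out one possible PyTorch-style organization of this alternating, coarse-to-fine decoding. The module grouping, attribute names and exact wiring of the skip connections are our assumptions; they merely follow the description above and are not the official code.

```python
import torch

def trl_decode(enc_feats, stages):
    """Structural sketch of the decoder in Fig. 2 (our illustration).

    enc_feats: encoder skip features ordered coarse-to-fine (Res-5 ... Res-2).
    stages: per-scale bundles of modules with assumed attribute names:
    up_d/up_s (upsampling blocks), tam_d/tam_s (TAMs),
    res_d/res_s (decoding residual blocks), head_d/head_s (prediction layers).
    """
    f_d = f_s = enc_feats[0]                     # start from the coarsest features
    depth_preds, seg_preds = [], []
    for skip, st in zip(enc_feats[1:], stages):  # coarse-to-fine scales
        u_d, u_s = st.up_d(f_d), st.up_s(f_s)    # upscale both task features
        # depth step: TAM fuses previous depth and segmentation experiences,
        # then a residual block decodes them with the encoder skip feature
        f_d = st.res_d(torch.cat([st.tam_d(u_d, u_s), skip], dim=1))
        # segmentation step: TAM fuses the *new* depth features with the
        # previous segmentation features (the alternation of Eq. (1))
        f_s = st.res_s(torch.cat([st.tam_s(f_d, u_s), skip], dim=1))
        depth_preds.append(st.head_d(f_d))       # multi-scale predictions for the loss
        seg_preds.append(st.head_s(f_s))
    return depth_preds, seg_preds
```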

Fig. 3.

The overview of our upsampling block and task-attentional module.

Task-Attentional Module: As discussed in Sect. 1, the semantic segmentation and depth estimation results of a scene share many common patterns, e.g., both reveal object edges, boundaries and layouts. To better mine and utilize this common information, we design a task-attentional module to enhance the correlated information of the two tasks. As illustrated in Fig. 2, the TAM is used before each residual block and takes depth/segmentation features from previous residual blocks as inputs. The designed TAM is presented in Fig. 3(a). The input depth/segmentation features are first fed into a balance unit to balance the contributions of the features from the two sources. If we use \(f_d\) and \(f_s \in R ^{H\times W\times C}\) to denote the received depth and segmentation features respectively, the balance unit can be formulated as:

$$\begin{aligned} B&= \text {Sigmoid}(\varPsi _1(\text {concat}(f_d, f_s), \varTheta _1)),\nonumber \\ f_b&= \varPsi _2(\text {concat}(B\cdot f_d, (1-B)\cdot f_s), \varTheta _2), \end{aligned}$$
(2)

where \(\varPsi _1\) and \(\varPsi _2\) are two convolutional layers with parameters \(\varTheta _1\) and \(\varTheta _2\), respectively. \(B \in R ^{H\times W\times C}\) is the learnt balancing tensor, and \(f_b \in R ^{H\times W\times C}\) is the output of the balance unit. In this way, \(f_b\) combines the balanced information from the two sources. Next, the balanced output is fed into a series of conv-deconvolutional layers, as illustrated by the yellow cubes in Fig. 3(a). This mechanism is designed to obtain different spatial attentions through receptive field variation, as demonstrated in the residual attention work [42]. After a Sigmoid transformation, we obtain an attentional map \(\text {M}\in R ^{H\times W\times C}\), which is expected to have higher responses on the common patterns. Finally, the attentional tensor \(\text {M}\) is used to generate the gated depth/segmentation features, formally,

$$\begin{aligned} f^g_d&= (1+\text {M})\cdot f_d, \nonumber \\ f^g_s&= (1+\text {M})\cdot f_s. \end{aligned}$$
(3)

Thus the features \(f_d\) and \(f_s\) may be enhanced through the learned attentional map \(\text {M}\). The gated features \(f_d^g\) and \(f_s^g\) are further fused by concatenation followed by one convolutional layer. The output of the TAM is denoted as \(f_{\text {TAM}}\in {\text {R}}^{H\times W\times C}\). The task-attentional module benefits our task-recursive learning method, as experimentally analysed in Sect. 4.2.
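A minimal PyTorch sketch of the TAM following Eqs. 2 and 3 is given below. The kernel sizes and the depth of the conv-deconv attention branch are our assumptions (the text specifies only the overall structure), and the input spatial size is assumed even so that the downsample/upsample pair preserves resolution.

```python
import torch
import torch.nn as nn

class TAM(nn.Module):
    """Sketch of the Task-Attentional Module (Eqs. 2-3); layer sizes are assumed."""
    def __init__(self, channels):
        super().__init__()
        self.psi1 = nn.Conv2d(2 * channels, channels, 3, padding=1)  # Psi_1 of Eq. (2)
        self.psi2 = nn.Conv2d(2 * channels, channels, 3, padding=1)  # Psi_2 of Eq. (2)
        self.attn = nn.Sequential(            # conv-deconv attention branch (assumed depth)
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)  # final fusion conv

    def forward(self, f_d, f_s):
        b = torch.sigmoid(self.psi1(torch.cat([f_d, f_s], dim=1)))   # balancing tensor B
        f_b = self.psi2(torch.cat([b * f_d, (1 - b) * f_s], dim=1))  # balanced feature f_b
        m = self.attn(f_b)                                           # attentional map M
        f_d_g = (1 + m) * f_d                                        # gated features, Eq. (3)
        f_s_g = (1 + m) * f_s
        return self.fuse(torch.cat([f_d_g, f_s_g], dim=1))           # f_TAM
```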

Upsampling Blocks: The upsampling blocks are designed to match the scale variations during the task-recursive learning. The architecture of the upsampling block is shown in Fig. 3(b). The features of size \({H\times W\times C}\) are first fed into four parallel convolutional layers with different receptive fields (i.e., conv-1 to conv-4 in Fig. 3). These four convolutional layers are designed to capture different local structures. The responses produced by the four convolutional layers are then concatenated into a tensor of size \(H\times W\times 2C\). Finally, the sub-pixel operation of [43] is applied to spatially upscale the feature. Formally, given a tensor T and a coordinate [h, w, c], the sub-pixel operator can be defined as:

$$\begin{aligned} \mathcal {P}(T_{h,w,c})= T_{\lfloor {h/r}\rfloor ,\lfloor {w/r}\rfloor ,c\cdot r\cdot \text {mod}(w,r)+c\cdot \text {mod}(h,r)}, \end{aligned}$$
(4)

where r is the scale factor. After this sub-pixel operation, the output of one upsampling block is a feature of size \(2H\times 2W\times C/2\) when we set \(r=2\). The upsampling blocks are more effective than general deconvolution, as verified in the experiments in Sect. 4.2.
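A PyTorch sketch of this block, using the built-in pixel shuffle as the sub-pixel operation, might look as follows. The per-branch channel split (each branch producing C/2 channels so that the concatenation has 2C) is our assumption, and the largest receptive field is realised here with a dilated \(3\times 3\) convolution as noted in Sect. 4.1.

```python
import torch
import torch.nn as nn

class UpsamplingBlock(nn.Module):
    """Sketch of the upsampling block in Fig. 3(b): four parallel convolutions,
    concatenation to 2C channels, then the sub-pixel operation of Eq. (4) with
    r = 2, giving an output of size 2H x 2W x C/2. Channel splits are assumed."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, c, 1),                          # conv-1: 1x1
            nn.Conv2d(channels, c, 3, padding=1),               # conv-2: 3x3
            nn.Conv2d(channels, c, 5, padding=2),               # conv-3: 5x5
            nn.Conv2d(channels, c, 3, padding=2, dilation=2),   # conv-4: dilated 3x3 (cf. Sect. 4.1)
        ])
        self.shuffle = nn.PixelShuffle(2)                       # sub-pixel operation, r = 2

    def forward(self, x):
        # concatenate the four responses into an H x W x 2C tensor
        y = torch.cat([b(x) for b in self.branches], dim=1)
        # rearrange to 2H x 2W x C/2
        return self.shuffle(y)
```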

3.3 Training Loss

We impose a supervised loss constraint on each scale to obtain multi-scale predictions. For depth estimation, we use the inverse Huber (berHu) loss defined in [7], which can be formulated as:

$$\begin{aligned} \mathcal {L}^D(d_i)={\left\{ \begin{array}{ll} |d_i|, &{} |d_i|\le {c},\\ \frac{d_i^2+c^2}{2c}, &{} |d_i|>c, \end{array}\right. } \end{aligned}$$
(5)

where \(d_i\) is the difference between the prediction and the ground truth at pixel i, and c is a threshold set to \(c = \frac{1}{5}\max _i(|d_i|)\) by default. Compared with the \(\ell _2\) loss, this loss provides larger gradients at locations where the depth difference is small, and thus helps to better train the network. The loss function for semantic segmentation is the cross-entropy loss, denoted as \(\mathcal {L}^S\). For better optimization of our proposed dual-task network, we use the strategy proposed in [22] to balance the two tasks. Supposing the network predicts N pairs (w.r.t. N scales) of depth maps and semantic segmentation maps, the total loss function can be defined as:

$$\begin{aligned} \mathcal {L}(\varTheta , \sigma _1, \sigma _2) = \frac{1}{\sigma ^2_1}\sum _{n=1}^N \mathcal {L}_n^D + \frac{1}{\sigma ^2_2}\sum _{n=1}^N\mathcal {L}_n^S + \log (\sigma ^2_1) + \log (\sigma ^2_2), \end{aligned}$$
(6)

where \(\varTheta \) denotes the network parameters, and \(\sigma _1\) and \(\sigma _2\) are the balancing weights for the two tasks. Note that the balancing weights are also optimized as parameters during training. In practice, to avoid a potential division by zero, we reparameterize \(\delta = \log \sigma ^2\). Thus the total loss can be rewritten as:

$$\begin{aligned} \mathcal {L}(\varTheta , \delta _1, \delta _2) = \exp (-\delta _1)\sum _{n=1}^N \mathcal {L}_n^D + \exp (-\delta _2)\sum _{n=1}^N\mathcal {L}_n^S + \delta _1 + \delta _2. \end{aligned}$$
(7)
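For clarity, a compact PyTorch rendering of Eqs. 5 and 7 might look as follows. Reducing the per-pixel berHu terms with a mean over valid pixels, and exposing \(\delta _1, \delta _2\) as plain tensors, are our simplifications; in practice they would be `nn.Parameter`s optimized with the rest of the network.

```python
import torch

def berhu_loss(pred, target, mask):
    """Inverse Huber (berHu) loss of Eq. (5); the threshold c = max|d|/5 follows [7].
    `mask` marks pixels with valid (non-missing) ground-truth depth."""
    d = (pred - target)[mask]
    c = (0.2 * d.abs().max()).detach().clamp(min=1e-6)
    l1 = d.abs()
    l2 = (d ** 2 + c ** 2) / (2 * c)
    return torch.where(l1 <= c, l1, l2).mean()

def total_loss(depth_losses, seg_losses, delta1, delta2):
    """Uncertainty-weighted multi-scale loss of Eq. (7); delta1, delta2 are
    learnable scalars initialised to 0.5 (Sect. 4.1)."""
    return (torch.exp(-delta1) * sum(depth_losses)
            + torch.exp(-delta2) * sum(seg_losses)
            + delta1 + delta2)
```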

4 Experiments

4.1 Experimental Settings

Dataset: We evaluate the effectiveness of our proposed method on the NYU Depth V2 [1] and SUN RGBD [44] datasets. The NYU Depth V2 dataset [1] consists of RGB-D images of 464 indoor scenes. There are 1449 images with semantic labels, of which 795 are used for training and the remaining 654 for testing. We additionally select 4k images at random from the raw data of the official training scenes. These 4k images have corresponding depth maps but no semantic labels. Before training our network, we first train a ResNet-50 based DeconvNet [11] for 40-class semantic segmentation using the given 795 images. We then use the predictions of the trained DeconvNet on the 4k images as coarse semantic labels to train our network. Finally we fine-tune the network on the 795 images of the standard training split. The SUN RGBD dataset [44] contains 10335 RGB-D images with semantic labels, of which 5285 are for training and 5050 for testing. We use the 5285 images with depth and semantic labels to train our network, and the 5050 images for evaluation. The semantic labels are divided into 37 classes. Following the settings in [6, 7, 24, 32], we use the same data augmentation strategies, including cropping, scaling, flipping and rotating, to increase the diversity of the data. As the largest outputs are half the size of the input images, we upsample the predicted segmentation results and depth maps to the original size for comparison.

Implementation Details: We implement the proposed model in PyTorch on a single Nvidia P40 GPU. We build our network on ResNet-18, ResNet-50 and ResNet-101, each pre-trained on the ImageNet classification task [45]. A ReLU activation and batch normalization are applied after every convolutional layer, except the final convolutional layers before the predictions. In the upsampling blocks, we set conv-1, conv-2, conv-3 and conv-4 to \(1\times 1\), \(3\times 3\), \(5\times 5\) and \(7\times 7\) kernel sizes, respectively. Note that we use a \(3\times 3\) convolution with dilation = 2 to efficiently obtain a \(7\times 7\) receptive field. For the parameters of the training loss, we simply use initial values of \(\delta _1 = \delta _2 = 0.5\) in Eq. 7 for all scenes, and find that different initial values have no large effect on the performance. The initial learning rate is set to \(10^{-5}\) for the pre-trained convolutional layers and 0.01 for the other layers. For the NYU Depth v2 dataset, we train our model on the 4k unique images with coarse semantic labels and depth ground truth for 40K batch iterations, and then fine-tune the model with a learning rate of 0.001 on the 795 images with depth and segmentation ground truth for 10K batch iterations. For the SUN-RGBD dataset, we train our model for 50K batch iterations at the initial learning rates, and fine-tune the non-pretrained layers for 30K batch iterations with a learning rate of 0.001. The momentum and weight decay are set to 0.9 and 0.0005 respectively, and the network is trained using SGD with a batch size of 16. As there are many missing values in the ground-truth depth maps, following [7, 24], we mask out the pixels with missing depth in both the training and testing phases.
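A minimal sketch of the optimizer setup described above (two learning-rate groups, SGD with momentum and weight decay) is shown below; the `model.encoder`/`model.decoder` attribute names are illustrative rather than taken from the released code.

```python
import torch

# Two parameter groups: 1e-5 for the ImageNet-pretrained encoder layers,
# 0.01 for the newly added (decoder) layers, as stated in Sect. 4.1.
optimizer = torch.optim.SGD(
    [
        {"params": model.encoder.parameters(), "lr": 1e-5},
        {"params": model.decoder.parameters(), "lr": 0.01},
    ],
    momentum=0.9,
    weight_decay=0.0005,
)
```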

Metrics: Similar to the previous works [6, 7, 24], we evaluate our depth prediction results with the following metrics:

  • average relative error (rel): \(\frac{1}{n}\sum _i\frac{|\widetilde{x_i}-x_i|}{x_i}\);

  • root mean squared error (rms): \(\sqrt{\frac{1}{n}\sum _i(\widetilde{x_i}-x_i)^2}\);

  • root mean squared error in log space (rms(log)): \(\sqrt{\frac{1}{n}\sum _i(\log {\widetilde{x_i}}-\log {x_i})^2}\);

  • accuracy with threshold: % of \(\widetilde{x_i}\) s.t. \(\max (\frac{\widetilde{x_i}}{x_i}, \frac{x_i}{\widetilde{x_i}}) < \delta \), for \(\delta = 1.25, 1.25^2, 1.25^3\);

where \(\widetilde{x_i}\) is the predicted depth value at the pixel i, n is the number of valid pixels and \(x_i\) is the ground truth.

For the evaluation of semantic segmentation results, we follow the recent works [27, 32, 46] and use the common metrics including pixel accuracy (pixel-acc), mean accuracy (mean-acc) and mean intersection over union (mean-IoU).
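For reference, the depth metrics listed above can be computed as in the following NumPy sketch; the masking of invalid (missing-depth) pixels follows the evaluation protocol in Sect. 4.1.

```python
import numpy as np

def depth_metrics(pred, gt):
    """rel, rms, rms(log) and threshold accuracies over valid pixels only."""
    mask = gt > 0                                # discard missing-depth pixels
    p = np.maximum(pred[mask], 1e-6)             # guard the log against non-positive predictions
    g = gt[mask]
    rel = np.mean(np.abs(p - g) / g)
    rms = np.sqrt(np.mean((p - g) ** 2))
    rms_log = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    ratio = np.maximum(p / g, g / p)
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]   # delta, delta^2, delta^3
    return rel, rms, rms_log, acc
```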

4.2 Ablation Study

In this section, we conduct several experiments to evaluate the effectiveness of our proposed method. The concrete ablation studies are introduced in the following.

Analysis on Tasks: We first analyse the benefit of jointly predicting depth and segmentation of one image. The experiments use the same architecture as our ResNet-18 based network and are trained on the NYU Depth v2 and SUN-RGBD datasets for depth estimation and segmentation respectively. As illustrated in Table 1, the two tasks clearly benefit each other under the joint learning of our proposed TRL network. For the NYU Depth v2 dataset, compared with the gain on depth estimation, semantic segmentation obtains a larger gain from the dual-task learning, i.e., an improvement of about 4.1% in mean class accuracy and 3.0% in IoU. One possible reason is that there is more training data for depth (4k depth images) than for segmentation (795 semantically labelled images). In contrast, for the SUN-RGBD dataset, all training samples have both depth and semantic ground truth, i.e., the training samples for both tasks are balanced. We observe that the performance of both tasks is promoted under the proposed task-recursive learning framework.

Table 1. Joint task learning vs. single task learning on the NYU Depth V2 and SUN-RGBD datasets.
Table 2. Comparisons of different network architectures and baselines on the NYU Depth V2 dataset.

Architectures and Baselines: We conduct experiments to analyse the effect of different network architectures. The baseline network has the same encoder but two parallel decoders, one per task, each containing four residual blocks of the same type as in the original TRL decoder. To softly share parameters and interact the two tasks, similar to [19], we use cross-stitch units to fuse features at each scale. To evaluate the effectiveness of the task-attentional module, we further perform an experiment without TAMs. To verify the importance of the historical experience of previous stages, we also train a TRL network without any earlier experience (i.e., without the TAMs and the features from previous residual blocks). Besides, we evaluate the prediction ability at the other three scales (scale-1 to scale-3) to show the effectiveness of the coarse-to-fine mechanism. All these experimental models take ResNet-18 as the backbone. Additionally, we train ResNet-50 and ResNet-101 based TRL networks to analyse the effect of deeper encoding networks.

Fig. 4.

Visual exhibition of the learned attentional maps. (a) input image; (b) segmentation ground truth; (c) depth ground truth; (d) learned attentional map. The attentional maps give high attention to objects, edges and boundaries that are salient in both ground-truth maps, i.e., more attention to the useful common information.

Fig. 5.

Visual comparisons between TRL and baselines on NYU Depth V2 and SUN RGBD. (a) input image; (b) ground truth; (c) results of the baseline; (d) results of TRL w/o TAMs; (e) results of the TRL network. It can be observed that the prediction results of our proposed TRL contain fewer errors and suffer less class ambiguity.

As reported in Table 2, our proposed TRL network performs significantly better than the baseline on both tasks. Compared with the TRL network without TAMs, the full TRL obtains superior performance on both tasks, which indicates that the TAMs can exploit common patterns of the two tasks to promote performance. To illustrate this, we visualize the learned attentional maps \(\text {M}\) of the TAMs. As observed in Fig. 4, the attentional maps have higher attention on objects, edges and boundaries, which are salient in both ground-truth maps. These patterns are common to the two tasks, so the TAMs can capture such common information to promote both tasks. Without the historical experience mechanism, i.e., TRL w/o exp-TAMs, the original TRL obtains an accumulative gain of 21.4% over the two tasks, which demonstrates that the experience mechanism is also crucial for the task-recursive learning process. In the case that the TAM has no gate unit, i.e., TRL w/o gate unit, the resulting accuracies are slightly decreased. As the scale increases, i.e., in the coarse-to-fine manner, the performance is gradually improved on both tasks; an obvious reason is that details are better reconstructed in the finer scale spaces. Further, when more sophisticated and deeper encoders (ResNet-50 and ResNet-101) are employed, the proposed TRL network further improves the performance, consistent with observations in other works.

For a visual analysis, we show some prediction results of the baselines and TRL in Fig. 5. From the figure, we observe that the segmentation results of the two baselines suffer from obvious classification errors, especially within the white bounding boxes. In contrast, the predictions of TRL suffer less class ambiguity and are visually more reasonable. More ablation studies and visual results can be found in our supplementary material.

4.3 Comparisons with the State-of-the-Art Methods

In this section we compare our method with several state-of-the-art approaches on both tasks. The experiments are conducted on NYU Depth V2 and SUN-RGBD datasets, which will be discussed below.

Table 3. Comparisons with the state-of-the-art depth estimation approaches on NYU depth v2 dataset.
Fig. 6.
figure 6

Qualitative comparison with some state-of-the-art approaches on the NYU Depth v2 dataset. (a) input RGB image; (b) ground truth; (c) results of [24]; (d) results of [6]; (e) results of our TRL with ResNet-50. It can be easily observed that our predictions contain more details and less noise than those of the compared methods.

Table 4. Comparisons with the state-of-the-art semantic segmentation methods on the NYU Depth v2 dataset.
Table 5. Comparison with the state-of-the-art semantic segmentation methods on SUN-RGBD dataset.

Depth Estimation: We compare our depth estimation performance on the NYU Depth V2 dataset and summarize the results in Table 3. As observed from the table, our TRL network with ResNet-50 achieves the best performance on the rms, rms(log) and \(\delta \)-accuracy metrics, while the version with ResNet-18 also obtains satisfactory results. Compared with the recent method [7], our TRL is slightly inferior in the rel metric but significantly superior in the other metrics, with a total relative gain of 7.67%. It is worth noting that the method in [7] used a larger training set containing 12k unique image-depth pairs, while our model uses only 4k unique images and still achieves better performance. Compared with the method in [6], we make the same observation: our TRL is slightly worse in the rel metric but obviously better in all other metrics. Note that the method in [6] used more training images (95k) to promote the performance of depth estimation; when its training data is reduced to 4.7k, its accuracy degrades noticeably. In contrast, with a nearly equal amount of training data, our TRL still achieves the best performance in most metrics.

In addition, to provide a visual comparison, we show some examples in Fig. 6. The prediction results of the methods in [6, 24] usually contain much noise, especially at object boundaries, curtains, sofas and beds. In contrast, our predictions have less noise and better match the geometry of the scenes. These experimental results demonstrate that, by borrowing semantic segmentation information, our proposed approach is more effective than the state-of-the-art methods.

RGBD Semantic Segmentation: We compare our TRL method with the state-of-the-art approaches on the NYU Depth V2 and SUN RGBD datasets. For the NYU Depth V2 dataset, as summarized in Table 4, our TRL network with ResNet-50 achieves the best pixel accuracy, but is slightly inferior to the method in [32] in mean class accuracy and to the method in [53] in mean IoU. This may be attributed to imperfect depth predictions: the methods in [32, 53] use the depth ground truth as input and carefully design depth-RGB feature fusion strategies so that segmentation benefits directly from the ground-truth depth. In contrast, our TRL method uses only RGB images as input and conducts semantic segmentation based on the estimated depth, not the ground-truth depth. Although our TRL obtains impressive depth estimation results, the estimated depth is still not as precise as the ground truth, which inevitably introduces some errors into the segmentation prediction. Meanwhile, as the number of samples with semantic labels available for training on NYU Depth V2 is limited (795 images), the performance of our method may be further affected.

For the SUN-RGBD dataset, as reported in Table 5, our TRL network with ResNet-101 reaches the best performance in the pixel-accuracy and mean IoU metrics. It is worth noting that the number of training samples with semantic labels in SUN-RGBD is 5285, more than in NYU Depth V2; thus the performance on the two tasks is generally better than on NYU Depth V2 for most methods, including our TRL network. Compared with the method in [53], our TRL with ResNet-50 obtains a total gain of 2.1% over all metrics, while the ResNet-101 version obtains a total gain of 4.3%. Note that the method in [53] used the stronger ResNet-152 and more precise depth (i.e., ground truth) as inputs, while our TRL network uses only RGB images. Overall, our TRL outperforms the current state-of-the-art methods in most evaluation metrics except mean accuracy, in which it is slightly worse but comparable.

5 Conclusions

In this paper, a novel end-to-end task-recursive learning framework has been proposed for jointly predicting the depth map and semantic segmentation from a single RGB image. The task-recursive learning network alternately refines the two tasks as a recursive sequence of time states. To better leverage the correlated and common patterns of depth and semantic segmentation, we also designed a task-attentional module. The module can adaptively mine the common information of the two tasks, encourage interactive learning between them, and finally benefit both. Comprehensive benchmark evaluations demonstrated the superiority of our task-recursive network in jointly handling depth estimation and semantic segmentation. Meanwhile, we also reported new state-of-the-art results on the NYU Depth V2 and SUN RGBD datasets. In the future, we will generalize the framework to the joint learning of more tasks.