1 Introduction

Convolutional Neural Networks (CNNs) (LeCun et al., 1998) have achieved state-of-the-art results for various computer vision tasks including semantic segmentation (Long et al., 2015; Chen et al., 2017). However, training CNNs typically requires large-scale annotated datasets due to the millions of learnable parameters involved. Collecting such training data relies primarily on manual annotation. For semantic segmentation, the process can be particularly costly due to the required dense annotations. For example, annotating a single image of the Cityscapes dataset took on average 1.5 h (Cordts et al., 2016). For the training set, this adds up to 4460 working hours for annotation alone. For more challenging environmental conditions such as fog, snow, or nighttime, the annotation can be even more expensive. For instance, the annotation of one image of the ACDC dataset (Sakaridis et al., 2021) took 3.3 h on average.

Fig. 1

Our method utilizes self-supervised depth estimation (SDE) in order to improve the holistic learning process of semantic segmentation. In comparison to the standard learning pipeline (a), we learn SDE from unlabeled image sequences and use it to improve the data selection, data augmentation, and training process (b). Further, we extend our framework to semi-supervised domain adaptation (SSDA), where SDE is used to align domains by Matching Geometry Sampling and Cross-Domain DepthMix (c)

Recently, self-supervised learning (Doersch et al., 2015; Gidaris et al., 2018; He et al., 2020) has been shown to be a promising replacement for manually labeled data. It aims to learn representations from the structure of unlabeled data instead of relying on a supervised loss, which requires manual labels. In particular, the principle has successfully been applied to depth estimation from stereo pairs (Godard et al., 2017) or image sequences (Zhou et al., 2017). Additionally, semantic segmentation is known to be tightly coupled with depth. For example, sky is always far away, traffic lights are usually closer than their surroundings, and depth discontinuities often coincide with semantic segmentation borders. Several works (Vandenhende et al., 2021; Xu et al., 2018; Liu et al., 2019; Chen et al., 2019b) have reported that jointly learning segmentation and supervised depth estimation can benefit the performance of both tasks. Motivated by these observations, we investigate the question: How can we leverage self-supervised depth estimation to improve semantic segmentation?

In this work, we propose to use self-supervised monocular depth estimation (SDE) (Godard et al., 2017; Zhou et al., 2017; Godard et al., 2019) to improve the performance of semantic segmentation and to reduce the number of necessary annotations. For this purpose, we consider the holistic learning process covering data selection for annotation, data augmentation, domain adaptation, and multi-task learning. For each step, we show how SDE can effectively be utilized to improve the semantic segmentation performance. In contrast to most previous works, which only exploit supervised depth information during the multi-task learning (Vandenhende et al., 2021), we resort to self-supervised depth estimation as an auxiliary task comprehensively across the entire learning pipeline and show that it is critical to effectively improve the segmentation performance.

We apply our framework to the semi-supervised learning (SSL) and the semi-supervised domain adaptation (SSDA) setting. In SSL, only a part of the underlying dataset is labeled for semantic segmentation, while in SSDA additional labeled data from another (often synthetic) domain is provided. Figure 1 compares the standard learning pipeline (Fig. 1a) with our SDE-enhanced framework for SSL (Fig. 1b) and our method for SSDA (Fig. 1c).

In our SSL framework (see Fig. 1b), we utilize SDE, learned on unlabeled image sequences, to improve the learning pipeline in three places.

First, we propose an automatic data selection for annotation, which selects the most useful samples to be annotated in order to maximize the gain. The selection is iteratively driven by two criteria, diversity and uncertainty, both of which are realized through a novel use of SDE as a proxy task. While our method follows the active learning cycle (i.e. model training \(\rightarrow \) query selection \(\rightarrow \) annotation \(\rightarrow \) model training) (Settles, 2009), it does not require a human in the loop to provide semantic segmentation labels, as the human is replaced by a proxy-task SDE oracle. This greatly improves flexibility, scalability, and efficiency, especially when crowdsourcing platforms are used for annotation.

Second, we propose a strong data augmentation strategy, DepthMix, which blends images as well as their labels according to the geometry of the scenes obtained from SDE. In comparison to previous methods (Yun et al., 2019; Olsson et al., 2021), DepthMix explicitly respects the geometric structure of the scenes and generates realistic occlusions as the distance of objects to the camera is considered.

And third, we deploy SDE as an auxiliary task for semantic image segmentation under a transfer learning and multi-task learning framework and show that it noticeably improves the performance of semantic segmentation, especially when semantic supervision is limited. Previous works focus on improving SDE instead of semantic segmentation (Chen et al., 2019c; Guizilini et al., 2020b) or only consider the special cases of full supervision (Klingner et al., 2020b) and pretraining (Jiang et al., 2018).

Furthermore, we extend the contributions from SSL to SSDA in order to take advantage of additional synthetic (source) training data (see Fig. 1c). As synthetic data can often be annotated automatically for semantic segmentation, it is a valuable source of supervision and can further reduce the annotation effort for the real (target) data. We demonstrate that the previous contributions are effective for SSDA as well. In order to better bridge the domain gap between source data and target data, we combine the previous Target-Domain DepthMix (i.e. the single-domain DepthMix of our SSL method applied to the target domain) with an additional Cross-Domain DepthMix, which mixes samples from the source domain and the target domain. In that way, our framework is able to align the distribution of labeled source data with labeled target data (Cross-Domain DepthMix) and unlabeled target data with labeled target data (Target-Domain DepthMix). As the geometric distribution of the source domain is not aligned with the target domain and the Cross-Domain DepthMix can suffer from blending samples with different geometric distributions, we further introduce a Matching Geometry Sampling based on SDE to better align the camera pose and scene geometry of the source samples with the target domain.

The main advantage of our method is that we can learn from a large base of easily accessible unlabeled image sequences and use the knowledge learned from SDE to improve semantic segmentation performance over the entire training process. This largely alleviates the need for expensive semantic segmentation annotations. In our experimental evaluation on Cityscapes (Cordts et al., 2016), we demonstrate significant performance gains from all four components and improve the previous state of the art for SSL as well as for SSDA by a considerable margin. Importantly, our contributions are complementary and yield even higher improvements when they are combined in a unified framework. Specifically, in an SSL setting, our method achieves 92% of the fully-supervised model performance with only 1/30 of the labels and even slightly outperforms the fully-supervised model with only 1/8 of the labels. In the SSDA setting with additional supervision from the synthetic GTA5 dataset (Richter et al., 2016), our method achieves 97% of the fully-supervised model performance with only 1/30 of the target labels.

Our contributions are summarized as follows:

  (1)

    We propose a novel automatic data selection for annotation based on SDE to improve the flexibility of active learning for semantic segmentation. It replaces the human annotator with an SDE oracle and lifts the requirement of having a human in the loop of active learning.

  (2)

    We propose DepthMix, a strong data augmentation strategy based on self-supervised depth estimation, which respects the geometry of the scene.

  (3)

    We utilize SDE as an auxiliary task to exploit depth features learned on unlabeled image sequences to significantly improve the performance of semantic segmentation by transfer and multi-task learning. In combination with (1) and (2), we achieve state-of-the-art results for semi-supervised semantic segmentation on Cityscapes.

  (4)

    We propose a novel semi-supervised domain adaptation method, which combines Target-Domain DepthMix with Cross-Domain DepthMix. Further, Matching Geometry Sampling aligns the camera pose and scene geometry during the mixing process towards the target domain. We show that our method achieves state-of-the-art results for SSDA on GTA5\(\rightarrow \)Cityscapes and Synthia\(\rightarrow \)Cityscapes.

This work is an extension of our IEEE Conference on Computer Vision and Pattern Recognition 2021 paper (Hoyer et al., 2021), which focuses on contributions (1–3). This article further introduces SSDA utilizing SDE, using both the previous contributions for SSL and the newly proposed combined Cross-Domain/Target-Domain DepthMix and Matching Geometry Sampling. Also, we extend the ablation studies, detail the analysis (e.g. by class-wise performance insights and by a class frequency analysis of the data selection), and improve the presentation of the unified SDE-enhanced learning framework.

2 Related Work

2.1 Self-supervised Depth Estimation (SDE)

Self-supervised depth estimation (SDE) aims to learn depth estimation from the geometric relations of stereo image pairs (Garg et al., 2016; Godard et al., 2017) or monocular videos (Zhou et al., 2017). Due to the better availability of videos, we use the latter approach, where a neural network estimates the depth and the camera motion of two subsequent images and a photometric loss is computed after a differentiable warping. If the camera intrinsics are not known, their estimation can be incorporated into the learning process as well (Gordon et al., 2019). Follow-up works propose improvements of the loss function (Godard et al., 2019; Gonzalez Bello & Kim, 2020; Shu et al., 2020), network architecture (Wang et al., 2019; Guizilini et al., 2020a), and training scheme (Pilzer et al., 2018, 2019; Casser et al., 2019). To handle dynamic objects, several works (Yin & Shi, 2018; Chen et al., 2019c; Ranjan et al., 2019) extend the projection model and combine depth estimation with optical flow estimation.

2.2 Active Learning

Active learning methods iteratively select the most informative samples to be annotated. Two main directions for the selection heuristic can be differentiated. On the one side, uncertainty-based approaches select samples with a high uncertainty estimated based on, e.g., entropy (Hwa, 2004; Settles & Craven, 2008) or ensemble disagreement (Seung et al., 1992; McCallumzy & Nigamy, 1998). However, this can be prone to querying outliers. On the other side, diversity-based approaches select samples, which most increase the diversity of the labeled set (Sener & Savarese, 2018; Sinha et al., 2019). For segmentation, active learning is typically based on uncertainty measures such as MC dropout (Gal & Ghahramani, 2016; Yang et al., 2017; Mackowiak et al., 2018), entropy (Kasarla et al., 2019; Xie et al., 2020), or multi-view consistency (Siddiqui et al., 2020). In contrast to these works, we perform automatic data selection for annotation by replacing the human with an SDE model as oracle. Therefore, we do not require human-in-the-loop annotation during the active learning cycle. Previous works performing data selection without a human in the loop are restricted to shallow models (Yu et al., 2006; Nie et al., 2013; Li et al., 2018), classification with low-dimensional inputs (Li et al., 2020a), or do not perform an iterative data selection (Zheng et al., 2019) to dynamically adapt to the uncertainty of the model trained on the currently labeled set.

2.3 Semi-supervised Semantic Segmentation

Semi-supervised semantic segmentation makes use of additional unlabeled data during training. An early line of work (Souly et al., 2017; Hung et al., 2018; Mittal et al., 2019) applies generative adversarial networks (Goodfellow et al., 2014) in order to include the unlabeled data into the training.

Another increasingly popular direction is self-training with pseudo-labels (Lee, 2013), which alternates between prediction of pseudo-labels for unlabeled data and model retraining on the (pseudo-)labeled data. To construct the pseudo-labels, a popular approach is the mean teacher framework (Tarvainen & Valpola, 2017). It constructs the teacher network for pseudo-label generation from the exponential moving average of the weights of the student network. In order to avoid lazily mimicking the teacher’s predictions and resisting updates, ATSO (Huo et al., 2021) splits the dataset into two parts, trains a model on each, and uses the model trained on one dataset to label the other. Similarly, CPS (Chen et al., 2021b) utilizes two networks with different initialization to generate the pseudo-labels for each other. Further extensions for self-training include training an additional error correction network (Mendel et al., 2020) and dynamically weighing pseudo-labels according to the agreement between two models (Feng et al., 2020b).

Self-training is often combined with consistency training, where perturbations are applied to unlabeled images or their intermediate features and a loss term enforces consistency of the predictions. For instance, Ouali et al. (2020) study perturbation of encoder features, Lai et al. (2021) enforce consistency of overlapping regions of two crops of the same image with different context, and Sohn et al. (2020) train the model on strongly augmented images while the pseudo-labels were generated only with weak augmentation. This general framework is extended by several strong augmentation strategies designed for semantic segmentation. CutMix (Yun et al., 2019; French et al., 2020) mixes crops from images and their pseudo-labels to generate additional training data, ClassMix (Olsson et al., 2021) uses class segments of pseudo-labels to build the mix mask, and Dvornik et al. (2019) paste instance crops into matching context regions of other images. Our proposed DepthMix module is inspired by these methods but it further respects the geometry of the scene when mixing samples in order to produce realistic occlusions.

2.4 Multi-task Learning of Semantic Segmentation and Self-supervised Depth Estimation

Jointly learning semantic segmentation and SDE was studied in previous works with the goal of improving depth estimation. Several works (Ramirez et al., 2018; Jiao et al., 2018; Yang et al., 2018; Chen et al., 2019a; Klingner et al., 2020b) learn both tasks jointly with a single network. Another line of work (Casser et al., 2019; Guizilini et al., 2020b; Jiang et al., 2019) distills knowledge from a teacher semantic segmentation network to guide SDE. To further promote coherence between semantic segmentation and SDE, Ramirez et al. (2018) and Chen et al. (2019a) propose a loss term to encourage spatial proximity between depth discontinuities and segmentation contours. As moving objects break the static world assumption of the SDE warping process, Casser et al. (2019) and Klingner et al. (2020b) incorporate dynamic object segmentations into the SDE loss calculation.

In contrast to these works, we do not aim to improve SDE but rather semi-supervised semantic segmentation. The closest to our approach are Jiang et al. (2018), Novosel et al. (2019), and Klingner et al. (2020b). Jiang et al. (2018) utilize relative depth computed from optical flow to replace ImageNet pretraining for semantic segmentation. In contrast, we additionally study multi-task learning of SDE and semantic segmentation and show that combining SDE with ImageNet features can further boost performance. Novosel et al. (2019) and Klingner et al. (2020b) improve the semantic segmentation performance by jointly learning with SDE. However, they focus on the fully-supervised setting, while our work explicitly addresses the challenges of semi-supervised semantic segmentation by using the depth estimates to generate additional training data and an automatic data selection mechanism based on SDE. Another work (Klingner et al., 2020a) supports the usefulness of SDE by improving the robustness of semantic segmentation.

2.5 Domain Adaptive Semantic Image Segmentation

A special kind of semi-supervised semantic segmentation is domain adaptation, where the unlabeled and labeled data originate from different domains. Different domains can be, for instance, real and synthetic data (Hoffman et al., 2016) or data captured under different conditions such as daytime/nighttime (Dai & Van Gool, 2018) or weather (Sakaridis et al., 2018). Further, it can be distinguished between unsupervised domain adaptation (UDA), if no labeled target data is available, and semi-supervised domain adaptation (SSDA), if a small number of annotations is available for the target domain.

For semantic segmentation, the better-studied scenario is UDA. In order to overcome the domain shift from the source to the target domain, adversarial training can be applied to the input (Hoffman et al., 2018), feature (Tsai et al., 2018), and output space (Tsai et al., 2018; Vu et al., 2019a). Also, non-adversarial input style transfer methods can be utilized (Yang & Soatto, 2020; Kim & Byun, 2020). An increasingly popular approach for UDA is self-training (Chapelle et al., 2009), where high-confidence predictions of a trained model are used to generate pseudo-labels for unlabeled data to iteratively improve the model (Zou et al., 2018; Wei et al., 2018). DACS (Tranheden et al., 2021) shows that ClassMix (Olsson et al., 2021) can also be applied to images from different domains. In contrast to DACS, our method uses the proposed DepthMix strategy, which respects the geometry of the scene during mixing to avoid geometric artifacts, and it combines Cross-Domain DepthMix with Target-Domain DepthMix for effective SSDA. Furthermore, we propose Matching Geometry Sampling to align the scene geometry and camera perspective for Cross-Domain DepthMix. A similar approach has been developed by Li et al. (2020b) by sampling images from the source domain with a similar semantic layout as the target domain. However, they do not perform data mixing, do not consider the geometry of the scene, and rely on the generalization from the semantic segmentation network trained on the source domain to the target domain in order to perform the semantic layout matching. As we use SDE, which can be trained on both the source and the target domain, our Matching Geometry Sampling lifts this assumption. Further self-training extensions include curriculum learning (Dai & Van Gool, 2018; Zhang et al., 2019; Lian et al., 2019), refining pseudo-labels using uncertainties (Zheng and Yang, 2021), augmentation consistency (Araslanov & Roth, 2021), and class prototypes (Zhang et al., 2021).

In contrast to UDA, semi-supervised domain adaptation (SSDA), where a few annotations are also available for the target domain, is less studied. Kalluri et al. (2019) propose a framework with a domain-shared encoder and a domain-specific decoder with additional entropy minimization in a separate embedding space. Wang et al. (2020) extend adversarial domain alignment from UDA (Tsai et al., 2018) and utilize the additional target labels by applying feature-level adversarial domain alignment between labeled source and labeled target samples. For that, a spatial and a class-wise discriminator are introduced to mitigate inter-class confusion. To produce a better feature representation, Alonso et al. (2021) extend self-training with a student-teacher framework by contrastive learning (Hadsell et al., 2006). Concurrent to our work, Chen et al. (2021a) propose to train one teacher model on domain-mixed batches and one teacher model on CutMix (Yun et al., 2019; French et al., 2020) batches. A student model is trained on an ensemble of the two teachers and iterative pseudo-labeling is applied to the training of teachers and students. In contrast to these works, our method requires neither sensitive adversarial training nor costly ensemble training. Also, instead of CutMix, we resort to our DepthMix algorithm, which produces geometrically valid synthesized samples. Further, we propose a combined Cross-Domain and Target-Domain DepthMix as well as a Matching Geometry Sampling, which leads to more effective SSDA.

2.6 Auxiliary Depth Estimation for Domain Adaptation

For UDA, depth estimates can be another valuable source of supervision to align the domains. For that purpose, SPIGAN (Lee et al., 2018) and DADA (Vu et al., 2019b) extend domain adversarial learning with privileged depth information from the source domain. GIO-Ada (Chen et al., 2019b) additionally uses the depth information for input style transfer. By providing depth information from the target domain as well, ATDT (Ramirez et al., 2019) learns a bottleneck feature domain transfer network with depth supervision on both domains, which generalizes to semantic segmentation. In contrast to our work, these approaches require depth ground truth, which can be difficult to acquire.

Concurrently to this work, SDE has been studied as an auxiliary task for unsupervised domain adaptation. Guizilini et al. (2021) utilize multi-task learning of semantic segmentation and SDE to learn a more domain-invariant representation. Instead of applying the view synthesis loss from SDE directly, Wang et al. (2021) use depth pseudo-labels from an SDE teacher network to learn depth estimation and semantic segmentation in a multi-tasking framework. To better transfer knowledge between both domains and tasks, the correlation of depth and semantic segmentation features is explicitly transferred from the source to the target domain and the depth adaptation difficulty is transferred to semantic segmentation to weigh the trust in the semantic segmentation pseudo-labels. Using (self-supervised) depth estimation for semi-supervised domain adaptation, however, has not been studied so far.

3 Methods

In this section, we present our four approaches to improve the performance of semantic segmentation with self-supervised depth estimation (SDE). They focus on four different aspects of the training process, covering data selection for annotation, data augmentation, multi-task learning, and domain adaptation. Given N images and M image sequences from the same domain, our first method, automatic data selection for annotation, uses SDE learned on the M (unlabeled) sequences to select \(N_A\) images out of the N images for human annotation (see Sect. 3.2). Our second approach, termed DepthMix, leverages the learned SDE to create geometrically-sound ‘virtual’ training samples from pairs of labeled images and their annotations (see Sect. 3.4). Our third method learns semantic segmentation with SDE as an auxiliary task under a multi-tasking framework (see Sect. 3.3). The learning is reinforced by a multi-task pretraining process combining SDE with image classification. And fourth, we extend our method to semi-supervised domain adaptation (SSDA) in order to utilize additional synthetic data, which has a low labeling effort (see Sect. 3.5). To address the domain gap, we propose a combined Cross-Domain and Target-Domain DepthMix strategy, which is enhanced by Matching Geometry Sampling.

3.1 Self-supervised Depth Estimation (SDE)

For self-supervised depth estimation (SDE), we follow the method of Godard et al. (2019), which we briefly introduce in the following. We train a depth estimation network to predict the depth of a target image and a pose estimation network to estimate the camera motion between the target image and a source image. Depth and pose are used in a differentiable warping to transform the source image into the target image. The photometric error between the target image and multiple warped source frames is combined by a pixel-wise minimum. In addition, stationary pixels are masked out and an edge-aware depth smoothness term is applied, resulting in the final SDE loss \(L_D\). We refer the reader to the original paper (Godard et al., 2019) for more details.
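For illustration, a minimal PyTorch-style sketch of the per-pixel minimum reprojection with auto-masking is given below; it simplifies the photometric error to a plain L1 term (the full method uses a weighted SSIM + L1 error), omits the smoothness term, and all names are illustrative rather than the original implementation.

```python
import torch

def l1_error(pred, target):
    # The full method combines SSIM and L1; a plain L1 term keeps the sketch short.
    return (pred - target).abs().mean(dim=1, keepdim=True)        # (B, 1, H, W)

def min_reprojection_loss(target, warped_sources, unwarped_sources):
    """target: (B, 3, H, W); the lists hold the warped and original neighboring frames."""
    reproj = torch.cat([l1_error(w, target) for w in warped_sources], dim=1)
    # Auto-masking of stationary pixels: if an un-warped neighboring frame already
    # matches the target better than its warped version, the minimum ignores that pixel.
    identity = torch.cat([l1_error(s, target) for s in unwarped_sources], dim=1)
    identity = identity + torch.randn_like(identity) * 1e-5       # break ties
    per_pixel_min, _ = torch.cat([reproj, identity], dim=1).min(dim=1)
    return per_pixel_min.mean()  # the edge-aware smoothness term is added separately
```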

3.2 Automatic Data Selection for Annotation

We use SDE as a proxy task for selecting \(N_A\) samples out of a set of N unlabeled samples for a human to create semantic segmentation labels. The selection is conducted progressively in multiple steps, similar to the standard active learning cycle (model training \(\rightarrow \) query selection \(\rightarrow \) annotation \(\rightarrow \) model training). However, our data selection is fully automatic and does not require a human in the loop as the annotation is done by a proxy-task SDE oracle as visualized in Fig. 2.

Fig. 2

The automatic data selection for annotation process selects the most useful samples from the set of unlabeled data \({\mathcal {G}}_U\) to be annotated. In contrast to active learning, the human annotator is replaced by an SDE oracle, and the samples are selected according to depth estimation as proxy-task. This lifts the requirement of a human in the loop. Samples are selected according to SDE feature diversity (Sect. 3.2.1) and depth student uncertainty (Sect. 3.2.2). The depth student is a single-image depth estimation network \(f_\text {SIDE}\), which is trained with supervision from the SDE oracle

Let us denote by \({\mathcal {G}}\), \({\mathcal {G}}_A\), and \({\mathcal {G}}_U\) the whole image set, the selected subset for annotation, and the unselected subset, respectively. Initially, we have \({\mathcal {G}}_A=\emptyset \) and \({\mathcal {G}}_U={\mathcal {G}}\). The selection is driven by two criteria: diversity and uncertainty. Diversity sampling encourages the selected images to be diverse and cover different scenes. Uncertainty sampling favors adding unlabeled images that are near a decision boundary (with high uncertainties) of the model trained on the current \({\mathcal {G}}_A\). For uncertainty sampling, we need to train and update the model with \({\mathcal {G}}_A\). Specifically, the trained model \(f_\text {SIDE}\) solves the proxy task of single-image depth estimation (SIDE) on \({\mathcal {G}}_A\) with supervision from the SDE oracle. It is inefficient to repeat this every time a new image is added. For the sake of efficiency, we divide the selection into T steps and only train the model T times. In each step t, \(n_t\) images are selected and moved from \({\mathcal {G}}_U\) to \({\mathcal {G}}_A\), so we have \(\sum _{t=1}^T n_t = N_A\). After each step t, a model is trained on \({\mathcal {G}}_A\) and evaluated on \({\mathcal {G}}_U\) to get updated uncertainties for step \(t+1\).

Algorithm 1: Automatic data selection for annotation

3.2.1 Diversity Sampling

To ensure that the chosen annotated samples are diverse enough to represent the entire dataset well, we use an iterative farthest point sampling based on the L2 distance over features \(\Phi ^{\text {SDE}}\) computed by an intermediate layer of the SDE network. At step t, for each of the \(n_t\) samples, we choose the one in \({\mathcal {G}}_U\) with the largest distance to the current annotation set \({\mathcal {G}}_A\). The set of selected samples \({\mathcal {G}}_A\) is iteratively extended by moving one image at a time from \({\mathcal {G}}_U\) to \({\mathcal {G}}_A\) until the \(n_t\) images are collected:

$$\begin{aligned} {\mathcal {G}}_U = {\mathcal {G}}_U \setminus \{I_i\} \text { and } {\mathcal {G}}_A = {\mathcal {G}}_A \cup \{I_i\}, \end{aligned}$$
(1)
$$\begin{aligned} i = {{\,\textrm{arg max}\,}}_{I_i\in {\mathcal {G}}_U} \min _{I_j\in {\mathcal {G}}_A} ||\Phi ^{\text {SDE}}_i - \Phi ^\text {SDE}_j||_2. \end{aligned}$$
(2)
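For illustration, the iterative farthest point selection of Eqs. (1) and (2) can be sketched as follows, assuming the pooled SDE features \(\Phi ^{\text {SDE}}\) are precomputed as a NumPy array; the function and variable names are illustrative.

```python
import numpy as np

def diversity_sampling(features, selected_idx, n_new):
    """features: (N, D) pooled SDE features Phi^SDE; selected_idx: images already in G_A."""
    selected = list(selected_idx)
    unselected = [i for i in range(len(features)) if i not in set(selected)]
    if not selected:                              # seed G_A with an arbitrary first image
        selected.append(unselected.pop(0))
        n_new -= 1
    for _ in range(n_new):
        # Distance of every candidate to its nearest already-selected image (inner min of Eq. (2)).
        dists = np.linalg.norm(
            features[unselected][:, None, :] - features[selected][None, :, :], axis=2
        ).min(axis=1)
        best = unselected[int(dists.argmax())]    # farthest point (outer arg max)
        selected.append(best)                     # move from G_U to G_A, Eq. (1)
        unselected.remove(best)
    return selected
```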

3.2.2 Uncertainty Sampling

While diversity sampling is able to select diverse new samples, it is unaware of the uncertainties of a semantic segmentation model over these samples. Our uncertainty sampling aims to select difficult samples, i.e., samples in \({\mathcal {G}}_U\) that the model trained on the current \({\mathcal {G}}_A\) cannot handle well. In order to train this model, active learning typically uses a human-in-the-loop strategy to add annotations for selected samples. In this work, we use a proxy task based on self-supervised annotations, which can run automatically, to make the method more flexible and efficient. Since our target task is single-image semantic segmentation, we choose to use single-image depth estimation (SIDE) as the proxy task. Importantly, due to our SDE framework, depth pseudo-labels are available for \({\mathcal {G}}\). Using these pseudo-labels, we train a SIDE method on \({\mathcal {G}}_A\) and measure the uncertainty of its depth predictions on \({\mathcal {G}}_U\). Due to the high correlation of single-image semantic segmentation and SIDE, the generated uncertainties are informative and can be used to guide our sampling procedure. For example, if the depth model fails to correctly estimate the depth of a truck because trucks were underrepresented in \({\mathcal {G}}_A\), the semantic segmentation model will probably also struggle to recognize the truck. As the depth student model is trained only on \({\mathcal {G}}_A\), it can specifically approximate the difficulty of candidate samples with respect to the already selected samples in \({\mathcal {G}}_A\). The student is trained from scratch in each step t, instead of being fine-tuned from \(t-1\), to avoid getting stuck in the previous local minimum. Note that the SDE method is trained on a much larger unlabeled dataset, i.e., the M image sequences, and can provide good guidance for the SIDE method.

In particular, the uncertainty is signaled by the disparity error between the student network \(f_{\text {SIDE}}\) and the teacher network \(f_{\text {SDE}}\) in the log-scale space under L1 distance:

$$\begin{aligned} E(i) = || \log (1 + f_{\text {SDE}}(I_i)) - \log (1 + f_{\text {SIDE}}(I_i))||_1. \end{aligned}$$
(3)

As the disparity difference of far-away objects is small, the log-scale is used to avoid the loss being dominated by close-range objects. This criterion can be added into Eq. (2) to also select samples with higher uncertainties for the dataset update in Eq. (1):

$$\begin{aligned} i = {{\,\textrm{arg max}\,}}_{I_i\in {\mathcal {G}}_U} \min _{I_j\in {\mathcal {G}}_A} ||\Phi ^{\text {SDE}}_i - \Phi ^\text {SDE}_j||_2 + \lambda _{\text {E}}E(i), \end{aligned}$$
(4)

where \(\lambda _E\) is a parameter to balance the contribution of the two terms. For diversity sampling, we still use SDE features instead of SIDE student features as SDE is trained on the entire dataset, which provides better features for diversity estimation. When \(n_t\) images have been selected according to Eqs. (1) and (4) at step t, a new SIDE model will be trained on the current \({\mathcal {G}}_A\) in order to continue further. As presented previously, our selection proceeds progressively in T steps until we collect all \(N_A\) images. The algorithm of this selection is summarized in Algorithm 1, where \(\sum _{t'=1}^tn_{t'}\) describes the desired size of \({\mathcal {G}}_A\) at the end of step t.
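As an illustration, one selection step of Algorithm 1, combining the diversity term of Eq. (2) with the student uncertainty of Eqs. (3) and (4), may be sketched as follows, assuming the SDE and SIDE disparity maps as well as the pooled SDE features are precomputed; names are illustrative.

```python
import numpy as np

def student_uncertainty(disp_sde, disp_side):
    # Log-scale L1 disparity error between SDE oracle and SIDE student, cf. Eq. (3).
    return np.abs(np.log1p(disp_sde) - np.log1p(disp_side)).sum()

def selection_step(features, disp_sde, disp_side, selected_idx, n_t, lambda_e=1000.0):
    """One step t of Algorithm 1; assumes G_A (selected_idx) is already non-empty."""
    errors = np.array([student_uncertainty(disp_sde[i], disp_side[i])
                       for i in range(len(features))])
    selected = list(selected_idx)
    unselected = [i for i in range(len(features)) if i not in set(selected)]
    for _ in range(n_t):
        div = np.linalg.norm(
            features[unselected][:, None, :] - features[selected][None, :, :], axis=2
        ).min(axis=1)
        score = div + lambda_e * errors[unselected]   # Eq. (4)
        best = unselected[int(score.argmax())]
        selected.append(best)
        unselected.remove(best)
    return selected   # afterwards, the SIDE student is retrained from scratch on G_A
```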

3.3 Learning with Auxiliary Self-supervised Depth Estimation

Fig. 3

Architecture for learning semantic segmentation with SDE as auxiliary task according to Sect. 3.3. The dashed paths are only used during training and only if image sequences and/or segmentation ground truth are available for a training sample

In this section, we resort to features learned by SDE from unlabeled image sequences to improve the performance of semantic segmentation through transfer and multi-task learning. For that purpose, we use a network with a shared encoder \(f^E_\theta \), a separate depth decoder \(f^D_\theta \), and a separate segmentation decoder \(f^S_\theta \) (see Fig. 3). For effective multi-task learning, useful intermediate features are exchanged between both task-specific decoders. In particular, we use the attention-guided multi-modal distillation module proposed by Xu et al. (2018). Guided by a learned attention map, this module distills features from the depth decoder that are relevant for semantic segmentation and injects them into the semantic segmentation decoder. Vice versa, features from the semantic segmentation decoder are distilled and injected into the depth decoder. The depth branch \(g^D_\theta = f^D_\theta \circ f^E_\theta \) is trained using the SDE loss \(L_D\) and the segmentation branch \(g^S_\theta = f^S_\theta \circ f^E_\theta \) is trained using the semi-supervised semantic segmentation loss \(L_S\), which we will introduce in Eq. (13) of the next section:

$$\begin{aligned} L_{MTL} = L_D + L_S. \end{aligned}$$
(5)

In order to initialize the pose estimation network and the depth branch \(g^D_\theta = f^D_\theta \circ f^E_\theta \) properly, the architecture is first trained only on the M unlabeled image sequences for SDE. As is common practice, we initialize the encoder with ImageNet weights as they provide useful semantic features learned during image classification. To avoid forgetting these semantic features during the SDE pretraining, we utilize a feature distance loss between the current bottleneck features \(f^{E}_\theta \) and the bottleneck features generated by the encoder with ImageNet weights \(f^{E}_{I}\)

$$\begin{aligned} L_{F} = ||f^{E}_\theta - f^{E}_{I}||_2. \end{aligned}$$
(6)

The loss for the depth pretraining is the weighted sum of the SDE loss and the ImageNet feature distance loss

$$\begin{aligned} L_{D,\text {pretrain}} = L_D + \lambda _{F} L_{F}. \end{aligned}$$
(7)

To exploit the features from SDE for semantic segmentation by transfer learning, the weights from SDE \(g^{D}_\theta \) are used to initialize the semantic segmentation branch \(g^{S}_\theta \).
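A condensed sketch of this pretraining objective (Eqs. (6) and (7)) is given below, assuming a generic encoder, a frozen ImageNet-initialized copy of it, and an already-defined SDE loss; all helper names are illustrative.

```python
import copy
import torch

def build_feature_anchor(encoder):
    # Frozen copy of the ImageNet-initialized encoder, providing the anchor features f^E_I.
    anchor = copy.deepcopy(encoder).eval()
    for p in anchor.parameters():
        p.requires_grad_(False)
    return anchor

def depth_pretrain_loss(encoder, anchor, depth_decoder, sde_loss_fn, images, lambda_f=1e-2):
    feats = encoder(images)                      # bottleneck features f^E_theta
    with torch.no_grad():
        anchor_feats = anchor(images)            # f^E_I
    l_f = torch.norm(feats - anchor_feats, p=2)  # feature distance, Eq. (6)
    l_d = sde_loss_fn(depth_decoder(feats))      # photometric SDE loss L_D
    return l_d + lambda_f * l_f                  # Eq. (7)
```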

3.4 DepthMix Data Augmentation

Inspired by the recent success of data augmentation approaches that mix up pairs of images and their (pseudo-)labels to generate more training samples for semi-supervised semantic segmentation (Yun et al., 2019; French et al., 2020; Olsson et al., 2021), we propose an algorithm, termed DepthMix, which utilizes self-supervised depth estimates to maintain the integrity of the scene structure during mixing.

Given two images \(I_i\) and \(I_j\) of the same size, we would like to copy some regions from \(I_i\) and paste them directly into \(I_j\) to get a virtual sample \(I^\prime \). The copied regions are indicated by a binary mask M, which has the same size as the two images. The image creation is done as

$$\begin{aligned} I^\prime = M \odot I_i + (1 - M) \odot I_j, \end{aligned}$$
(8)

where \(\odot \) denotes the element-wise product. The semantic segmentation labels of the two images \(S_i\) and \(S_j\) are mixed up with the same mask M to generate the corresponding mixed semantic segmentation

$$\begin{aligned} S^\prime = M \odot S_i + (1 - M) \odot S_j. \end{aligned}$$
(9)

The mixing can be applied to labeled data and unlabeled data using human ground truths or pseudo-labels, respectively. Existing methods generate this mask M in different ways, e.g., randomly sampled rectangular regions (Yun et al., 2019; French et al., 2020) or randomly selected class segments (Olsson et al., 2021). In those methods, the structure of the scene is not considered and foreground and background are not distinguished. We find images synthesized by these methods often violate the geometric relationships between objects. For instance, a distant object can be copied onto a close-range object or only unoccluded parts of mid-range objects are copied onto the other image. Imagine how strange it is to see a pedestrian standing on top of a car or to see the sky through a hole in a building (just as shown in Fig. 4 left).

Fig. 4

Concept of the proposed DepthMix data augmentation (refer to Sect. 3.4) and its baseline ClassMix (Olsson et al., 2021) shown for the mixing of the semantic segmentation labels. By utilizing SDE, DepthMix mitigates geometric artifacts such as missing occluders (bus-shaped hole in the building) or missing occlusion (legs of the person). The corresponding images are mixed in the same way

Our DepthMix is designed to mitigate this issue. It uses the self-supervised depth estimates \({\hat{D}}_i\) and \({\hat{D}}_j\) of the two images to generate the mask M, which respects the notion of geometry. It is implemented by selecting only pixels from \(I_i\) whose depth values are smaller than the depth values of the pixels at the same locations in \(I_j\):

$$\begin{aligned} M(a,b) = \left\{ \begin{array}{rl} 1 &{} \text {if } {\hat{D}}_i(a,b) < {\hat{D}}_j(a,b) + \epsilon \\ 0 &{} \text {otherwise } \end{array} \right. \end{aligned}$$
(10)

where a and b are pixel indices, and \(\epsilon \) is a small value to avoid conflicts of objects that are naturally at the same depth plane such as road or sky. By using this M, DepthMix respects the depth of objects in both images, such that only closer objects can occlude further-away objects. We illustrate this advantage of DepthMix with an example in Fig. 4.
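A minimal sketch of the resulting augmentation (Eqs. (8)–(10)) for single, unbatched tensors is shown below; it assumes the depth estimates are on the scale for which \(\epsilon = 0.03\) (Sect. 4.3) was chosen, and the function name is illustrative.

```python
import torch

def depthmix(img_i, img_j, seg_i, seg_j, depth_i, depth_j, eps=0.03):
    """img_*: (3, H, W) images; seg_*: (H, W) labels or pseudo-labels; depth_*: (H, W)."""
    # Eq. (10): keep pixels of image i that are closer to the camera than in image j.
    mask = (depth_i < depth_j + eps).float()
    mixed_img = mask * img_i + (1.0 - mask) * img_j              # Eq. (8)
    mixed_seg = mask.long() * seg_i + (1 - mask.long()) * seg_j  # Eq. (9)
    return mixed_img, mixed_seg, mask
```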

In order to further take advantage of the unlabeled dataset \({\mathcal {G}}_U\) for DepthMix, we generate pseudo-labels using the mean teacher algorithm (Tarvainen & Valpola, 2017), which is commonly deployed in SSL (Berthelot et al., 2019; Verma et al., 2019; French et al., 2020; Olsson et al., 2021). For that purpose, an exponential moving average is applied to the weights of the semantic segmentation model \(g^S_\theta \) to obtain the weights of the mean teacher \(\theta _T\):

$$\begin{aligned} \theta '_T = \alpha \theta _T + (1 - \alpha ) \theta . \end{aligned}$$
(11)

To generate the pseudo-labels, an argmax over the classes C is applied to the prediction of the mean teacher:

$$\begin{aligned} S_U = {{\,\textrm{arg max}\,}}_{c \in C}(g^S_{\theta _T}(I_U)). \end{aligned}$$
(12)

The mean teacher can be considered as a temporal ensemble, resulting in stable predictions for the pseudo-labels, while the argmax promotes confident predictions (Olsson et al., 2021).

In order to utilize the pseudo-labels, we apply DepthMix to two samples \((I_i, S_i), (I_j, S_j)\) from the combined labeled and pseudo-labeled data pool \({\mathcal {G}}_A \cup {\mathcal {G}}_U\) to produce a mixed training pair \((I', S')\) according to Eq. (8). The semantic segmentation network is trained using the cross-entropy of labeled samples \((I_A, S_A)\) and the quality-weighted cross-entropy of mixed samples \((I', S')\):

$$\begin{aligned} \begin{aligned} L_ S = L_{ce}(g^S_\theta (I_A), S_A) + q' L_{ce}(g^S_\theta (I'), S'), \end{aligned} \end{aligned}$$
(13)

where \(q'\) denotes the estimated quality of the mixed pseudo-label. We follow Olsson et al. (2021) and define \(q'\) as the fraction of pixels exceeding a threshold \(\tau \) for the predicted probability of the most confident class \(P'\):

$$\begin{aligned} q' = \frac{\sum _{a,b} [P'(a, b) > \tau ]}{W\cdot H}. \end{aligned}$$
(14)

As the DepthMix segmentation \(S'\) consists of labels from two images, we calculate \(P'\) as the mix of its sources:

$$\begin{aligned} P' = M \odot P_i + (1-M) \odot P_j, \end{aligned}$$
(15)

where P is the predicted probability of the most confident class for unlabeled images and 1 for labeled images:

$$\begin{aligned} P(a,b) = {\left\{ \begin{array}{ll} \max _{c \in C}(g^S_{\theta _T}(I)(a,b)),&{} \text {if } I \in {\mathcal {G}}_U\\ 1, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(16)

By applying DepthMix to labeled and pseudo-labeled samples, the network is exposed to image regions from both distributions in a single image. This can improve its generalization to the unlabeled data as the context for labeled regions can originate from unlabeled data and vice versa. The improved generalization can lead to better pseudo-labels, which in turn improve the quality of the DepthMix labels.
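For illustration, the mean teacher update (Eq. (11)), pseudo-label generation (Eq. (12)), and quality-weighted loss (Eqs. (13) and (14)) can be sketched as follows. The sketch shows the simpler case where the quality is computed from a single teacher prediction, whereas the quality of a mixed label is composed from both sources via Eqs. (15) and (16); names and the threshold argument are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_mean_teacher(teacher, student, alpha=0.99):
    # Eq. (11): exponential moving average of the student weights.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

@torch.no_grad()
def pseudo_label(teacher, image, tau):
    probs = torch.softmax(teacher(image), dim=1)       # (B, C, H, W)
    conf, label = probs.max(dim=1)                     # arg max of Eq. (12) plus confidence P
    quality = (conf > tau).float().mean(dim=(1, 2))    # fraction of confident pixels, Eq. (14)
    return label, quality

def semi_supervised_loss(student, img_l, seg_l, img_mix, seg_mix, quality):
    # Eq. (13): supervised term plus quality-weighted term on the mixed sample.
    loss_l = F.cross_entropy(student(img_l), seg_l)
    loss_mix = F.cross_entropy(student(img_mix), seg_mix, reduction="none").mean(dim=(1, 2))
    return loss_l + (quality * loss_mix).mean()
```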

3.5 Semi-supervised Domain Adaptation (SSDA)

Synthetic data can be another valuable source for low-effort semantic segmentation annotations to reduce the number of expensive target labels. In semi-supervised domain adaptation (SSDA), a neural network is trained to solve a task on the real (target) domain while being trained using a limited number of annotated target samples \((I^{trg}_A, S^{trg}_A)\), further unlabeled target images \(I^{trg}_U\), and additional annotated data from the synthetic (source) domain \((I^{src}_A, S^{src}_A)\).

Naively, the semantic segmentation network branch \(g_\theta ^S\) can be trained on the labeled samples from both source and target domain using a pixel-wise cross-entropy loss

$$\begin{aligned} L_{ce}^{trg}&= L_{ce}(g_\theta ^S(I^{trg}_A), S^{trg}_A) \,, \end{aligned}$$
(17)
$$\begin{aligned} L_{ce}^{src}&= L_{ce}(g_\theta ^S(I^{src}_A), S^{src}_A) \,. \end{aligned}$$
(18)

However, as the labeled data from the target dataset is limited, the vanilla training strategy suffers from the gap between both domains.

In this work, we propose to use SDE to overcome the domain gap of SSDA. Extending the default setup, we augment both the target and the source dataset with self-supervised depth estimates. For that purpose, an SDE network \(f_D^{trg}\) is trained on image sequences from the target domain and another SDE network \(f_D^{src}\) is trained on image sequences from the source domain. Note that the image sequences can be different from the images labeled for semantic segmentation. After the SDE training, depth pseudo-labels are inferred for the images of the semantic segmentation datasets: \(D^{src}_A = f_D^{src}(I^{src}_A)\); \(D^{trg}_U = f_D^{trg}(I^{trg}_U)\); \(D^{trg}_A = f_D^{trg}(I^{trg}_A)\). Further, pseudo-labels \(S^{trg}_U\) are obtained online according to Eq. (12) for the unlabeled target data. The additional depth and semantic segmentation pseudo-labels are added to the SSDA training data.

Fig. 5

Semi-supervised domain adaptation (SSDA) pipeline with Cross-Domain DepthMix (CDM) and Target-Domain DepthMix (TDM). While CDM applies DepthMix to samples from source and target domain to align both domains, TDM mixes labeled and pseudo-labeled samples from the target domain to align labeled and unlabeled target data. The network is trained on clean labeled source data, CDM/TDM data, and clean labeled target data for semantic segmentation. The target semantic segmentation pseudo-labels are obtained online using a mean teacher network

Based on this data, we propose a combined Cross-Domain and Target-Domain DepthMix in order to facilitate effective self-training across domains as well as across labeled and unlabeled samples, respectively. Further, we enhance the mixing by Matching Geometry Sampling. The training process is visualized in Fig. 5 and described in the following.

3.5.1 Target-Domain DepthMix (TDM)

Target-Domain DepthMix (TDM) applies the DepthMix algorithm to the target dataset. It mixes labeled and unlabeled target samples to improve the generalization from the labeled target to the unlabeled target samples due to the increased variety of objects in different contexts. Therefore, it can favorably affect the quality of the pseudo-labels. Target-Domain DepthMix uses the same procedure as the single-domain SSL DepthMix described in Sect. 3.4. It produces a mixed sample \((I'_{TDM}, S'_{TDM})\) based on two target samples according to Eqs. (8)–(10). The segmentation branch of the network is trained using the pixel-wise cross-entropy on the mixed samples

$$\begin{aligned} L_{TDM} = q'_{TDM} L_{ce}(g_\theta ^S(I'_{TDM}), S'_{TDM}), \end{aligned}$$
(19)

where \(q'_{TDM}\) weighs the loss according to the certainty of the pseudo-label as described in Sect. 3.4.

Mixing within a domain is only applied to the target domain and not to the source domain because the mixing serves the purpose of better generalization from labeled to unlabeled samples during the self-training. The source domain already contains many labeled samples. Therefore, self-training augmented by mixing is not necessary.

3.5.2 Cross-Domain DepthMix (CDM)

As there is only a small number of labeled samples available for the target domain, the trained network will still suffer from the gap between the source and target domain. To further align the domains during training, we propose Cross-Domain DepthMix, which mixes samples from both domains. This allows the network to better generalize across domains as both domains are present within each image.

Cross-Domain DepthMix utilizes one target sample and one source sample. If the target image is unlabeled, a pseudo-label is generated according to Eq. (12). The samples are mixed according to Eqs. (8)–(10) to generate the cross-domain mixed sample \((I'_{CDM}, S'_{CDM})\). The segmentation branch of the network is trained using the pixel-wise cross-entropy on the mixed samples

$$\begin{aligned} L_{CDM} = q'_{CDM} L_{ce}(g_\theta ^S(I'_{CDM}), S'_{CDM}), \end{aligned}$$
(20)

where \(q'_{CDM}\) weighs the loss according to the certainty of the pseudo-label as described in Sect. 3.4.

The final SSDA loss combines all four segmentation losses as well as the SDE loss on the target domain

$$\begin{aligned} L_{SSDA} = L_{ce}^{trg} + L_{ce}^{src} + L_{CDM} + L_{TDM} + L_D^{trg}, \end{aligned}$$
(21)

where the loss components are weighted equally.
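A compact sketch of one SSDA loss computation (Eqs. (17)–(21)) is given below, assuming the Cross-Domain and Target-Domain DepthMix samples and their quality weights have already been produced as described above; all dictionary keys and names are illustrative.

```python
import torch.nn.functional as F

def ssda_loss(seg_net, sde_loss_fn, batch):
    """batch holds labeled source/target samples, the pre-computed Cross-Domain and
    Target-Domain DepthMix samples with their quality weights, and target frames
    for the auxiliary SDE loss; all keys are illustrative."""
    l_trg = F.cross_entropy(seg_net(batch["img_trg_l"]), batch["seg_trg_l"])    # Eq. (17)
    l_src = F.cross_entropy(seg_net(batch["img_src_l"]), batch["seg_src_l"])    # Eq. (18)
    l_cdm = batch["q_cdm"] * F.cross_entropy(seg_net(batch["img_cdm"]), batch["seg_cdm"])  # Eq. (20)
    l_tdm = batch["q_tdm"] * F.cross_entropy(seg_net(batch["img_tdm"]), batch["seg_tdm"])  # Eq. (19)
    # Equal weighting of all terms, Eq. (21).
    return l_trg + l_src + l_cdm + l_tdm + sde_loss_fn(batch["seq_trg"])
```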

3.5.3 Matching Geometry Data Sampling

For samples from two different domains, the camera pose can differ between the domains, as can be seen in the first three rows of Fig. 6. The geometric distribution difference between domains can impede the transfer of knowledge from the source to the target domain. For example, GTA contains samples from the view of a pedestrian while all Cityscapes samples are recorded from a front-facing camera of a car. This leads to different camera perspectives, which can result in unrealistic mixed samples such as a car “flying” in the sky (second row), or samples outside the target distribution such as images captured right in front of a building (third row).

Fig. 6

Examples of the geometric domain gap and the Matching Geometry Sampling. Images and their SDE are shown for the target (first row) and the source domain (remaining rows). Some samples from the source domain (second and third row) have a different depth distribution compared to the target domain, which results in unrealistic DepthMix images (last column). Matching geometry sampling avoids sampling those domain pairs by selecting pairs with a small geometric difference (fourth row)

We address this problem by sampling image pairs from the source and the target domain with a similar geometry with respect to the camera. The sampling is guided by the target geometry, which allows us to better match the geometric target distribution with mixed images. We define the geometric difference G(ij) of two samples i and j as the L1 distance of the log-scale disparity (inverse depth) estimates in camera space

$$\begin{aligned} G(i,j) = ||\log \left( 1+\frac{1}{D_i} \right) - \log \left( 1+\frac{1}{D_j} \right) ||_1, \end{aligned}$$
(22)

which corresponds to the metric used for the uncertainty sampling of our automatic data selection in Eq. (3). When calculating the geometric difference, we exclude the top 80 pixels and the bottom 100 pixels of the image. This prevents SDE artifacts in the sky and on the hood of the ego car from contaminating the geometric difference. The pixel-wise geometric difference is visualized in the third column of Fig. 6. It can be observed that it is generally higher for samples that do not have a matching geometry or camera perspective.

Based on a single target sample \(i^{trg}\) and a set of candidate source samples \({\mathcal {C}}^{src}\), which are both sampled randomly, the source sample with the smallest geometric difference is selected for training

$$\begin{aligned} j^{src} = {{\,\textrm{arg min}\,}}_{c^{src} \in {\mathcal {C}}^{src}} G(i^{trg}, c^{src}). \end{aligned}$$
(23)

As the target sample is fixed during a matching step, it guides the selection towards the target distribution. The number of candidate samples \(|{\mathcal {C}}^{src}|\) balances between a good geometric match and a higher sampling diversity. A larger number of candidates results in a potentially better geometric match of the chosen sample, but it reduces the diversity of the chosen samples as it limits the sampling to the set of source samples that have a small geometric distance to the target domain in general.
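For illustration, the geometric difference of Eq. (22) and the candidate selection of Eq. (23) can be sketched as follows, assuming depth maps are given as (H, W) NumPy arrays of equal resolution; names are illustrative and the default of five candidates follows Sect. 4.3.

```python
import numpy as np

def geometric_difference(depth_trg, depth_src, top=80, bottom=100):
    # Eq. (22) on log-scale disparities, excluding the sky and ego-car hood regions.
    disp_t = np.log1p(1.0 / depth_trg[top:-bottom])
    disp_s = np.log1p(1.0 / depth_src[top:-bottom])
    return np.abs(disp_t - disp_s).sum()

def match_geometry(depth_trg, source_depths, rng, n_candidates=5):
    """Among a few randomly drawn source candidates, pick the one whose scene
    geometry best matches the given target sample, Eq. (23)."""
    cand = rng.choice(len(source_depths), size=n_candidates, replace=False)
    diffs = [geometric_difference(depth_trg, source_depths[c]) for c in cand]
    return int(cand[int(np.argmin(diffs))])
```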

This Matching Geometry Sampling allows our method to avoid the described issues of naive sampling and results in realistic DepthMix images, which are closer to the target distribution as can be seen in the last row of Fig. 6.

4 Experiment Setup

4.1 Datasets

Cityscapes We mainly evaluate our method on the Cityscapes dataset (Cordts et al., 2016), which consists of 2975 training and 500 validation images with semantic segmentation labels from European street scenes. We downsample the images to \(1024 \times 512\) pixels. In addition, random cropping to a size of \(512 \times 512\) and random horizontal flipping are used during training. Importantly, Cityscapes provides 20 unlabeled frames before and 10 unlabeled frames after each labeled image, which are used for our SDE training. During the semi-supervised segmentation, only the 2975 images of the core dataset are used. If not stated otherwise, the same processing steps are applied to the following datasets as well.

CamVid The CamVid dataset (Brostow et al., 2009) contains 367 training, 101 validation, and 233 test images with dense semantic segmentation labels for 11 classes from street scenes in Cambridge. To ensure a similar feature resolution as for Cityscapes, we upsample the CamVid images from \(480 \times 360\) to \(672 \times 512\) pixels and randomly crop them to a size of \(512 \times 512\) pixels.

GTA5 The GTA5 dataset (Richter et al., 2016) originates from a computer game, which enabled time-efficient semi-automatic semantic segmentation annotation. It contains about 25k training images labeled using the same 19 classes as Cityscapes. The SDE is trained on another part of the dataset (Richter et al., 2017), which provides image sequences.

Synthia The Synthia dataset (Ros et al., 2016) provides synthetic images with automatically generated annotations from a simulated urban environment. For semantic segmentation, we use the SYNTHIA-RAND-CITYSCAPES subset, which contains 9400 samples labeled with 16 semantic classes common with Cityscapes. Following the standard protocol for domain adaptation, we train our method for the 16 semantic classes that are common with Cityscapes and evaluate on 13 of them. The SDE is trained on the SYNTHIA-SEQS video sequence subset.

4.2 Network Architecture

Our network consists of a shared ResNet101 (He et al., 2016) encoder with output stride 16, a decoder for segmentation, and a decoder for SDE. The decoder consists of an ASPP (Chen et al., 2017) block with dilation rates of 6, 12, and 18 to aggregate features from multiple scales and another four upsampling blocks with skip connections (Ronneberger et al., 2015). For SDE, the upsampling blocks have a disparity side output at the respective scale. For effective multi-task learning, we additionally follow PAD-Net (Xu et al., 2018) and deploy an attention-guided distillation module after the third decoder block. It serves the purpose of exchanging useful features between segmentation and depth estimation. The design of the network architecture was chosen to facilitate both transfer and multi-task learning. To enable effective transfer learning, the task decoder branches have the same architecture and combine elements from typical semantic segmentation architectures such as the ASPP (Chen et al., 2017) as well as the commonly used U-Net decoder structure (Ronneberger et al., 2015) for depth estimation. This allows for pretraining the segmentation decoder branch with SDE and repurposing it for semantic segmentation afterward. For the pose estimation network, we use the same design as in (Godard et al., 2019). For the SDE network on the source domains, we use an output stride of 32 and a reduced number of decoder channels in order to improve convergence.
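For orientation, the overall layout can be sketched as follows; this is a schematic only, which omits the pose network and the attention-guided distillation between the decoders.

```python
import torch.nn as nn

class SegDepthNet(nn.Module):
    """Simplified sketch of the shared-encoder, dual-decoder layout; the actual model
    additionally exchanges features between the decoders via PAD-Net-style
    attention-guided distillation, which is omitted here."""
    def __init__(self, encoder, seg_decoder, depth_decoder):
        super().__init__()
        self.encoder = encoder              # shared ResNet-101, output stride 16
        self.seg_decoder = seg_decoder      # ASPP + U-Net-style upsampling, class logits
        self.depth_decoder = depth_decoder  # same layout, disparity side outputs

    def forward(self, x):
        feats = self.encoder(x)             # multi-scale features incl. skip connections
        return self.seg_decoder(feats), self.depth_decoder(feats)
```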

4.3 Training

For the SDE pretraining, the depth and pose network are trained using the Adam optimizer, a batch size of 4, and an initial learning rate of \(1 \times 10^{-4}\), which is divided by 10 after 160k iterations. The SDE loss is calculated on four scales with three subsequent frames. During the first 300k iterations, only the depth decoder and the pose network are trained. Afterwards, the depth encoder is fine-tuned with an ImageNet feature distance \(\lambda _F = 1 \times 10^{-2}\) for another 50k iterations. The encoder is initialized with ImageNet weights, either before depth pretraining or before semantic segmentation if depth pretraining is ablated. The baseline is trained with the same hyperparameters but only with a cross-entropy loss on the labeled samples. Its encoder is initialized with ImageNet pretrained weights.

For the semi-supervised multi-task learning, we train the network using SGD with a learning rate of \(1 \times 10^{-3}\) for the encoder and depth decoder, \(1 \times 10^{-2}\) for the segmentation decoder, and \(1 \times 10^{-6}\) for the pose network. The learning rate is reduced by a factor of 10 after 30k iterations and the network is trained for another 10k iterations. A momentum of 0.9, a weight decay of \(5 \times 10^{-4}\), and gradient norm clipping to 10 are used. The losses for segmentation and SDE are weighted equally. The mean teacher uses \(\alpha =0.99\) and, within an iteration, the network is trained on a clean labeled batch and an augmented mixed batch of size 2 each. The latter uses DepthMix with \(\epsilon = 0.03\), color jitter, and Gaussian blur. If only pseudo-labeling but no mixing is used in an experiment, color jitter and Gaussian blur are still applied to the augmented batch.
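For illustration, the optimizer configuration described above may be set up as follows; this is a sketch, not the original training script, and gradient norm clipping to 10 would additionally be applied at each step (e.g. via torch.nn.utils.clip_grad_norm_).

```python
import torch

def build_segmentation_optimizer(encoder, depth_decoder, seg_decoder, pose_net):
    # Per-module learning rates and SGD settings as listed in Sect. 4.3.
    param_groups = [
        {"params": encoder.parameters(),       "lr": 1e-3},
        {"params": depth_decoder.parameters(), "lr": 1e-3},
        {"params": seg_decoder.parameters(),   "lr": 1e-2},
        {"params": pose_net.parameters(),      "lr": 1e-6},
    ]
    optimizer = torch.optim.SGD(param_groups, lr=1e-3, momentum=0.9, weight_decay=5e-4)
    # Learning rate reduced by a factor of 10 after 30k of the 40k iterations.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30000], gamma=0.1)
    return optimizer, scheduler
```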

For SSDA, the same hyperparameters as in the SSL setting are used. A batch consists of two source samples, two labeled target samples, and two (pseudo-)labeled target samples, which are used to compute \(L_{SSDA}\) (see Fig. 5). For the Matching Geometry Sampling, the number of random source candidate samples is set to 5: \(|{\mathcal {C}}^{src}| = 5\).

4.4 Automatic Data Selection for Annotation

Table 1 Comparison of data selection methods (DS: diversity sampling based on depth features, US: uncertainty sampling based on depth student error)
Table 2 Comparison of the class-wise IoU in % of the data selection methods for 372 labeled samples

For the automatic data selection, we use a slimmed network architecture for \(f_{SIDE}\) with a ResNet50 backbone, reduced decoder channels, and BatchNorm (Ioffe & Szegedy, 2015) in the decoder for efficiency and faster convergence. The depth student network is trained with a berHu loss using Adam with a learning rate of \(1 \times 10^{-4}\) and polynomial decay with exponent 0.9. For calculating the depth feature diversity, we use the output of the second depth decoder block after SDE pretraining. It is downsampled by average pooling to a size of \(8 \times 4\) pixels and the feature channels are normalized to zero mean and unit variance over the dataset. The student depth error is weighted by \(\lambda _E = 1000\). The number of selected samples (\(\sum _{t'=1}^t n_{t'}\)) is incrementally increased to 25, 50, 100, 200, 372, and 744. For each subset, a student depth network is trained from scratch for 4k, 8k, 12k, 16k, and 20k iterations, respectively, to calculate the student depth error and select the samples for the next subset. The quality of the selected subset with annotations \({\mathcal {G}}_A\) is evaluated for semantic segmentation using our default architecture and training hyperparameters. For the entropy baseline, a semantic segmentation network is trained on \({\mathcal {G}}_A\) and the samples with the highest mean pixel-wise Shannon entropy of the semantic segmentation prediction are greedily chosen from \({\mathcal {G}}_U\) to extend \({\mathcal {G}}_A\). Apart from that, the entropy baseline uses the same hyperparameters as our method.

5 Results

5.1 Automatic Data Selection for Annotation

Fig. 7
figure 7

Class frequency analysis of the data selection behavior. The ratio of selected pixels (372 samples) and dataset pixels (2975 samples) grouped by ground truth class for different data selection methods is shown. The values are averaged over 3 random seeds. The error bars indicate the standard deviation

First, we evaluate the proposed automatic data selection (see Sect. 3.2) on the Cityscapes (Cordts et al., 2016) dataset. Table 1 compares our method with a baseline and a competing method for different numbers of selected labeled samples. The baseline selects the labeled samples randomly, while the strong competitor uses active learning and iteratively chooses the samples with the highest segmentation entropy. In contrast to our method, this requires a human in the loop to create the semantic labels for the iteratively selected images. Table 1 shows that our method with diversity sampling (DS) works better than with uncertainty sampling (US) for few labeled samples. We hypothesize that, for a small number of annotated samples, it is more important to cover the underlying distribution with a diverse subset than to focus on uncertain/difficult samples. For a larger subset, however, it makes sense to focus on the uncertain samples as the common cases are most likely already covered. Further, combining diversity sampling and uncertainty sampling (DS + US) performs better than using them individually, showing that these criteria are complementary and cover two relevant aspects of selecting data for annotation. When comparing our method using both sampling criteria (DS + US) with the baselines “Random” and “Entropy”, it can be seen that our method outperforms both, demonstrating the effectiveness of ensuring diversity and exploiting difficult samples based on depth estimation. It also supports the assumption that depth estimation and semantic segmentation are correlated in terms of sample difficulty. With 1/4 of the labeled samples, our method achieves 98.8% of the fully-supervised performance, and with only 1/8 of the samples it still reaches 94.8%. Furthermore, the standard deviation of the achieved segmentation performance with our data selection is noticeably lower than for the random baseline when using few labeled samples, resulting in better reproducibility.

To better understand the underlying reasons for the improved performance, we analyze the class-wise IoU for 372 labeled samples in Table 2. It shows that our automatic data selection significantly improves the performance of difficult classes with a low IoU of the random baseline such as wall, fence, truck, bus, and train. In comparison to the strong active learning entropy baseline, our method achieves even better performance for the classes wall, rider, truck, and bus.

In order to investigate possible reasons for the improved performance of the automatic data selection, we visualize the ratio of the automatically selected pixels and the total dataset pixels, grouped by ground truth class, for 372 selected samples in Fig. 7. As expected, the ratio is about 0.125 for most of the classes when selecting 1/8 of the samples randomly (Fig. 7 left). For the entropy baseline and our method, a higher ratio of difficult/rare classes (e.g. truck, bus, and train) is sampled from the underlying training set, while a smaller ratio of common classes such as road and building is sampled. When comparing the class-wise IoU (Table 2) and the ratio of selected pixels (Fig. 7), it can be seen that the improvement for difficult classes is correlated with them being selected more frequently by the automatic data selection. Intuitively, more samples of rare and easy-to-confuse classes such as car, truck, bus, and train as well as wall and fence help the classifier to distinguish them. When comparing the active learning entropy baseline to our method, Fig. 7 shows that our method selects a higher ratio of wall, person, rider, and truck, which directly connects to the higher class-wise IoU for these classes as shown in Table 2. This is also reflected by a high Pearson correlation coefficient of \(\rho = 0.92\) between the increase in class selection ratio and the improvement of the class-wise IoU. Please note that the class statistics of Fig. 7 are not available to our method during the entire selection process. This demonstrates that our method is able to correctly estimate the utility of samples for subsequent semantic segmentation without knowing the ground truth labels during the selection.

5.2 DepthMix Data Augmentation

Second, we study the proposed geometry-guided mixing strategy DepthMix (see Sect. 3.4). We evaluate the performance for the SSL setting with 372 labeled training samples (which corresponds to 1/8 of the labeled samples in Cityscapes) and for the fully-supervised setting with 2975 samples. The subset of labeled samples is chosen randomly. Table 3 shows the mean and standard deviation of the mIoU in percent over three random seeds. Additionally, the improvement in percentage points of the analyzed components over the baseline, which only uses a cross-entropy loss on labeled samples, is shown. In accordance with the literature on semi-supervised mixing (French et al., 2020; Olsson et al., 2021; Sohn et al., 2020), we first add self-training with pseudo-labels from the mean teacher to the framework. As can be seen in Table 3, this already significantly improves the performance in the SSL setting by \(+\) 3.24 mIoU percentage points. Still, our proposed DepthMix module further increases the performance by another \(+\) 1.76 (\(+\) 2.06) percentage points for 372 (2975) labeled samples. Note that the high variance for few labeled samples is mostly due to the high influence of the randomly selected labeled subset. The chosen subset affects all configurations equally and the reported improvements are consistent for each subset.

When comparing DepthMix directly to the competitor ClassMix (Olsson et al., 2021) in Table 3, the performance of DepthMix is still \(+\) 0.98 (\(+\) 0.23) percentage points higher for 372 (2975) samples. This demonstrates the effectiveness of the geometry-aware mixing, which better handles occlusions as described in Sect. 3.4. The larger improvement of DepthMix for fewer labeled samples might be due to the fact that the SDE used by DepthMix can be trained on a large set of unlabeled samples, resulting in precise depth contours over the whole (un)labeled training set. ClassMix, in contrast, uses segmentation pseudo-labels for mixing, which were only supervised on the subset of labeled samples. Therefore, on the unlabeled samples, the mixing contours can be less accurate than for DepthMix.
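For reference, a minimal sketch of the geometry-aware mixing is shown below. It assumes that the mix mask simply keeps the pixels of the first image that are estimated to be closer to the camera than in the second image (up to a slack \(\epsilon\) on the relative depth), which is our reading of the DepthMix definition in Sect. 3.4; the per-sample tensor shapes are assumptions as well.

```python
# Hedged sketch of DepthMix for a single image pair (assumed per-sample tensor shapes).
import torch

def depthmix(img_i, img_j, depth_i, depth_j, seg_i, seg_j, eps=0.03):
    """img_*: (3, H, W), depth_*: (H, W) relative depth from SDE, seg_*: (H, W) labels.
    Pixels of sample i are pasted onto sample j wherever i is closer (within eps)."""
    mask = (depth_i < depth_j + eps)             # True where image i occludes image j
    mixed_img = torch.where(mask, img_i, img_j)  # mask broadcasts over the channels
    mixed_depth = torch.where(mask, depth_i, depth_j)
    mixed_seg = torch.where(mask, seg_i, seg_j)
    return mixed_img, mixed_depth, mixed_seg
```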

Table 3 Comparison of different mixing strategies
Table 4 Comparison of the class-wise IoU in % of the different mixing strategies for 372 labeled samples

Further, we analyze the class-wise IoU for 372 labeled samples as shown in Table 4. Pseudo-labels generally improve the IoU through self-training. However, for the rare class motorcycle, the IoU decreases compared to the baseline. The reason is probably a pseudo-label drift of motorcycle towards the similar class bicycle during the self-training. Both mixing strategies mitigate this drift by improving the generalization from labeled to unlabeled data, as they provide different contexts and occlusions during training. The better generalization leads to fewer erroneous pseudo-labels and consequently to less drift. Additionally, this also results in a higher IoU for other difficult classes with a low baseline IoU such as sidewalk, wall, fence, traffic light, traffic sign, rider, truck, bus, and train. When comparing DepthMix and ClassMix, it can be seen that DepthMix improves over ClassMix for difficult classes with usually pronounced depth contours such as wall, traffic light, rider, bus, train, and motorcycle. However, there is a slight decrease in IoU for the classes sidewalk and terrain. These are classes that can easily be confused with each other and with road. DepthMix might experience difficulties with these classes as there are usually no depth contours between them, which results in fewer mixing boundaries.

The effective occlusion handling of DepthMix can be seen in Fig. 8a–c for samples from Cityscapes. It shows input images in orange and blue as well as their SDE used for mixing. The column “DepthMix Select.” visualizes from which input image the regions chosen by DepthMix originate. As can be seen in Fig. 8a, DepthMix is able to handle occlusions at multiple levels. The biker from the blue image occludes buildings from the orange image, but the blue biker is itself also partly occluded by the closer biker from the orange image. Similar cases can be seen for trees, traffic signs, and cars in Fig. 8b, c. The column “Mixed Image \(I'\)” shows the resulting image without the selection overlay. It can be seen that due to the spatially accurate depth contours, the mixed images contain only minor mixing border artifacts and have a realistic appearance. The same is the case for the mixed segmentation as can be seen in the column “Mixed Segm. \(S'\)”.

Fig. 8
figure 8

Examples of DepthMix applied to Cityscapes crops. From left to right, the source images with their SDE estimate, the mixed image \(I'\) overlaid with the border of the mix mask M in blue/orange depending on the adjacent source image (i—orange, j—blue), the mixed image without visual guidance \(I'\), the mixed depth \(D'\), and the mixed segmentation \(S'\) are shown. For simplicity, the source segmentations for the mixed segmentation \(S'\) originate from the ground truth labels. Rows ac demonstrate the strength of DepthMix to handle occlusions, while rows d, e show typical failure cases

However, there are also some cases in which DepthMix fails to correctly mix images according to their geometry. Examples of typical failure cases are shown in Fig. 8d, e. First, the SDE can be inaccurate for dynamic objects due to the violation of the static world assumption, which can cause an inaccurate structure within the mixed image. This is particularly the case if a car is driving in front of the ego car (Fig. 8d). However, this type of failure case is common in ClassMix and its frequency is greatly reduced with DepthMix. A remedy might be SDE extensions that incorporate the motion of dynamic objects (Casser et al., 2019; Dai et al., 2020; Klingner et al., 2020b). Second, in some cases, the SDE can be imprecise and the depth discontinuities do not appear at the same location as the class border. This can cause artifacts in the mixed image as well as in the mixed segmentation as can be seen for the sky within the building in Fig. 8e. Note that the same can happen for ClassMix when the pseudo-labels, used for the mixing, do not have accurate segmentation borders.

5.3 Transfer and Multi-task Learning

Third, we study the proposed transfer and multi-task learning of semantic segmentation and the auxiliary task self-supervised depth estimation in Table 5. For 372 (2975) samples, SDE transfer learning of the encoder and decoder (with previous ImageNet pretraining of the encoder) improves performance by \(+\) 1.31 (\(+\) 1.23) percentage points mIoU over the baseline with only ImageNet pretraining of the encoder. This demonstrates the usefulness of the features learned by SDE for semantic segmentation, both in the semi- and fully-supervised case. Additional regularization of the encoder with an ImageNet feature distance loss during SDE pretraining improves the performance by another \(+\) 0.35 (\(+\) 0.48) percentage points. Furthermore, multi-task learning in addition to transfer learning results in a performance increase of \(+\) 0.45 (\(+\) 0.29) percentage points.

The class-wise analysis for 372 labeled samples (see Table 6) shows that SDE transfer learning without the ImageNet feature distance loss significantly improves the performance of classes whose segmentation borders coincide with depth discontinuities, such as fence, pole, traffic light, and traffic sign. This is possibly due to their characteristic depth profile learned during SDE. For example, a good depth estimation performance requires correctly segmenting poles or traffic signs, as missing them can cause large depth errors. However, there is a performance drop for classes that differ only in subtle semantic details, such as truck, bus, train, and motorcycle. We hypothesize that the SDE pretraining causes the network to forget important semantic features from the ImageNet pretraining that are relevant for semantic segmentation but not for SDE. For example, for SDE it is not relevant whether an object is a bus or a train, but for semantic segmentation it is. Adding the ImageNet feature distance loss to the SDE pretraining in order to avoid forgetting these semantic features prevents the performance drop for truck, bus, and train. The additional multi-task learning further improves the performance for the small, difficult classes rider and motorcycle.

5.4 Combined Framework for SSL

Next, we combine the three contributions multi-task learning, DepthMix, and automatic data selection for annotation into a unified semi-supervised semantic segmentation framework. The first part of Table 7 summarizes the performance of these components from the previous sections for a better comparison. The component with the largest improvement is the automatic data selection for annotation with diversity and uncertainty sampling, with \(+\) 5.11 mIoU percentage points for 372 labeled samples. However, it is not applicable to the full dataset, as all samples are used and there is no need for sample selection. The second-most effective component is DepthMix with pseudo-labeling, which also yields a pronounced mIoU improvement of \(+\) 5.00 (\(+\) 2.06) for 372 (2975) samples. The smallest but still significant improvement comes from multi-task learning with \(+\) 2.00 (\(+\) 1.99) percentage points. The direct comparison of the class-wise IoU for 372 labeled samples in Table 8 shows that data selection mostly improves the performance of difficult classes with a low baseline IoU (e.g. wall, fence, truck, bus, and train), SDE multi-task learning improves classes with surrounding depth discontinuities (e.g. fence, pole, traffic light, traffic sign, and rider), and DepthMix improves both.

Table 5 Comparison of SDE feature transfer methods (F: ImageNet feature distance loss)
Table 6 Comparison of the class-wise IoU in % of SDE feature transfer methods for 372 labeled samples (F: ImageNet feature distance loss)
Table 7 Comparison of the combinations of the proposed framework components (S: data selection, DM: DepthMix, MTL: SDE multi-task learning)
Table 8 Comparison of the class-wise IoU in % of the combinations of the proposed framework components for 372 labeled samples (see Table 7 for abbreviations)

Considering that the three contributions follow different approaches and improve the performance of different subsets of classes, we further study their combinations as shown in the second part of Tables 7 and 8. The improvement over the baseline performance is \(+\) 6.21 when combining multi-task learning with data selection, \(+\) 7.34 when combining DepthMix and data selection, and \(+\) 7.52 (\(+\) 3.40) when combining multi-task learning and DepthMix for 372 (2975) samples. In all cases, the combination is better than every single component. The class-wise analysis for 372 labeled samples in Table 8 reveals that the class-wise performance of a combination usually matches the best class-wise performance of its components. As the components perform well on different classes, this already contributes to the improved performance of the combinations. Moreover, there are some classes such as fence, traffic sign, rider, truck, bus, and train, where the performance of the combination is even higher than that of its best component. This might be due to self-reinforcing effects. For example, the improved segmentation detail at depth contours from multi-task learning is propagated into DepthMix and results in even better pseudo-label supervision for mixed samples. The last row of Table 7 shows the combination of all three contributions. With an improvement of \(+\) 8.87 percentage points for 372 labeled samples, it achieves the best results so far. It combines the strengths of our three contributions and significantly improves the performance for classes with depth discontinuities and for difficult classes. The largest improvement is achieved for truck, bus, and train, where the mIoU is more than 50% better than the baseline.

5.5 Influence of Depth Estimation on SSL Performance

Table 9 Influence of the depth estimation method on the semi-supervised semantic segmentation performance for 100 labels
Fig. 9
figure 9

Example semantic segmentations and self-supervised depth estimates of our method for 100 labeled samples in comparison with ClassMix (Olsson et al., 2021) and the baseline

To better understand the influence of the depth estimation performance on semi-supervised semantic segmentation, we study the performance of our framework with different depth estimation networks. In particular, we compare the used SDE method with fully-supervised depth estimation, where the supervision comes from the official Cityscapes stereo depth maps. In contrast to SDE, which only provides relative depth maps with an unknown scale factor, the fully-supervised depth estimates have a metric scale and correctly predict the depth of dynamic objects. Further, we compare the default SDE network, which was trained with all 83,300 Cityscapes frames, with SDE networks, which were trained only with a randomly selected subset of the frames. The performance of the depth network variants is evaluated on the Cityscapes validation set with the depth RMSE and RMSE log metrics with a 50 m depth cap. For the evaluation of SDE, the common practice of per-image median ground truth scaling is used (Zhou et al., 2017; Godard et al., 2019).
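As a reference for the evaluation protocol, the sketch below computes the two depth metrics; it assumes per-image depth tensors already restricted to pixels with valid ground truth, and applying the 50 m cap by clamping is an assumption (excluding pixels beyond the cap would be an equally plausible variant).

```python
# Sketch of the depth evaluation with median scaling and a 50 m depth cap (see assumptions above).
import torch

def evaluate_depth(pred, gt, max_depth=50.0, median_scaling=True):
    """Return (RMSE, RMSE log) for one image. `pred` and `gt` are 1D tensors of
    predicted and ground-truth depth at valid pixels."""
    if median_scaling:  # resolve the unknown scale of self-supervised depth estimates
        pred = pred * (torch.median(gt) / torch.median(pred))
    pred = pred.clamp(min=1e-3, max=max_depth)
    gt = gt.clamp(max=max_depth)
    rmse = torch.sqrt(torch.mean((gt - pred) ** 2))
    rmse_log = torch.sqrt(torch.mean((torch.log(gt) - torch.log(pred)) ** 2))
    return rmse.item(), rmse_log.item()
```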

The results are shown in Table 9. It can be seen that the supervised depth estimation achieves the best RMSE and RMSE log (lower values are better). SDE performs best when all training frames are used. When reducing the training frames of SDE, the depth performance gradually drops.

The semi-supervised semantic segmentation performance is correlated with the performance of the depth estimation network. A smaller depth estimation error usually results in a higher segmentation mIoU. However, the differences in mIoU are relatively small when compared with the standard deviations. Only for the SDE with 0.1% of the frames, which has the highest depth error, there is a significant mIoU drop. This shows that the proposed semi-supervised learning framework is relatively robust with respect to the precision of the depth estimates (within certain bounds) and that a rough understanding of the scene geometry is already sufficient to learn useful representations for semantic segmentation. Intuitively, depth estimation requires distinguishing objects from their background, which facilitates semantic grouping. The precise distance of an object to the camera probably plays only a subordinate role for semantic segmentation, as long as the depth estimates are sufficiently correct to facilitate visual grouping.

Further, this experiment shows that the relative depth maps, which have an unknown scaling factor, are sufficient for the proposed semi-supervised semantic segmentation framework. The metric depth maps from supervised depth estimation only slightly improve the segmentation performance by a margin that is still within the standard deviation. When considering SDE as a self-supervised representation learning method for semantic segmentation, the unknown scaling factor of SDE has probably only a minor influence because it does not affect foreground/background relations, which are important for learning a semantic grouping. For example, SDE maps a traffic sign to the same depth value while its surrounding would most likely be mapped to a larger depth. The required SDE features can be re-utilized for semantic segmentation independent of the exact depth values. When considering DepthMix, an unknown scaling factor is also not problematic as long as it is consistent across the mixed images (i.e. both images have a similar scale factor). As qualitatively observed in Figs. 8 and 9, the self-supervised depth estimates are sufficiently consistent for DepthMix.

5.6 Comparison with State-of-the-Art SSL Methods

Table 10 Comparison with state-of-the-art SSL semantic segmentation methods on the Cityscapes validation set (mIoU in %, standard deviation over 3 random seeds)
Table 11 Semantic segmentation performance on the CamVid test set with SDE trained on Cityscapes sequences (mIoU in %, standard deviation over 3 random seeds)

Next, we compare our approach with several state-of-the-art SSL approaches. The results are summarized in Table 10. The performance (mIoU in %) of the SSL methods and their baselines (which use the same backbone network but are only trained on the labeled dataset) is shown for different numbers of labeled samples. As the performance of the baselines differs, additional columns show the absolute improvement for better comparability. As our baseline utilizes a more capable network architecture due to the U-Net decoder with ASPP, as opposed to the DeepLabv2 decoder used by most previous works, we also reimplemented the state-of-the-art method ClassMix (Olsson et al., 2021) with our network architecture and training parameters to ensure a direct comparison.

As shown in Table 10, our method (without data selection) outperforms all other approaches for each labeled subset size, both in terms of absolute performance and of improvement over the baseline. The only exception is the absolute improvement of the original results of ClassMix for 100 labeled samples. However, if we consider ClassMix trained in our setting, our method outperforms it also in this case. This can be explained by the considerably higher baseline performance in our setting, which makes it more difficult to achieve a large improvement. Adding data selection increases the performance by a further significant margin, so that our method, trained with only 1/8 of the labels, even slightly outperforms the fully-supervised baseline.

To identify whether the improvement originates from access to more unlabeled data or from the effectiveness of our approach, we compare it to another baseline, “ClassMix (+Video)”. More specifically, we also provide all unlabeled image sequences to ClassMix and measure how much it benefits from this additional amount of unlabeled data. The experimental results show no significant difference. This is probably due to the high correlation between the Cityscapes image dataset and the video dataset (the images are the 20th frames of the video clips).

The effectiveness of our approach is also reflected in the example predictions in Fig. 9. We can observe that the contours of classes are more precise. This is particularly the case for classes that are surrounded by depth discontinuities such as pole, traffic sign, rider, or person. Moreover, difficult objects such as bus, train, rider, or truck can be better distinguished. As discussed in Sect. 5.4, this observation is also quantitatively confirmed by the class-wise IoU improvement shown in Table 8. On the downside, SDE sometimes fails for cars driving directly in front of the camera (see the 7th row in Fig. 9), which violate the reconstruction assumptions. Those cars are observed at the same location across the image sequence and cannot be correctly reconstructed during SDE training, even with correct depth and pose estimates. However, this network-internal differentiation between moving and non-moving cars does not hinder the transfer of SDE-learned features to semantic segmentation but can cause problems for DepthMix (see Sect. 5.2).

5.7 Learning SDE and Semantic Segmentation on Different Datasets

In this section, we show that the unlabeled image sequences and the labeled segmentations can also originate from different datasets within similar visual domains. For that purpose, we train the SDE on Cityscapes sequences and learn the semi-supervised semantic segmentation on the CamVid dataset (Brostow et al., 2009). As we assume in this scenario that there are no image sequences available for SDE training on CamVid, we only apply transfer learning but no multi-task learning.

Table 11 shows that the results on CamVid are similar to our main results on Cityscapes. For 50/100/367 labeled training samples, our method improves the mIoU by \(+\) 9.0/\(+\) 6.5/\(+\) 3.3 percentage points. Moreover, our proposed method significantly outperforms ClassMix (Olsson et al., 2021) by \(+\) 2.3 percentage points for 50 labeled samples and \(+\) 2.1 percentage points for 100 labeled samples.

5.8 Component Study for SSDA

We study the components of the SSDA framework described in Sect. 3.5 on the commonly used benchmark GTA5 \(\rightarrow \) Cityscapes, where the synthetic source training samples originate from the GTA5 dataset (Richter et al., 2016) and the real target training samples are obtained from Cityscapes (Cordts et al., 2016). After the training, the network is evaluated on the target validation samples from the Cityscapes validation set. First, we analyze our contributions from SSL in an SSDA setting by naively adding the additional source samples to the training according to Eq. (18). The remaining framework is the same as in the previous experiments. To indicate that DepthMix is applied specifically to the target domain in this experiment (as opposed to both domains), we denote it as Target-Domain DepthMix (TDM). TDM is equivalent to the single-domain DepthMix of the previous sections as both operate on Cityscapes.

The first part of Table 12 shows the results using the SSL framework without source domain supervision, while the second part shows the results for the framework with additional semantic segmentation supervision from the source domain according to Eq. (18).

For 100 labeled samples from the target domain, Table 12 shows that additional source domain supervision improves the performance of the baseline by \(+\) 5.08 percentage points. As can be seen in Table 13, this is mainly due to improvements for classes with a low baseline performance such as wall, fence, traffic light, rider, truck, bus, and motorcycle. However, additional source domain supervision deteriorates the performance for the classes sidewalk, terrain, and bicycle, which are easy to confuse and have a considerable domain gap. When our proposed methods from SSL are applied, they also lead to an improved performance in the SSDA setting, as shown in the second part of Table 12. For multi-task learning, the gain is \(+\) 2.37 percentage points with the same performance pattern of the class-wise IoU. For Target-Domain DepthMix, the improvement is \(+\) 6.22, while it also effectively counters the performance drop (from Baseline to SD) for the classes road, sidewalk, terrain, and bicycle (see Table 13). For automatic data selection, the improvement by additional source data is \(+\) 1.09. When combining the three contributions, the performance gain over the baseline with source supervision is \(+\) 10.71. This is \(+\) 2.45 percentage points better than our method for SSL.

Table 12 Comparison of the previous framework components in a SSDA setting (SD: additional source domain data, S: data selection, TDM: Target-Domain DepthMix, MTL: SDE multi-task learning)
Table 13 Comparison of the class-wise IoU in % of the previous framework components in a SSDA setting for 100 labeled target samples (see Table 12 for abbreviations)

For 500 labeled samples from the target domain, additional source domain supervision decreases the performance of the baseline by \(-\) 0.67 percentage points (see Table 12). This shows that additional source supervision is not helpful in this case, probably because there is already decent supervision on the target domain and naively adding the source domain loss cannot close the domain gap. But also in this setting, multi-task learning/Target-Domain DepthMix/data selection can still improve the performance by \(+\) 1.47/\(+\) 5.2/\(+\) 0.98 over the baseline with source supervision. When being combined, their performance gain is \(+\) 7.64. This is \(+\) 0.88 percentage points better than our method for SSL.

Next, we analyze our contributions tailored to overcoming the domain gap in SSDA: Cross-Domain DepthMix (see Sect. 3.5.2) and Matching Geometry Sampling (see Sect. 3.5.3). Table 14 shows that both Cross-Domain DepthMix (CDM) and Target-Domain DepthMix (TDM) significantly outperform the baseline. As shown in Table 15, this is due to an improved performance for difficult classes such as sidewalk, wall, traffic sign, terrain, rider, truck, train, and motorcycle. As DepthMix presents these objects with different backgrounds and occlusions, the network learns to generalize better within the target domain (for TDM) or across domains (for CDM). When comparing the performance of CDM and TDM (see Table 14), it can be seen that CDM works better for 100 labeled target samples and TDM works better for 500. On the one hand, CDM can exploit the labeled source data to propagate its knowledge to the target data through mixing. This is especially useful if only a few labeled target samples are available and most supervision comes from the source domain. On the other hand, TDM can use the already labeled target samples to propagate their knowledge to the unlabeled target data through mixing, without being impeded by a domain gap. This is most effective when sufficient labels from the target domain are available.
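To make the two mixing directions concrete, the following hedged sketch shows how their image pairs could be assembled, reusing the depthmix helper sketched in Sect. 5.2; the sample attributes and the use of teacher pseudo-labels for unlabeled target samples are assumptions.

```python
# Hedged sketch of the two mixing directions (assumed sample attributes).
def target_domain_mix(labeled_tgt, unlabeled_tgt):
    # TDM: propagate (pseudo-)labeled target knowledge to unlabeled target images.
    return depthmix(labeled_tgt.img, unlabeled_tgt.img,
                    labeled_tgt.depth, unlabeled_tgt.depth,
                    labeled_tgt.seg, unlabeled_tgt.pseudo_seg)

def cross_domain_mix(source, unlabeled_tgt):
    # CDM: propagate labeled source knowledge across the domain gap into target images.
    return depthmix(source.img, unlabeled_tgt.img,
                    source.depth, unlabeled_tgt.depth,
                    source.seg, unlabeled_tgt.pseudo_seg)
```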

Table 14 Comparison of domain-adaptive mixing strategies (SD: additional source domain data, S: data selection, TDM: Target-Domain DepthMix, CDM: Cross-Domain DepthMix, MG: Matching Geometry Sampling, MTL: SDE multi-task learning)

Based on this observation, we conclude that it might be useful to combine CDM and TDM to align labeled source and target samples as well as labeled target and unlabeled target samples. Table 14 shows that CDM \(+\) TDM indeed improves the performance over only CDM and only TDM by \(+\) 0.70 (\(+\) 0.79) for 100 (500) labeled target samples due to an improved performance for the classes sidewalk, wall, fence, traffic sign, terrain, and train.

Table 15 Comparison of the class-wise IoU in % of domain-adaptive mixing strategies for 100 labeled target samples (see Table 14 for abbreviations)
Table 16 Comparison with other SSDA methods for GTA \(\rightarrow \) Cityscapes
Table 17 Comparison with other SSDA methods for Synthia \(\rightarrow \) Cityscapes
Fig. 10
figure 10

Example semantic segmentations from GTA5 \(\rightarrow \) Cityscapes of our method for 100 labeled target samples in comparison with DACS (Tranheden et al., 2021) adapted to SSDA and the baseline with/without source supervision

To further improve the Cross-Domain DepthMix, we apply the proposed Matching Geometry Sampling to overcome the geometric domain gap of source and target domain and to better align the geometric distribution of the mixed samples to the geometric target distribution as discussed in Sect. 3.5.3. Table 14 shows that it improves the mIoU by \(+\) 1.65 (\(+\) 0.16) percentage points for 100 (500) labeled target samples. The geometry and view alignment is probably more important for fewer labeled target samples because it is more difficult to bridge the geometric domain gap. For 100 labeled samples, the improvement mainly originates from difficult vehicles such as truck, bus, and motorcycle (see Table 15).

When combining the domain adaptive strategies (combined CDM \(+\) TDM and Matching Geometry Sampling) with the previous contributions from SSL, the SSDA performance can be further improved by \(+\) 3.01 (\(+\) 2.74) percentage points for 100 (500) labeled target samples (see Table 14). Overall, our contributions sum up to an improvement of \(+\) 17.26 (\(+\) 8.22) percentage points over the baseline using only target supervision and \(+\) 12.18 (\(+\) 8.89) percentage points over the baseline with target and source supervision. In particular, the performance of truck, bus, and train is increased by more than 50%, as shown in Table 15.

5.9 Comparison with State-of-the-Art SSDA Methods

Finally, we compare our framework with other state-of-the-art SSDA methods on the benchmarks GTA \(\rightarrow \) Cityscapes (Table 16) and Synthia \(\rightarrow \) Cityscapes (Table 17). For each method, its baseline performance is provided because the methods differ in their architecture and labeled subset. For better comparability between the architectures, we also show the relative performance in % with respect to the fully-supervised baseline. As the previous SSDA methods did not publish their implementation, labeled subset, or variance over the subset selection, we adapted the UDA state-of-the-art method DACS (Tranheden et al., 2021) to our framework for a fair comparison with a competitive method.

Considering the mIoU and the relative performance with respect to the fully-supervised baseline, our method noticeably outperforms the competitors for 100, 200, and 500 labeled target samples on both benchmarks. Only in the fully-supervised case does Chen et al. (2021a) achieve slightly better results. Moreover, even if we remove the data selection for annotation from our method, the previous statements still hold.

We would like to highlight that our method achieves 97.4% (GTA \(\rightarrow \) Cityscapes) and 98.7% (Synthia \(\rightarrow \) Cityscapes) of the fully-supervised baseline performance with only about 1/30 (100) of the target labels. With about 1/15 of the target labels, it even reaches the fully-supervised baseline performance. The improved performance for 100 labeled target samples can also be observed in Fig. 10, where our method better distinguishes difficult classes such as truck, bus, and train and produces more detailed segmentation contours for classes such as pole, traffic sign, and rider.

6 Conclusions

In this work, we have studied how self-supervised depth estimation (SDE) can be utilized to improve semantic segmentation in the single-domain semi-supervised and the domain-adaptive semi-supervised setting.

We introduce four effective strategies for leveraging the knowledge learned by SDE. First, we present an algorithm for automatic data selection for annotation based on SDE, which does not require a human in the loop and therefore increases flexibility, efficiency, and scalability. By combining diversity sampling based on features from self-supervised depth estimation with uncertainty sampling based on the depth student error, our method significantly outperforms random data selection and even entropy-based active learning, which requires a human in the loop. We show that, without knowledge of the class labels, our data selection for annotation prefers samples that contain difficult/rare classes (e.g. rider, truck, bus, and train). This results in a significantly higher semantic segmentation performance for these classes.

Second, we demonstrate that the proposed DepthMix strategy outperforms related mixing strategies by avoiding an inconsistent geometry of the generated images. We show that DepthMix effectively improves the performance for classes with a low baseline performance such as wall, fence, traffic light, rider, truck, bus, and train. We assume that DepthMix improves generalization by presenting labeled and pseudo-labeled instances with different backgrounds and occlusions.

Third, we show that the feature representation from self-supervised depth estimation can be transferred to semantic segmentation by means of SDE pretraining and multi-task learning of semantic segmentation and SDE. This is particularly effective for difficult classes surrounded by depth discontinuities such as wall, fence, pole, traffic light, traffic sign, rider, truck, and motorcycle. By using an ImageNet feature distance loss during the SDE pretraining, we mitigate the forgetting of useful semantic features from ImageNet pretraining and avoid the resulting performance drop for semantically similar classes such as truck, bus, train, and motorcycle.

And fourth, we show the effectiveness of the combined Cross-Domain and Target-Domain DepthMix as well as the Matching Geometry Sampling in a semi-supervised domain adaptation setting. The former effectively aligns source and target data as well as labeled and unlabeled target data to generate high-quality pseudo-labels for unlabeled target data. The latter samples source images with a scene geometry and camera pose similar to the target images to produce more realistic Cross-Domain DepthMix images.

A combination of the first three contributions in a single-domain semi-supervised framework can achieve even higher performance gains than the single components as the approaches address different aspects of the learning process. By using these SDE-based contributions, our approach results in state-of-the-art performance for semi-supervised semantic segmentation. Our method achieves 92% of the fully-supervised baseline performance with only 1/30 of the available labels and even slightly outperforms it with only 1/8 of the labels.

A combination of all four contributions in a semi-supervised domain adaptation framework improves the performance even further and outperforms previous state-of-the-art semi-supervised domain adaptation methods. On GTA \(\rightarrow \) Cityscapes, our method achieves even 97% of the fully-supervised baseline performance with only 1/30 of the target labels. This roughly corresponds to only 150 working hours for data annotation for the target domain instead of 4460 working hours.

All in all, our findings suggest that SDE can be a valuable source of self-supervision for semantic segmentation, improving the semantic segmentation performance and reducing the number of necessary annotations.