Improving Semi-Supervised and Domain-Adaptive Semantic Segmentation with Self-Supervised Depth Estimation

Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process. To address this issue, we present a framework for semi-supervised and domain-adaptive semantic segmentation, which is enhanced by self-supervised monocular depth estimation (SDE) trained only on unlabeled image sequences. In particular, we utilize SDE as an auxiliary task comprehensively across the entire learning framework: First, we automatically select the most useful samples to be annotated for semantic segmentation based on the correlation of sample diversity and difficulty between SDE and semantic segmentation. Second, we implement a strong data augmentation by mixing images and labels using the geometry of the scene. Third, we transfer knowledge from features learned during SDE to semantic segmentation by means of transfer and multi-task learning. And fourth, we exploit additional labeled synthetic data with Cross-Domain DepthMix and Matching Geometry Sampling to align synthetic and real data. We validate the proposed model on the Cityscapes dataset, where all four contributions demonstrate significant performance gains, and achieve state-of-the-art results for semi-supervised semantic segmentation as well as for semi-supervised domain adaptation. In particular, with only 1/30 of the Cityscapes labels, our method achieves 92% of the fully-supervised baseline performance and even 97% when exploiting additional data from GTA. The source code is available at https://github.com/lhoyer/improving_segmentation_with_selfsupervised_depth.


Introduction
Convolutional Neural Networks (CNNs) (LeCun et al., 1998) have achieved state-of-the-art results for various computer vision tasks including semantic segmentation (Long et al., 2015;Chen et al., 2017). However, training CNNs typically requires large-scale annotated datasets, due to millions of learnable parameters involved. Collecting such training data relies primarily on manual annotation. For semantic segmentation, the process can be particularly costly, due to the required dense annotations. For example, annotating a single image of the Cityscapes dataset took on average 1.5 hours (Cordts et al., 2016). For the training set, this sums up to 4460 working hours only for the annotation. For more challenging environmental conditions such as fog, snow, or nighttime, the annotation can be even more expensive. For instance, the annotation of one image of the ACDC dataset (Sakaridis et al., 2021) took 3.3 hours on average.
Recently, self-supervised learning (Doersch et al., 2015; Gidaris et al., 2018; He et al., 2020) has been shown to be a promising replacement for manually labeled data. It aims to learn representations from the structure of unlabeled data, instead of relying on a supervised loss, which requires manual labels. In particular, the principle has successfully been applied in depth estimation for stereo pairs (Godard et al., 2017) or image sequences (Zhou et al., 2017). Additionally, semantic segmentation is known to be tightly coupled with depth. Several works have reported that jointly learning segmentation and supervised depth estimation can benefit the performance of both tasks (Vandenhende et al., 2021). Motivated by these observations, we investigate the question: How can we leverage self-supervised depth estimation to improve semantic segmentation?
Fig. 1 Our method utilizes self-supervised depth estimation (SDE) in order to improve the holistic learning process of semantic segmentation. In comparison to the standard learning pipeline (a), we learn SDE from unlabeled image sequences and utilize it to improve the data selection, data augmentation, and training process (b). Further, we extend our framework to semi-supervised domain adaptation (SSDA), where SDE is used to align domains by Matching Geometry Sampling and Cross-Domain DepthMix (c).
In this work, we propose to utilize self-supervised monocular depth estimation (SDE) (Godard et al., 2017;Zhou et al., 2017;Godard et al., 2019) to improve the performance of semantic segmentation and to reduce the number of necessary annotations. For this purpose, we consider the holistic learning process covering data selection for annotation, data augmentation, domain adaptation, and multi-task learning. For each step, we show how SDE can effectively be utilized to improve the semantic segmentation performance. In contrast to most previous works, which only exploit supervised depth information during the multi-task learning (Vandenhende et al., 2021), we utilize self-supervised depth estimation as an auxiliary task comprehensively across the entire learning pipeline and show that it is critical to effectively improve the segmentation performance.
We apply our framework to the semi-supervised learning (SSL) and the semi-supervised domain adaptation (SSDA) setting. In SSL, only a part of the underlying dataset is labeled for semantic segmentation, while in SSDA additional labeled data from another (often synthetic) domain is provided. Fig. 1 compares the standard learning pipeline (Fig. 1a) with our SDE-enhanced framework for SSL (Fig. 1b) and our method for SSDA (Fig. 1c).
In our SSL framework (see Fig. 1b), we utilize SDE learned on unlabeled image sequences to improve the learning pipeline in three places.
First, we propose an automatic data selection for annotation, which selects the most useful samples to be annotated in order to maximize the gain. The selection is iteratively driven by two criteria: diversity and uncertainty. Both are computed through a novel use of SDE as a proxy task. While our method follows the active learning cycle (i.e. model training → query selection → annotation → model training) (Settles, 2009), it does not require a human in the loop to provide semantic segmentation labels, as the human is replaced by a proxy-task SDE oracle. This greatly improves flexibility, scalability, and efficiency, especially when crowdsourcing platforms are used for annotation.
Second, we propose a strong data augmentation strategy, DepthMix, which blends images as well as their labels according to the geometry of the scenes obtained from SDE.
In comparison to previous methods (Yun et al., 2019;Olsson et al., 2021), DepthMix explicitly respects the geometric structure of the scenes and generates realistic occlusions as the distance of objects to the camera is considered.
And third, we deploy SDE as an auxiliary task for semantic image segmentation under a transfer learning and multi-task learning framework and show that it noticeably improves the performance of semantic segmentation, especially when semantic supervision is limited. Previous works focus on improving SDE instead of semantic segmentation (Chen et al., 2019c;Guizilini et al., 2020b) or only consider the special cases of full supervision (Klingner et al., 2020b) and pretraining (Jiang et al., 2018).
Furthermore, we extend the contributions from SSL to SSDA in order to utilize additional synthetic (source) training data (see Fig. 1c). As synthetic data can often be annotated automatically for semantic segmentation, it is a valuable source of supervision and can further reduce the annotation effort for the real (target) data. We demonstrate that the previous contributions are effective for SSDA as well. In order to better bridge the domain gap between source data and target data, we combine the previous Target-Domain DepthMix (i.e. the single-domain DepthMix of our SSL method applied to the target domain) with an additional Cross-Domain DepthMix, which mixes samples from the source domain and the target domain. In that way, our framework is able to align the distribution of labeled source data with labeled target data (Cross-Domain DepthMix) and unlabeled target data with labeled target data (Target-Domain DepthMix). As the geometric distribution of the source domain is not aligned with the target domain and the Cross-Domain DepthMix can suffer from blending samples with different geometric distributions, we further introduce a Matching Geometry Sampling based on SDE to better align the camera pose and scene geometry of the source samples with the target domain.
The main advantage of our method is that we can learn from a large base of easily accessible unlabeled image sequences and utilize the learned knowledge from SDE to improve semantic segmentation performance over the entire training process. This largely alleviates the need for expensive semantic segmentation annotations. In our experimental evaluation on Cityscapes (Cordts et al., 2016), we demonstrate significant performance gains of all four components and improve the previous state of the art for SSL as well as for SSDA by a considerable margin. Importantly, our contributions are complementary and yield even higher improvements when they are combined in a unified framework. Specifically, in an SSL setting, our method achieves 92% of the fully-supervised model performance with only 1/30 available labels and even slightly outperforms the fully-supervised model with only 1/8 labels. In the SSDA setting with additional supervision from the synthetic GTA5 dataset (Richter et al., 2016), our method achieves even 97% of the fully-supervised model performance with only 1/30 of the target labels.
Our contributions are summarized as follows:
(1) We propose a novel automatic data selection for annotation based on SDE to improve the flexibility of active learning for semantic segmentation. It replaces the human annotator with an SDE oracle and lifts the requirement of having a human in the loop of active learning.
(2) We propose DepthMix, a strong data augmentation strategy based on self-supervised depth estimation, which respects the geometry of the scene.
(3) We utilize SDE as an auxiliary task to exploit depth features learned on unlabeled image sequences to significantly improve the performance of semantic segmentation by transfer and multi-task learning. In combination with (1) and (2), we achieve state-of-the-art results for semi-supervised semantic segmentation on Cityscapes.
(4) We propose a novel semi-supervised domain adaptation method, which combines Target-Domain DepthMix with Cross-Domain DepthMix. Further, Matching Geometry Sampling aligns the camera pose and scene geometry during the mixing process towards the target domain. We show that our method achieves state-of-the-art results for SSDA on GTA5→Cityscapes and Synthia→Cityscapes.
This work is an extension of our IEEE Conference on Computer Vision and Pattern Recognition 2021 paper, which focuses on contributions (1) to (3). This article further introduces SSDA utilizing SDE, using both the previous contributions for SSL and the newly proposed combined Cross-Domain / Target-Domain DepthMix and Matching Geometry Sampling. Also, we extend the ablation studies, detail the analysis (e.g. by class-wise performance insights and by a class frequency analysis of the data selection), and improve the presentation of the unified SDE-enhanced learning framework.
Related Work

Self-Supervised Depth Estimation (SDE)
Self-supervised depth estimation (SDE) aims to learn depth estimation from the geometric relations of stereo image pairs (Garg et al., 2016; Godard et al., 2017) or monocular videos (Zhou et al., 2017). Due to the better availability of videos, we use the latter approach, where a neural network estimates the depth and the camera motion of two subsequent images and a photometric loss is computed after a differentiable warping. If the camera intrinsics are not known, their estimation can be incorporated into the learning process as well (Gordon et al., 2019). Follow-up works propose improvements of the loss function (Godard et al., 2019; Gonzalez Bello and Kim, 2020; Shu et al., 2020), network architecture (Guizilini et al., 2020a), and training scheme (Pilzer et al., 2018, 2019; Casser et al., 2019). To handle dynamic objects, several works (Yin and Shi, 2018; Chen et al., 2019c; Ranjan et al., 2019) extend the projection model and combine depth estimation with optical flow estimation.

Active Learning
Active learning methods iteratively select the most informative samples to be annotated. Two main directions for the selection heuristic can be differentiated. On the one hand, uncertainty-based approaches select samples with a high uncertainty estimated based on, e.g., entropy (Hwa, 2004; Settles and Craven, 2008) or ensemble disagreement (Seung et al., 1992; McCallum and Nigam, 1998). However, this can be prone to querying outliers. On the other hand, diversity-based approaches select the samples that most increase the diversity of the labeled set (Sener and Savarese, 2018; Sinha et al., 2019). For segmentation, active learning is typically based on uncertainty measures such as MC dropout (Gal and Ghahramani, 2016; Yang et al., 2017; Mackowiak et al., 2018), entropy (Kasarla et al., 2019; Xie et al., 2020), or multi-view consistency (Siddiqui et al., 2020). In contrast to these works, we perform automatic data selection for annotation by replacing the human with an SDE model as oracle. Therefore, we do not require human-in-the-loop annotation during the active learning cycle. Previous works performing data selection without a human in the loop are restricted to shallow models (Yu et al., 2006; Nie et al., 2013; Li et al., 2018), classification with low-dimensional inputs (Li et al., 2020a), or do not perform an iterative data selection (Zheng et al., 2019) to dynamically adapt to the uncertainty of the model trained on the currently labeled set.

Semi-Supervised Semantic Segmentation
Semi-supervised semantic segmentation makes use of additional unlabeled data during training. An early line of work (Souly et al., 2017; Hung et al., 2018; Mittal et al., 2019) utilizes generative adversarial networks (Goodfellow et al., 2014) in order to include the unlabeled data into the training. Another increasingly popular direction is self-training with pseudo-labels (Lee, 2013), which alternates between prediction of pseudo-labels for unlabeled data and model retraining on the (pseudo-)labeled data. To construct the pseudo-labels, a popular approach is the mean teacher framework (Tarvainen and Valpola, 2017). It constructs the teacher network for pseudo-label generation from the exponential moving average of the weights of the student network. In order to avoid lazily mimicking the teacher's predictions and resisting updates, ATSO (Huo et al., 2021) splits the dataset into two parts, trains a model on each, and uses the model trained on one dataset to label the other. Similarly, CPS (Chen et al., 2021b) utilizes two networks with different initialization to generate the pseudo-labels for each other. Further extensions for self-training include training an additional error correction network (Mendel et al., 2020) and dynamically weighting pseudo-labels according to the agreement between two models (Feng et al., 2020b).
Self-training is often combined with consistency training, where perturbations are applied to unlabeled images or their intermediate features and a loss term enforces consistency of the predictions. For instance, Ouali et al. (2020) study perturbation of encoder features, Lai et al. (2021) enforce consistency of overlapping regions of two crops of the same image with different context, and Sohn et al. (2020) train the model on strongly augmented images while the pseudo-labels are generated only with weak augmentation. This general framework is extended by several strong augmentation strategies designed for semantic segmentation. CutMix (Yun et al., 2019; French et al., 2020) mixes crops from images and their pseudo-labels to generate additional training data, ClassMix uses class segments of pseudo-labels to build the mix mask, and Dvornik et al. (2019) paste instance crops into matching context regions of other images. Our proposed DepthMix module is inspired by these methods, but it further respects the geometry of the scene when mixing samples in order to produce realistic occlusions.

Multi-Task Learning of Semantic Segmentation and Self-Supervised Depth Estimation
Jointly learning semantic segmentation and SDE was studied in previous works with the goal of improving depth estimation. Several works (Ramirez et al., 2018;Jiao et al., 2018;Yang et al., 2018;Chen et al., 2019a;Klingner et al., 2020b) learn both tasks jointly with a single network. Another line of work (Casser et al., 2019;Guizilini et al., 2020b;Jiang et al., 2019) distills knowledge from a teacher semantic segmentation network to guide SDE. To further utilize coherence between semantic segmentation and SDE, Ramirez et al. (2018) and Chen et al. (2019a) propose a loss term to encourage spatial proximity between depth discontinuities and segmentation contours. As moving objects break the static world assumption of the SDE warping process, Casser et al. (2019) and Klingner et al. (2020b) incorporate dynamic object segmentations into the SDE loss calculation.
In contrast to these works, we do not aim to improve SDE but rather semi-supervised semantic segmentation. The closest to our approach are Jiang et al. (2018), Novosel et al. (2019), and Klingner et al. (2020b). Jiang et al. (2018) utilize relative depth computed from optical flow to replace ImageNet pretraining for semantic segmentation. In contrast, we additionally study multi-task learning of SDE and semantic segmentation and show that combining SDE with ImageNet features can further boost performance. Novosel et al. (2019) and Klingner et al. (2020b) improve the semantic segmentation performance by jointly learning with SDE. However, they focus on the fully-supervised setting, while our work explicitly addresses the challenges of semi-supervised semantic segmentation by using the depth estimates to generate additional training data and an automatic data selection mechanism based on SDE. Another work (Klingner et al., 2020a) supports the usefulness of SDE by improving the robustness of semantic segmentation.

Domain Adaptive Semantic Image Segmentation
A special kind of semi-supervised semantic segmentation is domain adaptation, where the unlabeled and labeled data originate from different domains. Different domains can be, for instance, real and synthetic data (Hoffman et al., 2016) or data captured under different conditions such as daytime/nighttime or weather (Sakaridis et al., 2018). Further, one can distinguish between unsupervised domain adaptation (UDA), if no labeled target data is available, and semi-supervised domain adaptation (SSDA), if a small number of annotations is available for the target domain.
For semantic segmentation, the better-studied scenario is UDA. In order to overcome the domain shift from the source to the target domain, adversarial training can be applied to the input (Hoffman et al., 2018), feature, and output space (Vu et al., 2019a). Also, non-adversarial input style transfer methods can be utilized (Yang and Soatto, 2020; Kim and Byun, 2020). An increasingly popular approach for UDA is self-training (Chapelle et al., 2009), where high-confidence predictions of a trained model are used to generate pseudo-labels for unlabeled data to iteratively improve the model (Zou et al., 2018; Wei et al., 2018). DACS shows that ClassMix can also be applied to images from different domains. In contrast to DACS, our method uses the proposed DepthMix strategy, which respects the geometry of the scene during mixing to avoid geometric artifacts, and it combines Cross-Domain DepthMix with Target-Domain DepthMix for effective SSDA. Furthermore, we propose Matching Geometry Sampling to align the scene geometry and camera perspective for Cross-Domain DepthMix. A similar approach has been developed by Li et al. (2020b) by sampling images from the source domain with a similar semantic layout as the target domain. However, they do not perform data mixing, do not consider the geometry of the scene, and rely on the generalization from the semantic segmentation network trained on the source domain to the target domain in order to perform the semantic layout matching. As we use SDE, which can be trained on both the source and the target domain, our Matching Geometry Sampling lifts this assumption.
In contrast to UDA, semi-supervised domain adaptation (SSDA), where a few annotations are also available for the target domain, is less studied. Kalluri et al. (2019) propose a framework with a domain-shared encoder and a domain-specific decoder with additional entropy minimization in a separate embedding space. Wang et al. (2020) extend adversarial domain alignment from UDA and utilize the additional target labels by applying feature-level adversarial domain alignment between labeled source and labeled target samples. For that, a spatial and a class-wise discriminator are introduced to mitigate inter-class confusions. To produce a better feature representation, Alonso et al. (2021) extend self-training with a student-teacher framework by contrastive learning (Hadsell et al., 2006). Concurrent to our work, Chen et al. (2021a) propose to train one teacher model on domain-mixed batches and one teacher model on CutMix (Yun et al., 2019; French et al., 2020) batches. A student model is trained on an ensemble of the two teachers and iterative pseudo-labeling is applied to the training of teachers and students. In contrast to these works, our method requires neither sensitive adversarial training nor costly ensemble training. Also, instead of CutMix, we utilize our DepthMix algorithm, which produces geometrically valid synthesized samples. Further, we propose a combined Cross-Domain and Target-Domain DepthMix as well as a Matching Geometry Sampling, which leads to more effective SSDA.

Auxiliary Depth Estimation for Domain Adaptation
For UDA, depth estimates can be another valuable source of supervision to align the domains. For that purpose, SPIGAN (Lee et al., 2018) and DADA (Vu et al., 2019b) extend domain adversarial learning with privileged depth information from the source domain. GIO-Ada (Chen et al., 2019b) additionally utilizes the depth information for input style transfer. By providing depth information from the target domain as well, ATDT (Ramirez et al., 2019) learns a bottleneck feature domain transfer network with depth supervision on both domains, which generalizes to semantic segmentation. In contrast to our work, these approaches require depth ground truth, which can be difficult to acquire.
Concurrently to this work, SDE has been studied as an auxiliary task for unsupervised domain adaptation. Guizilini et al. (2021) utilize multi-task learning of semantic segmentation and SDE to learn a more domain-invariant representation. Instead of applying the view synthesis loss from SDE directly, Wang et al. (2021) use depth pseudo-labels from an SDE teacher network to learn depth estimation and semantic segmentation in a multi-tasking framework. To better transfer knowledge between both domains and tasks, the correlation of depth and semantic segmentation features is explicitly transferred from the source to the target domain and the depth adaptation difficulty is transferred to semantic segmentation to weigh the trust in the semantic segmentation pseudo-labels. Using (self-supervised) depth estimation for semi-supervised domain adaptation, however, has not been studied so far.

Methods
In this section, we present our four approaches to improve the performance of semantic segmentation with self-supervised depth estimation (SDE). They focus on four different aspects of the training process, covering data selection for annotation, data augmentation, multi-task learning, and domain adaptation. Given $N$ images and $M$ image sequences from the same domain, our first method, automatic data selection for annotation, uses SDE learned on the $M$ (unlabeled) sequences to select $N_A$ images out of the $N$ images for human annotation (see Sec. 3.2). Our second approach, termed DepthMix, leverages the learned SDE to create geometrically sound 'virtual' training samples from pairs of labeled images and their annotations (see Sec. 3.3). Our third method learns semantic segmentation with SDE as an auxiliary task under a multi-task framework (see Sec. 3.4). The learning is reinforced by a multi-task pretraining process combining SDE with image classification. And fourth, we extend our method to semi-supervised domain adaptation (SSDA) in order to utilize additional synthetic data, which has a low labeling effort (see Sec. 3.5). To address the domain gap, we propose a combined Cross-Domain and Target-Domain DepthMix strategy, which is enhanced by Matching Geometry Sampling.

Self-Supervised Depth Estimation (SDE)
For self-supervised depth estimation (SDE), we follow the method of Godard et al. (2019), which we briefly introduce in the following. We first train a depth estimation network to predict the depth of a target image and a pose estimation network to estimate the camera motion from the target image and the source image. Depth and pose are used to produce a differentiable warping to transform the source image into the target image. The photometric error between the target image and multiple warped source frames is combined by a pixel-wise minimum. In addition, stationary pixels are masked out and an edge-aware depth smoothness term is applied, resulting in the final SDE loss $\mathcal{L}_D$. We refer the reader to the original paper (Godard et al., 2019) for more details.
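The pixel-wise minimum over several warped source frames can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the function name `min_photometric_error` is hypothetical, a plain L1 error stands in for the photometric error, the source frames are assumed to be warped into the target view already, and the stationary-pixel mask and smoothness term are left out.

```python
import numpy as np

def min_photometric_error(target, warped_sources):
    """Per-pixel minimum of L1 photometric errors (illustrative only).

    target: (H, W, 3) float array.
    warped_sources: list of (H, W, 3) float arrays, each a source frame
        that has already been warped into the target view.
    Returns an (H, W) array holding the smallest error per pixel.
    """
    # One error map per warped source frame, averaged over color channels.
    errors = [np.abs(target - w).mean(axis=-1) for w in warped_sources]
    # Keep, for every pixel, the source frame that explains it best.
    return np.min(np.stack(errors, axis=0), axis=0)

# Toy example: the second source matches the target except at pixel (0, 0).
target = np.zeros((2, 2, 3))
prev_warped = np.full((2, 2, 3), 0.5)
next_warped = np.zeros((2, 2, 3))
next_warped[0, 0] = 1.0
err = min_photometric_error(target, [prev_warped, next_warped])
```

Taking the minimum rather than the average means a pixel that is occluded in one source frame can still be explained by another frame in which it is visible.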

Fig. 2 The automatic data selection for annotation selects the most useful samples from the set of unlabeled data $G_U$ to be annotated. In contrast to active learning, the human annotator is replaced by an SDE oracle, and the samples are selected according to depth estimation as a proxy task. This lifts the requirement of a human in the loop. Samples are selected according to SDE feature diversity (Sec. 3.2.1) and depth student uncertainty (Sec. 3.2.2).

Automatic Data Selection for Annotation
We use SDE as a proxy task for selecting $N_A$ samples out of a set of $N$ unlabeled samples for a human to create semantic segmentation labels. The selection is conducted progressively in multiple steps, similar to the standard active learning cycle (model training → query selection → annotation → model training). However, our data selection is fully automatic and does not require a human in the loop as the annotation is done by a proxy-task SDE oracle as visualized in Fig. 2. Let $G$, $G_A$, and $G_U$ denote the whole image set, the selected subset for annotation, and the unselected subset, respectively. Initially, we have $G_A = \emptyset$ and $G_U = G$. The selection is driven by two criteria: diversity and uncertainty. Diversity sampling encourages the selected images to be diverse and cover different scenes. Uncertainty sampling favors adding unlabeled images that are near a decision boundary (with high uncertainties) of the model trained on the current $G_A$. For uncertainty sampling, we need to train and update the model with $G_A$. It is inefficient to repeat this every time a new image is added. For the sake of efficiency, we divide the selection into $T$ steps and only train the model $T$ times. In each step $t$, $n_t$ images are selected and moved from $G_U$ to $G_A$, so we have $\sum_{t=1}^{T} n_t = N_A$. After each step $t$, a model is trained on $G_A$ and evaluated on $G_U$ to get updated uncertainties for step $t + 1$.

Diversity Sampling
To ensure that the chosen annotated samples are diverse enough to represent the entire dataset well, we use an iterative farthest point sampling based on the L2 distance over features $\Phi^{SDE}$ computed by an intermediate layer of the SDE network. At step $t$, for each of the $n_t$ samples, we choose the one in $G_U$ with the largest distance to the current annotation set $G_A$. The set of selected samples $G_A$ is iteratively extended by moving one image at a time from $G_U$ to $G_A$ until the $n_t$ images are collected:
$$G_A = G_A \cup \{I_i\}, \qquad G_U = G_U \setminus \{I_i\}, \tag{1}$$
$$i = \arg\max_{I_i \in G_U} \min_{I_j \in G_A} \left\| \Phi^{SDE}_i - \Phi^{SDE}_j \right\|_2, \tag{2}$$
where $\Phi^{SDE}_i$ denotes the SDE features of image $I_i$.
Algorithm 1 Automatic Data Selection. In step $t = 1$, the index $i$ is obtained according to Eq. 2; in all later steps, it is obtained according to Eq. 4.
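The farthest point sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, each image is assumed to be summarized by one feature vector, and a non-empty seed set is assumed because the text does not say how the very first image is chosen.

```python
import numpy as np

def farthest_point_step(features, annotated, unlabeled):
    """Return the unlabeled index farthest from the annotated set.

    features: (N, D) array, one feature vector per image (assumed layout).
    annotated, unlabeled: lists of row indices into `features`.
    """
    # Pairwise L2 distances, shape (len(unlabeled), len(annotated)).
    diff = features[unlabeled][:, None, :] - features[annotated][None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Distance of every candidate to its nearest annotated sample.
    nearest = dist.min(axis=1)
    return unlabeled[int(np.argmax(nearest))]

def diversity_sampling(features, annotated, unlabeled, n_t):
    """Move n_t images, one at a time, from the unlabeled to the annotated set."""
    annotated, unlabeled = list(annotated), list(unlabeled)
    for _ in range(n_t):
        i = farthest_point_step(features, annotated, unlabeled)
        annotated.append(i)
        unlabeled.remove(i)
    return annotated, unlabeled

# Toy 1-D features; image 0 serves as the assumed seed of the annotated set.
features = np.array([[0.0], [1.0], [10.0], [4.0]])
annotated, unlabeled = diversity_sampling(features, [0], [1, 2, 3], n_t=2)
print(annotated, unlabeled)  # [0, 2, 3] [1]
```

Because the nearest-neighbor distance is recomputed after every move, the second pick (image 3) is the one farthest from both image 0 and the freshly added image 2.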

Uncertainty Sampling
While diversity sampling is able to select diverse new samples, it is unaware of the uncertainties of a semantic segmentation model over these samples. Our uncertainty sampling aims to select difficult samples, i.e., samples in $G_U$ that the model trained on the current $G_A$ cannot handle well. In order to train this model, active learning typically uses a human-in-the-loop strategy to add annotations for selected samples. In this work, we use a proxy task based on self-supervised annotations, which can run automatically, to make the method more flexible and efficient. Since our target task is single-image semantic segmentation, we choose to use single-image depth estimation (SIDE) as the proxy task. Importantly, due to our SDE framework, depth pseudo-labels are available for $G$. Using these pseudo-labels, we train a SIDE method on $G_A$ and measure the uncertainty of its depth predictions on $G_U$. Due to the high correlation of single-image semantic segmentation and SIDE, the generated uncertainties are informative and can be used to guide our sampling procedure. As the depth student model is trained only on $G_A$, it can specifically approximate the difficulty of candidate samples with respect to the already selected samples in $G_A$. The student is trained from scratch in each step $t$, instead of being fine-tuned from $t - 1$, to avoid getting stuck in the previous local minimum. Note that the SDE method is trained on a much larger unlabeled dataset, i.e., the $M$ image sequences, and can provide good guidance for the SIDE method.
In particular, the uncertainty is signaled by the disparity error between the student network $f_{SIDE}$ and the teacher network $f_{SDE}$ in the log-scale space under L1 distance:
$$E(i) = \left\| \log(1 + f_{SIDE}(I_i)) - \log(1 + f_{SDE}(I_i)) \right\|_1. \tag{3}$$
As the disparity difference of far-away objects is small, the log-scale is used to avoid the loss being dominated by close-range objects. This criterion can be added into Eq. 2 to also select samples with higher uncertainties for the dataset update in Eq. 1:
$$i = \arg\max_{I_i \in G_U} \min_{I_j \in G_A} \left\| \Phi^{SDE}_i - \Phi^{SDE}_j \right\|_2 + \lambda_E E(i), \tag{4}$$
where $\lambda_E$ is a parameter to balance the contribution of the two terms. For diversity sampling, we still use SDE features instead of SIDE student features as SDE is trained on the entire dataset, which provides better features for diversity estimation. When $n_t$ images have been selected according to Eq. 1 and Eq. 4 at step $t$, a new SIDE model will be trained on the current $G_A$ in order to continue further. As presented previously, our selection proceeds progressively in $T$ steps until we collect all $N_A$ images. The algorithm of this selection is summarized in Alg. 1, where $\sum_{t'=1}^{t} n_{t'}$ describes the desired size of $G_A$ at the end of step $t$.
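A small sketch may help to see how the uncertainty term shifts the selection. Everything here is an illustrative assumption rather than the authors' code: the function names are hypothetical, `np.log1p` is used for the log-scale so that zero disparities stay finite, the L1 error is averaged over pixels, and the per-image uncertainties are passed in as a precomputed array.

```python
import numpy as np

def depth_uncertainty(student_disp, teacher_disp):
    """Mean absolute log-scale disparity error between student and teacher."""
    return float(np.mean(np.abs(np.log1p(student_disp) - np.log1p(teacher_disp))))

def select_index(features, uncertainty, annotated, unlabeled, lambda_e):
    """Pick the candidate maximizing diversity plus weighted uncertainty.

    features: (N, D) per-image feature vectors (assumed layout).
    uncertainty: (N,) per-image student uncertainty.
    annotated, unlabeled: lists of row indices.
    """
    diff = features[unlabeled][:, None, :] - features[annotated][None, :, :]
    # Diversity term: distance to the nearest already selected image.
    nearest = np.linalg.norm(diff, axis=-1).min(axis=1)
    score = nearest + lambda_e * uncertainty[unlabeled]
    return unlabeled[int(np.argmax(score))]

features = np.array([[0.0], [1.0], [10.0], [4.0]])
uncertainty = np.array([0.0, 0.0, 0.0, 2.0])

# Small weight: the most distant image (index 2) still wins.
pick_low = select_index(features, uncertainty, [0], [1, 2, 3], lambda_e=1.0)
# Larger weight: the uncertain image (index 3) overtakes it.
pick_high = select_index(features, uncertainty, [0], [1, 2, 3], lambda_e=4.0)
```

In the toy data, raising `lambda_e` from 1 to 4 flips the choice from the most distant image to the most uncertain one, which is the trade-off the balancing parameter controls.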

DepthMix Data Augmentation
Inspired by the recent success of data augmentation approaches that mix pairs of images and their (pseudo-)labels to generate more training samples for semi-supervised semantic segmentation (Yun et al., 2019; French et al., 2020; Olsson et al., 2021), we propose an algorithm, termed DepthMix, to utilize self-supervised depth estimates to maintain the integrity of the scene structure during mixing.
Given two images $I_i$ and $I_j$ of the same size, we would like to copy some regions from $I_i$ and paste them directly into $I_j$ to get a virtual sample $I'$. The copied regions are indicated by a binary mask $M$, which has the same size as the two images. The image creation is done as
$$I' = M \odot I_i + (1 - M) \odot I_j, \tag{5}$$
where $\odot$ denotes the element-wise product. The semantic segmentation labels of the two images $S_i$ and $S_j$ are mixed up with the same mask $M$ to generate the corresponding mixed semantic segmentation
$$S' = M \odot S_i + (1 - M) \odot S_j.$$
The mixing can be applied to labeled data and unlabeled data using human ground truths or pseudo-labels, respectively. Existing methods generate this mask $M$ in different ways, e.g., randomly sampled rectangular regions (Yun et al., 2019; French et al., 2020) or randomly selected class segments. In those methods, the structure of the scene is not considered and foreground and background are not distinguished. We find that images synthesized by these methods often violate the geometric relationships between objects. For instance, a distant object can be copied onto a close-range object or only unoccluded parts of mid-range objects are copied onto the other image. Imagine how strange it is to see a pedestrian standing on top of a car or to see the sky through a hole in a building (just as shown in Fig. 3 left).
Fig. 3 Concept of the proposed DepthMix data augmentation (refer to Sec. 3.3) and its baseline ClassMix shown for the mixing of the semantic segmentation labels. By utilizing SDE, DepthMix mitigates geometric artifacts such as missing occluders (bus-shaped hole in the building) or missing occlusion (legs of the person). The corresponding images are mixed in the same way.
Our DepthMix is designed to mitigate this issue. It uses the self-supervised depth estimates $\hat{D}_i$ and $\hat{D}_j$ of the two images to generate the mask $M$, which respects the notion of geometry. It is implemented by selecting only pixels from $I_i$ whose depth values are smaller than the depth values of the pixels at the same locations in $I_j$:

$$M(a, b) = \begin{cases} 1 & \text{if } \hat{D}_i(a, b) < \hat{D}_j(a, b) + \epsilon \\ 0 & \text{otherwise,} \end{cases} \quad (7)$$

where $a$ and $b$ are pixel indices, and $\epsilon$ is a small value to avoid conflicts of objects that are naturally at the same depth plane such as road or sky. By using this $M$, DepthMix respects the depth of objects in both images, such that only closer objects can occlude further-away objects. We illustrate this advantage of DepthMix with an example in Fig. 3.

In order to further utilize the unlabeled dataset $G_U$ for DepthMix, we generate pseudo-labels using the mean teacher algorithm (Tarvainen and Valpola, 2017), which is commonly deployed in SSL (Berthelot et al., 2019; Verma et al., 2019; French et al., 2020; Olsson et al., 2021). For that purpose, an exponential moving average is applied to the weights $\theta$ of the semantic segmentation model $g^S_\theta$ to obtain the weights of the mean teacher $\theta_T$:

$$\theta_T \leftarrow \alpha \, \theta_T + (1 - \alpha) \, \theta. \quad (8)$$

To generate the pseudo-labels, an argmax over the classes $C$ is applied to the prediction of the mean teacher:

$$S_U = \arg\max_{c \in C} \, g^S_{\theta_T}(I_U). \quad (9)$$

The mean teacher can be considered as a temporal ensemble, resulting in stable predictions for the pseudo-labels, while the argmax promotes confident predictions. In order to utilize the pseudo-labels, we apply DepthMix to two samples $(I_i, S_i)$, $(I_j, S_j)$ from the combined labeled and pseudo-labeled data pool $G_A \cup G_U$ to produce a mixed training pair $(I', S')$ according to Eq. 5. The semantic segmentation network is trained using the cross-entropy of labeled samples $(I_A, S_A)$ and the quality-weighted cross-entropy of mixed samples $(I', S')$, where $q'$ denotes the estimated quality of the mixed pseudo-label. It is the fraction of pixels for which the predicted probability of the most confident class $P'$ exceeds a threshold $\tau$. As the DepthMix segmentation $S'$ consists of labels from two images, we calculate $P'$ as the mix of its sources:

$$P' = M \odot P_i + (1 - M) \odot P_j,$$

where $P_i$ is the predicted probability of the most confident class for unlabeled images and 1 for labeled images.

By applying DepthMix to labeled and pseudo-labeled samples, the network is exposed to image regions from both distributions in a single image. This can improve its generalization to the unlabeled data, as the context for labeled regions can originate from unlabeled data and vice versa. The improved generalization can lead to better pseudo-labels, which in turn improve the quality of the DepthMix labels.
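The mixing procedure of this section can be illustrated with a short NumPy sketch. It is not the authors' code: the function names are hypothetical, the depth maps are assumed to be dense arrays on a scale where a tolerance of 0.03 is meaningful, the tolerance is assumed to be added on the side of the second image, and the threshold `tau` is left as a parameter because its value is not stated here.

```python
import numpy as np

def depthmix_mask(depth_i, depth_j, eps=0.03):
    """Binary mask: 1 where image i is closer than image j (up to eps), else 0."""
    return (depth_i < depth_j + eps).astype(np.float64)

def mix(mask, x_i, x_j):
    """Take x_i where mask == 1 and x_j elsewhere (images, labels, or confidences)."""
    if x_i.ndim == mask.ndim + 1:
        # Broadcast the (H, W) mask over a trailing channel axis of an image.
        mask = mask[..., None]
    return np.where(mask > 0, x_i, x_j)

def mixed_quality(mask, conf_i, conf_j, tau):
    """Fraction of mixed pixels whose confidence exceeds tau.

    conf_* holds the probability of the most confident class per pixel
    (all ones for samples with human labels).
    """
    conf = mix(mask, conf_i, conf_j)
    return float(np.mean(conf > tau))
```

Because `mix` selects values instead of multiplying them, integer label maps keep their dtype and class IDs are never blended.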

Learning with Auxiliary Self-Supervised Depth Estimation
In this section, we utilize features learned by SDE from unlabeled image sequences to improve the performance of semantic segmentation through transfer and multi-task learning. For that purpose, we use a network with a shared encoder $f^E_\theta$, a separate depth decoder $f^D_\theta$, and a separate segmentation decoder $f^S_\theta$ (see Fig. 4). For effective multi-task learning, we use an attention-guided distillation module to exchange useful intermediate features between both decoders.

In order to initialize the pose estimation network and the depth branch $g^D_\theta = f^D_\theta \circ f^E_\theta$ properly, the architecture is first trained only on M unlabeled image sequences for SDE. As a common practice, we initialize the encoder with ImageNet weights, as they provide useful semantic features learned during image classification. To avoid forgetting these semantic features during the SDE pretraining, we utilize a feature distance loss between the current bottleneck features $f^E_\theta$ and the bottleneck features generated by the encoder with ImageNet weights. The loss for the depth pretraining is the weighted sum of the SDE loss and the ImageNet feature distance loss, where the latter is weighted by $\lambda_F$. To exploit the features from SDE for semantic segmentation by transfer learning, the weights from SDE $g^D_\theta$ are used to initialize the semantic segmentation branch $g^S_\theta$.
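As a rough illustration of the regularized pretraining objective, the following sketch combines an SDE loss value with a feature distance term. The Euclidean norm and the function names are assumptions made for this example; the text above only states that a feature distance is used and that the two terms are combined in a weighted sum.

```python
import numpy as np

def feature_distance(feat_current, feat_imagenet):
    """Euclidean distance between current and frozen ImageNet bottleneck features.

    The choice of norm is an assumption of this sketch.
    """
    diff = np.asarray(feat_current) - np.asarray(feat_imagenet)
    return float(np.linalg.norm(diff))

def pretraining_loss(sde_loss, feat_dist, lambda_f=1e-2):
    """Weighted sum of the SDE loss and the ImageNet feature distance loss."""
    return sde_loss + lambda_f * feat_dist
```

The frozen ImageNet encoder only provides targets, so no gradient would flow through `feat_imagenet` in an actual training loop.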

Semi-Supervised Domain Adaptation (SSDA)
Synthetic data can be another valuable source of low-effort semantic segmentation annotations to reduce the number of expensive target labels. In semi-supervised domain adaptation (SSDA), a neural network is trained to solve a task on the real (target) domain using a limited number of annotated target samples $(I^{trg}_A, S^{trg}_A)$, further unlabeled target images $I^{trg}_U$, and additional annotated data from the synthetic (source) domain $(I^{src}_A, S^{src}_A)$. Naively, the semantic segmentation network branch $g^S_\theta$ can be trained on the labeled samples from both the source and the target domain using a pixel-wise cross-entropy loss. However, as the labeled data from the target dataset is limited, this vanilla training strategy suffers from the gap between both domains.
In this work, we propose to use SDE to overcome the domain gap of SSDA. Extending the default setup, we augment both the target and the source dataset with self-supervised depth estimates. For that purpose, an SDE network $f^{trg}_D$ is trained on image sequences from the target domain and another SDE network $f^{src}_D$ is trained on image sequences from the source domain. Note that the image sequences can be different from the images labeled for semantic segmentation. After the SDE training, depth pseudo-labels are inferred for the images of the semantic segmentation datasets. Further, pseudo-labels $S^{trg}_U$ are obtained online according to Eq. 9 for the unlabeled target data. The additional depth and semantic segmentation pseudo-labels are added to the SSDA training data.
Based on this data, we propose a combined Cross-Domain and Target-Domain DepthMix in order to facilitate effective self-training across domains as well as across labeled and unlabeled samples, respectively. Further, we enhance the mixing by Matching Geometry Sampling. The training process is visualized in Fig. 5 and described in the following.

Target-Domain DepthMix (TDM)
Target-Domain DepthMix (TDM) applies the DepthMix algorithm to the target dataset. It mixes labeled and unlabeled target samples to improve the generalization from the labeled target to the unlabeled target samples due to the increased variety of objects in different contexts. Therefore, it can favorably affect the quality of the pseudo-labels. Target-Domain DepthMix uses the same procedure as the single-domain SSL DepthMix described in Sec. 3.3. It produces a mixed sample $(I_{TDM}, S_{TDM})$ based on two target samples according to Eq. 5-7. The segmentation branch of the network is trained using the pixel-wise cross-entropy on the mixed samples, where $q_{TDM}$ weighs the loss according to the certainty of the pseudo-label as described in Sec. 3.3.

Fig. 5: While CDM applies DepthMix to samples from source and target domain to align both domains, TDM mixes labeled and pseudo-labeled samples from the target domain to align labeled and unlabeled target data. The network is trained on clean labeled source data, CDM/TDM data, and clean labeled target data for semantic segmentation. The target semantic segmentation pseudo-labels are obtained online using a mean teacher network.
Mixing within a domain is only applied to the target domain and not to the source domain because the mixing serves the purpose of better generalization from labeled to unlabeled samples during the self-training. The source domain already contains many labeled samples. Therefore, self-training augmented by mixing is not necessary.

Cross-Domain DepthMix (CDM)
As there is only a small number of labeled samples available for the target domain, the trained network will still suffer from the gap between the source and target domain. To further align the domains during training, we propose Cross-Domain DepthMix, which mixes samples from both domains. This allows the network to better generalize across domains as both domains are present within each image.
Cross-Domain DepthMix utilizes one target sample and one source sample. If the target image is unlabeled, a pseudo-label is generated according to Eq. 9. The samples are mixed according to Eq. 5-7 to generate the cross-domain mixed sample $(I_{CDM}, S_{CDM})$. The segmentation branch of the network is trained using the pixel-wise cross-entropy on the mixed samples, where $q_{CDM}$ weighs the loss according to the certainty of the pseudo-label as described in Sec. 3.3.

The final SSDA loss combines all four segmentation losses as well as the SDE loss on the target domain, where the loss components are weighted equally.

Matching Geometry Data Sampling
For samples from two different domains, the camera pose can differ between the domains, as can be seen in the first three rows of Fig. 6. The geometric distribution difference between domains can impede the transfer of knowledge from the source to the target domain. For example, GTA contains samples from the view of a pedestrian, while all Cityscapes samples are recorded from a front-facing camera of a car. This leads to different camera perspectives, which can result in unrealistic mixed samples such as a car "flying" in the sky (second row), or samples outside the target distribution such as images captured right in front of a building (third row). We address this problem by sampling image pairs from the source and the target domain with a similar geometry with respect to the camera. The sampling is guided by the target geometry, which allows us to better match the geometric target distribution with mixed images.

We define the geometric difference $G(i, j)$ of two samples $i$ and $j$ as the L1 distance of the log-scale disparity (inverse depth) estimates in camera space, which corresponds to the metric used for the uncertainty sampling of our automatic data selection in Eq. 3. When calculating the geometric difference, we exclude the 80 pixel rows at the top of the image and the 100 pixel rows at the bottom. This prevents SDE artifacts in the sky and on the hood of the ego car from contaminating the geometric difference. The pixel-wise geometric difference is visualized in the third column of Fig. 6. It can be observed that it is generally higher for samples that do not have a matching geometry or camera perspective.

Based on a single target sample $i_{trg}$ and a set of candidate source samples $C_{src}$, which are both sampled randomly, the source sample with the smallest geometric difference is selected for training:

$$j^* = \arg\min_{j \in C_{src}} G(i_{trg}, j).$$

As the target sample is fixed during a matching step, it guides the selection towards the target distribution.
The number of candidate samples |C src | balances between a good geometric match and a higher sampling diversity. A larger number of candidates results in a potentially better geometric match of the chosen sample, but it reduces the diversity of the chosen samples as it limits the sampling to the set of source samples that have a small geometric distance to the target domain in general.
This Matching Geometry Sampling allows our method to avoid the described issues of naive sampling and results in realistic DepthMix images, which are closer to the target distribution as can be seen in the last row of Fig. 6.
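The sampling step described above can be sketched as follows. The helper names, the `eps` stabilizer, and the mean reduction over pixels are assumptions of this example; the crop of 80 rows at the top and 100 rows at the bottom follows the text and presumes images that are taller than 180 pixels.

```python
import random
import numpy as np

def geometric_difference(disp_a, disp_b, top=80, bottom=100, eps=1e-6):
    """Mean L1 distance of two log-disparity maps, ignoring sky and ego-hood rows."""
    h = disp_a.shape[0]
    a = np.log(disp_a[top:h - bottom] + eps)
    b = np.log(disp_b[top:h - bottom] + eps)
    return float(np.mean(np.abs(a - b)))

def sample_matching_source(target_disp, source_disps, num_candidates=5, rng=random):
    """Draw random source candidates and return the index of the best geometric match."""
    candidates = rng.sample(range(len(source_disps)), num_candidates)
    return min(candidates,
               key=lambda j: geometric_difference(target_disp, source_disps[j]))
```

Raising `num_candidates` trades sampling diversity for a closer geometric match, which mirrors the trade-off discussed for $|C_{src}|$.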

Datasets
Cityscapes: We mainly evaluate our method on the Cityscapes dataset (Cordts et al., 2016), which consists of 2975 training and 500 validation images with semantic segmentation labels from European street scenes. We downsample the images to 1024 × 512 pixels. Besides, random cropping to a size of 512 × 512 and random horizontal flipping are used during the training. Importantly, Cityscapes provides 20 unlabeled frames before and 10 after the labeled image, which are used for our SDE training. During the semi-supervised segmentation, only the 2975 images of the core dataset are used. If not stated otherwise, the same processing steps are applied to the following datasets as well.
CamVid: The CamVid dataset (Brostow et al., 2009) contains 367 training, 101 validation, and 233 test images with dense semantic segmentation labels for 11 classes from street scenes in Cambridge. To ensure a similar feature resolution as for Cityscapes, we upsample the CamVid images from 480 × 360 to 672 × 512 pixels and randomly crop them to a size of 512 × 512 pixels.

GTA5: The GTA5 dataset (Richter et al., 2016) originates from a computer game, which enabled time-efficient semi-automatic semantic segmentation annotation. It contains about 25k training images labeled using the same 19 classes as Cityscapes. The SDE is trained on another part of the dataset (Richter et al., 2017), which provides image sequences.

Synthia: The Synthia dataset (Ros et al., 2016) provides synthetic images with automatically generated annotations from a simulated urban environment. For semantic segmentation, we use the SYNTHIA-RAND-CITYSCAPES subset, which contains 9,400 samples labeled with 16 semantic classes common with Cityscapes. Following the standard protocol for domain adaptation, we train our method for the 16 semantic classes that are common with Cityscapes and evaluate on 13 of them. The SDE is trained on the SYNTHIA-SEQS video sequence subset.

Network Architecture
Our network consists of a shared ResNet101 (He et al., 2016) encoder with output stride 16, a decoder for segmentation, and a decoder for SDE. The decoder consists of an ASPP block with dilation rates of 6, 12, and 18 to aggregate features from multiple scales and another four upsampling blocks with skip connections (Ronneberger et al., 2015). For SDE, the upsampling blocks have a disparity side output at the respective scale. For effective multi-task learning, we additionally follow PAD-Net and deploy an attention-guided distillation module after the third decoder block. It serves the purpose of exchanging useful features between segmentation and depth estimation.

The design of the network architecture was chosen to facilitate both transfer and multi-task learning. To enable effective transfer learning, the task decoder branches have the same architecture and combine elements from typical semantic segmentation architectures such as the ASPP as well as the commonly used U-Net decoder structure (Ronneberger et al., 2015) for depth estimation. This allows for pretraining the segmentation decoder branch with SDE and repurposing it for semantic segmentation afterward. For the pose estimation network, we use the same design as in (Godard et al., 2019). For the SDE network on the source domains, we use an output stride of 32 and a reduced number of decoder channels in order to improve convergence.

Training
For the SDE pretraining, the depth and pose network are trained using the Adam optimizer, a batch size of 4, and an initial learning rate of $1 \times 10^{-4}$, which is divided by 10 after 160k iterations. The SDE loss is calculated on four scales with three subsequent frames. During the first 300k iterations, only the depth decoder and the pose network are trained. Afterwards, the depth encoder is fine-tuned with an ImageNet feature distance weight of $\lambda_F = 1 \times 10^{-2}$ for another 50k iterations. The encoder is initialized with ImageNet weights, either before depth pretraining or before semantic segmentation if depth pretraining is ablated. The baseline is trained with the same hyperparameters but only with a cross-entropy loss on the labeled samples. Its encoder is initialized with ImageNet pretrained weights.
For the semi-supervised multi-task learning, we train the network using SGD with a learning rate of $1 \times 10^{-3}$ for the encoder and depth decoder, $1 \times 10^{-2}$ for the segmentation decoder, and $1 \times 10^{-6}$ for the pose network. The learning rate is reduced by a factor of 10 after 30k iterations and the network is trained for another 10k iterations. A momentum of 0.9, a weight decay of $5 \times 10^{-4}$, and gradient norm clipping to 10 are used. The losses for segmentation and SDE are weighted equally. The mean teacher has $\alpha = 0.99$ and, within an iteration, the network is trained on a clean labeled batch and an augmented mixed batch, each of size 2. The latter uses DepthMix with $\epsilon = 0.03$, color jitter, and Gaussian blur. If only pseudo-labeling but no mixing is used in an experiment, color jitter and Gaussian blur are still applied to the augmented batch.
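The mean teacher update with $\alpha = 0.99$ and the pseudo-label generation can be sketched as below. The dictionary-of-arrays weight representation and the function names are assumptions chosen to keep the example framework-independent.

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Exponential moving average of the student weights (mean teacher update)."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}

def pseudo_labels(teacher_probs):
    """Convert (H, W, C) class probabilities into labels and per-pixel confidence."""
    return teacher_probs.argmax(axis=-1), teacher_probs.max(axis=-1)
```

With $\alpha = 0.99$, each update moves the teacher only 1% of the way towards the current student, which is what makes its predictions behave like a temporal ensemble.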
For SSDA, the same hyperparameters as in the SSL setting are used. For the Matching Geometry Sampling, the number of random source candidate samples is set to 5: |C src | = 5.

Automatic Data Selection for Annotation
For the automatic data selection, we use a slimmed network architecture for $f_{SIDE}$ with a ResNet50 backbone, reduced decoder channels, and BatchNorm (Ioffe and Szegedy, 2015) in the decoder for efficiency and faster convergence. The depth student network is trained with a berHu loss using Adam with a learning rate of $1 \times 10^{-4}$ and polynomial decay with exponent 0.9. For calculating the depth feature diversity, we use the output of the second depth decoder block after SDE pretraining. It is downsampled by average pooling to a size of 8 × 4 pixels and the feature channels are normalized to zero mean and unit variance over the dataset. The student depth error is weighted by $\lambda_E = 1000$. The number of selected samples ($\sum_{t'=1}^{t} n_{t'}$) is incrementally increased to 25, 50, 100, 200, 372, and 744. For each subset, a student depth network is trained from scratch for 4k, 8k, 12k, 16k, and 20k iterations, respectively, to calculate the student depth error and select the samples for the next subset. The quality of the selected subset with annotations $G_A$ is evaluated for semantic segmentation using our default architecture and training hyperparameters. For the entropy baseline, a semantic segmentation network is trained on $G_A$ and the samples with the highest mean pixel-wise Shannon entropy of the semantic segmentation prediction are greedily chosen from $G_U$ to extend $G_A$. Apart from that, the entropy baseline uses the same hyperparameters as our method.
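The selection rule of the entropy baseline can be sketched as follows. The function names and the clipping constant are assumptions; the probabilities are expected as softmax outputs of shape (H, W, C).

```python
import numpy as np

def mean_pixel_entropy(probs, eps=1e-12):
    """Mean pixel-wise Shannon entropy (in nats) of a (H, W, C) softmax output."""
    p = np.clip(probs, eps, 1.0)  # avoid log(0) for one-hot predictions
    return float(np.mean(-np.sum(p * np.log(p), axis=-1)))

def pick_most_uncertain(prob_maps, n):
    """Greedily return the indices of the n images with the highest mean entropy."""
    scores = [mean_pixel_entropy(p) for p in prob_maps]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:n]
```

Unlike the depth-based criterion, this score needs a segmentation model trained on the current labeled set, which is why the baseline requires annotations between selection rounds.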

Automatic Data Selection for Annotation
First, we evaluate the proposed automatic data selection (see Sec. 3.2) on the Cityscapes (Cordts et al., 2016) dataset. Tab. 1 shows a comparison of our method with a baseline and a competing method for different numbers of selected labeled samples. The first baseline selects the labeled samples randomly, while the second, strong competitor uses active learning and iteratively chooses the samples with the highest segmentation entropy. In contrast to our method, this requires a human in the loop to create the semantic labels for iteratively selected images.

Tab. 1 shows that our method with diversity sampling (DS) works better than with uncertainty sampling (US) for few labeled samples. We hypothesize that, for a small number of annotated samples, it is more important to cover the underlying distribution with a diverse subset than to cover only uncertain/difficult samples. For a larger subset, however, it makes sense to focus on the uncertain samples, as the common cases are most likely already covered. Further, it can be seen that combining diversity sampling and uncertainty sampling (DS+US) performs better than using them individually, showing that these criteria are complementary and cover two relevant aspects of selecting data for annotation. When comparing our method with both sampling criteria (DS+US) with the baselines "Random" and "Entropy", it can be seen that our method outperforms both comparison methods, demonstrating the effectiveness of ensuring diversity and exploiting difficult samples based on depth estimation. It also supports the assumption that depth estimation and semantic segmentation are correlated in terms of sample difficulty. With 1/4 of the labeled samples, our method achieves 98.8% of the fully-supervised performance and with only 1/8 of the samples it still reaches 94.8%.
Furthermore, the standard deviation of the achieved segmentation performance with our data selection is noticeably lower than for the random baseline when using few labeled samples, resulting in better reproducibility.
To better understand the underlying reasons for the improved performance, we analyze the class-wise IoU for 372 labeled samples in Tab. 2. It shows that our automatic data selection significantly improves the performance of difficult classes with a low IoU of the random baseline such as wall, fence, truck, bus, and train. In comparison to the strong active learning entropy baseline, our method achieves even better performance for the classes wall, rider, truck, and bus.
In order to investigate possible reasons for the improved performance of the automatic data selection, we visualize the ratio of the automatically selected pixels and total dataset pixels grouped by the ground truth class for 372 selected samples in Fig. 7. As expected, the ratio is about 0.125 for most of the classes when selecting 1/8 of the samples randomly (Fig. 7 left). For the entropy baseline and our method, it can be seen that a higher ratio of difficult/rare classes (e.g. truck, bus, and train) are sampled from the underlying training set, while a smaller ratio of common classes such as road and building are sampled. When comparing the class-wise IoU (Tab. 2) and the ratio of selected pixels (Fig. 7), it can be seen that the improvement for difficult classes is correlated with them being selected more frequently by the automatic data selection. Intuitively, more samples of rare and easy to confuse classes such as car, truck, bus, and train as well as wall and fence will help the classifier to distinguish them. When comparing the active learning entropy baseline to our method, Fig. 7 shows that our method selects a higher ratio of wall, person, rider, and truck, which directly connects to the higher class IoU for these classes as shown in Tab. 2. Please note that the class-statistics of Fig. 7 are not available to our method during the entire selection process. This demonstrates that our method is able to correctly estimate the utility of samples for subsequent semantic segmentation without knowing the ground truth labels during the selection.
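The per-class statistic behind Fig. 7 can be computed as in the following sketch. The function name is hypothetical, and labels at or above `num_classes` (such as an ignore label) are assumed to be excluded from the count.

```python
import numpy as np

def selected_pixel_ratio(label_maps, selected, num_classes):
    """Per class: pixels in the selected images divided by pixels in all images."""
    total = np.zeros(num_classes)
    chosen = np.zeros(num_classes)
    for idx, labels in enumerate(label_maps):
        # Count pixels per class and drop labels outside the class range.
        counts = np.bincount(labels.ravel(), minlength=num_classes)[:num_classes]
        total += counts
        if idx in selected:
            chosen += counts
    return chosen / np.maximum(total, 1)  # classes without pixels yield 0
```

For a random subset of 372 out of 2975 images, every class is expected to have a ratio close to 372/2975 ≈ 0.125, so deviations from this value indicate which classes a selection strategy favors.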

DepthMix Data Augmentation
Second, we study the proposed geometry-guided mixing strategy DepthMix (see Sec. 3.3). We evaluate the performance for the SSL setting with 372 of the labeled training samples (which corresponds to 1/8 of the labeled samples in Cityscapes) and the fully-supervised setting with 2975 samples. The subset of labeled samples is chosen randomly. Tab. 3 shows the mean and standard deviation of the mIoU in percent over three random seeds. Additionally, the improvement in percentage points of the analyzed components over the baseline, which only uses a cross-entropy loss on labeled samples, is shown. In accordance with the literature on semi-supervised mixing (French et al., 2020; Olsson et al., 2021; Sohn et al., 2020), we first add self-training with pseudo-labels from the mean teacher to the framework. As can be seen in Tab. 3, this already significantly improves the performance in the SSL setting by +3.24 mIoU percentage points. Still, our proposed DepthMix module further increases the performance by another +1.76 (+2.06) percentage points for 372 (2975) samples. Compared to ClassMix, the performance of DepthMix is still +0.98 (+0.23) percentage points higher for 372 (2975) samples. This demonstrates the effectiveness of the geometry-aware mixing, which better handles occlusions as described in Sec. 3.3. The higher improvement of DepthMix for fewer labeled samples might arise because the SDE for DepthMix can be trained on a large set of unlabeled samples, resulting in precise depth contours over the whole (un)labeled training set. ClassMix, in contrast, uses segmentation pseudo-labels for mixing, which were only supervised on the subset of labeled samples. Therefore, on the unlabeled samples, the mixing contours can be less accurate than for DepthMix.
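For reference, the mIoU metric reported throughout this section is commonly computed from a confusion matrix, as in the sketch below. The function names and the ignore label 255 are assumptions of this example and are not taken from the evaluation code used for the reported numbers.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = target != ignore_index
    idx = num_classes * target[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """Mean intersection over union over all classes that occur."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float(np.mean(tp[present] / union[present]))
```

An improvement stated in percentage points is the plain difference between two such mIoU values expressed in percent.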
Further, we analyze the class-wise IoU for 372 labeled samples as shown in Tab. 4. Pseudo-labels generally improve the IoU through self-training. However, for the rare class motorcycle, the IoU decreases compared to the baseline. The reason for that is probably a pseudo-label drift of motorcycle towards the similar class bicycle during the self-training. Both mixing strategies mitigate the drift by a better generalization from labeled to unlabeled data through providing different contexts and occlusions during the training. The better generalization leads to less erroneous pseudo-labels and consequently to less drift. Additionally, this also results in a higher IoU for other difficult classes with a low baseline IoU such as sidewalk, wall, fence, traffic light, traffic sign, rider, truck, bus, and train. When comparing DepthMix and ClassMix, it can be seen that DepthMix improves over ClassMix for difficult classes with usually pronounced depth contours such as wall, traffic light, rider, bus, train, and motorcycle. However, there is a slight decrease in IoU for the classes sidewalk and terrain. These are classes, which can be easily confused with each other and with road. DepthMix might experience difficulties with these classes as there are usually no depth contours between them, which results in fewer mixing boundaries.
The effective occlusion handling of DepthMix can be seen in Fig. 8 a) - c) for samples from Cityscapes. It shows input images in orange and blue as well as their SDE used for mixing. The column "DepthMix Select." visualizes from which input image the regions chosen by DepthMix originate. As can be seen in Fig. 8 a), DepthMix is able to handle occlusions at multiple levels. The biker from the blue image occludes buildings from the orange image, but the blue biker is itself also partly occluded by the closer biker from the orange image. Similar cases can be seen for trees, traffic signs, and cars in Fig. 8 b) and c). The column "Mixed Image $I'$" shows the resulting image without the selection overlay. It can be seen that, due to the spatially accurate depth contours, the mixed images contain only minor mixing border artifacts and have a realistic appearance. The same is the case for the mixed segmentation, as can be seen in the column "Mixed Segm. $S'$".

However, there are also some cases in which DepthMix fails to correctly mix images according to their geometry. Examples of typical failure cases are shown in Fig. 8 d) and e). First, the SDE can be inaccurate for dynamic objects due to the violation of the static world assumption, which can cause an inaccurate structure within the mixed image. This is particularly the case if a car is driving in front of the ego car (Fig. 8 d)). However, this type of failure case is common in ClassMix and its frequency is greatly reduced with DepthMix. A remedy might be SDE extensions that incorporate the motion of dynamic objects (Casser et al., 2019; Dai et al., 2020; Klingner et al., 2020b). Second, in some cases, the SDE can be imprecise and the depth discontinuities do not appear at the same location as the class border. This can cause artifacts in the mixed image as well as in the mixed segmentation, as can be seen for the sky within the building in Fig. 8 e).
Note that the same can happen for ClassMix when the pseudo-labels, used for the mixing, do not have accurate segmentation borders.

Transfer and Multi-Task Learning
Third, we study the proposed transfer and multi-task learning of semantic segmentation and the auxiliary task of self-supervised depth estimation. For 372 (2975) samples, SDE transfer learning of the encoder and decoder (with previous ImageNet pretraining of the encoder) improves performance by +1.31 (+1.23) percentage points mIoU over the baseline with only ImageNet pretraining of the encoder. This demonstrates the usefulness of the features learned by SDE for semantic segmentation, both in the semi- and fully-supervised case. Additional regularization of the encoder with an ImageNet feature distance loss during SDE pretraining improves the performance by another +0.35 (+0.48) percentage points. Furthermore, multi-task learning in addition to transfer learning results in a performance increase of +0.45 (+0.29) percentage points.

Fig. 8: Examples of DepthMix applied to Cityscapes crops. From left to right, the source images with their SDE estimate, the mixed image $I'$ overlaid with the border of the mix mask $M$ in blue/orange depending on the adjacent source image ($i$: orange, $j$: blue), the mixed image without visual guidance $I'$, the mixed depth $D'$, and the mixed segmentation $S'$ are shown. For simplicity, the source segmentations for the mixed segmentation $S'$ originate from the ground truth labels. Rows a) - c) demonstrate the strength of DepthMix to handle occlusions, while rows d) and e) show typical failure cases.
The class-wise analysis for 372 labeled samples (see Tab. 6) shows that SDE transfer learning without ImageNet feature distance loss significantly improves the performance of classes whose segmentation borders coincide with depth discontinuities, such as fence, pole, traffic light, and traffic sign. This is possibly due to their characteristic depth profile learned during SDE. For example, a good depth estimation performance requires correctly segmenting poles or traffic signs, as missing them can cause large depth errors. However, there is a performance drop for classes that have slight semantic differences, such as truck, bus, train, and motorcycle. We hypothesize that the SDE pretraining causes the network to forget important semantic features from the ImageNet pretraining that are relevant for semantic segmentation but not for SDE. For example, for SDE it is not relevant whether an object is a bus or a train, but for semantic segmentation it is. Adding the ImageNet feature distance loss to the SDE pretraining in order to avoid forgetting these semantic features prevents the performance drop for truck, bus, and train. The additional multi-task learning further improves the performance for the small, difficult classes rider and motorcycle.

Combined Framework for SSL
Next, we combine the three contributions multi-task learning, DepthMix, and automatic data selection for annotation into a unified semi-supervised semantic segmentation framework. The first part of Tab. 7 summarizes the performance of these components from the previous sections for a better comparison. The component with the largest improvement is the automatic data selection for annotation with diversity and uncertainty sampling, with +5.11 mIoU percentage points for 372 labeled samples. However, it is not applicable to the full dataset, as there is no need for sample selection when all samples are used. The second-most effective component is DepthMix with pseudo-labeling, which also has a pronounced mIoU improvement of +5.00 (+2.06) for 372 (2975) samples. The smallest but still significant improvement comes from multi-task learning with +2.00 (+1.99) percentage points. The direct comparison of the class-wise IoU for 372 labeled samples in Tab. 8 shows that data selection mostly improves the performance of difficult classes with a low baseline IoU (e.g. wall, fence, truck, bus, and train), SDE multi-task learning improves classes with surrounding depth discontinuities (e.g. fence, pole, traffic light, traffic sign, and rider), and DepthMix improves both.
Considering that the three contributions follow different approaches and improve the performance of a different subset of classes, we further study the combination of our contributions as shown in the second part of Tab. 7 and Tab. 8. The improvement over the baseline performance is +6.21 when combining multi-task learning with data selection, +7.34 when combining DepthMix and data selection, and +7.52 (+3.40) when combining multi-task learning and DepthMix for 372 (2975) samples. In all cases, the combination is better than every single component. The class-wise analysis for 372 labeled samples in Tab. 8 reveals that the class performance of the combination usually is the highest class performance of the components. As the components perform well on different classes, this already contributes to the improved performance of the combinations. Moreover, there are some classes such as fence, traffic sign, rider, truck, bus, and train, where the performance of the combination is even higher than that of its best component. This might be due to self-reinforcing effects. For example, the improved segmentation detail at depth contours from multi-task learning is propagated into DepthMix and results in even better pseudo-label supervision for mixed samples.

The last row of Tab. 7 shows the combination of all three contributions. With an improvement of +8.87 percentage points for 372 labeled samples, it achieves the best results so far. It combines the strengths of our three contributions and significantly improves the performance for classes with depth discontinuities and for difficult classes. The largest improvement is achieved for truck, bus, and train, where the IoU is more than 50% better than the baseline.

Tab. 8 (class-wise IoU in %, values as extracted; row and column labels are not recoverable):
88 37 37 48 43 57 89 52 92 68 39 90 39 47 33 32 63
97 73 89 40 40 50 47 61 90 53 93 70 44 90 44 48 34 37 64
97 76 89 49 44 50 49 62 90 52 93 71 44 91 53 63 46 34 65
97 74 89 42 41 47 46 59 89 54 93 70 43 91 66 69 53 35 64
96 73 89 43 43 49 46 62 90 55 93 70 44 91 71 73 58 32 64
97 77 89 47 45 50 49 63 90 54 93 72 45 92 69 77 55 34 65
97 77 90 49 46 53 52 65 90 53 94 72 48 92 60 69 54 38 66
97 77 90 47 47 52 51 65 90 55 94 73 51 92 66 79 65 35 67

Comparison with State-of-the-Art SSL Methods
Next, we compare our approach with several state-of-the-art SSL approaches. The results are summarized in Tab. 9. The performance (mIoU in %) of the SSL methods and their baselines (which use the same backbone network but are only trained on the labeled dataset) is shown for different numbers of labeled samples. As the performance of the baselines differs, there are columns showing the absolute improvement for better comparability. As our baseline utilizes a more capable network architecture due to the U-Net decoder with ASPP as opposed to a DeepLabv2 decoder used by most previous works, we also reimplemented the state-of-the-art method ClassMix with our network architecture and training parameters to ensure a direct comparison. As shown in Tab. 9, our method (without data selection) outperforms all other approaches on each labeled subset size for both the absolute performance and the improvement over the baseline. The only exception is the absolute improvement of the original results of ClassMix for 100 labeled samples. However, if we consider ClassMix trained in our setting, our method outperforms it also in this case. This can be explained by the considerably higher baseline performance in our setting, which makes it more difficult to achieve a high improvement. Adding data selection further increases the performance by a significant margin, so that our method, trained with only 1/8 of the labels, even slightly outperforms the fully-supervised baseline.
Table 9 Comparison with state-of-the-art SSL semantic segmentation methods on the Cityscapes validation set (mIoU in %, standard deviation over 3 random seeds). The best results are shown in bold font and the second-best results are underlined.
To identify whether the improvement originates from access to more unlabeled data or from the effectiveness of our approach, we compare it to another baseline "ClassMix (+Video)". More specifically, we also provide all unlabeled
The adequacy of our approach is also reflected in the example predictions in Fig. 9. We can observe that the contours of classes are more precise. This is particularly the case for classes that are surrounded by depth discontinuities such as pole, traffic sign, rider, or person. Moreover, difficult objects such as bus, train, rider, or truck can be better distinguished. As discussed in Sec. 5.4, this observation is also quantitatively confirmed by the class-wise IoU improvement shown in Tab. 8. On the downside, SDE sometimes fails for cars driving directly in front of the camera (see 7th row in Fig. 9), which violate the reconstruction assumptions. Those cars are observed at the same location across the image sequence and cannot be correctly reconstructed during SDE training, even with correct depth and pose estimates. However, the network-internal differentiation between moving and non-moving cars does not hinder the transfer of SDE-learned features to semantic segmentation but can cause problems with DepthMix (see Sec. 5.2).

Learning SDE and Semantic Segmentation on Different Datasets
In this section, we show that the unlabeled image sequences and the labeled segmentations can also originate from different datasets within similar visual domains. For that purpose, we train the SDE on Cityscapes sequences and learn the semi-supervised semantic segmentation on the CamVid dataset (Brostow et al., 2009). As we assume in this scenario that there are no image sequences available for SDE training on CamVid, we only apply transfer learning but no multi-task learning.
Tab. 10 shows that the results on CamVid are similar to our main results on Cityscapes. For 50/100/367 labeled training samples, our method improves the mIoU by +9.0/+6.5/+3.3 percentage points. Overall, our proposed method significantly outperforms ClassMix by +2.3 percentage points for 50 labeled samples and +2.1 percentage points for 100 labeled samples.

Component Study for SSDA
We study the components of the SSDA framework described in Sec. 3.5 on the commonly used benchmark GTA5 → Cityscapes, where the synthetic source training samples originate from the GTA5 dataset (Richter et al., 2016) and the real target training samples are obtained from Cityscapes (Cordts et al., 2016). After the training, the network is evaluated on the target validation samples from the Cityscapes validation set. First, we analyze our contributions from SSL in an SSDA setting by naively adding the additional source samples to the training according to Eq. 18. The remaining framework is the same as in the previous experiments.
The first part of Tab. 11 shows the results using the SSL framework without source domain supervision, while the second part shows the results for the framework with additional semantic segmentation supervision from the source domain according to Eq. 18.
For 100 labeled samples from the target domain, Tab. 11 shows that additional source domain supervision improves the performance of the baseline by +5.08 percentage points. As can be seen in Tab. 12, this is mainly due to improvements for classes with a low baseline performance such as wall, fence, traffic light, rider, truck, bus, and motorcycle. However, additional source domain supervision deteriorates the performance for the classes sidewalk, terrain, and bicycle, which are easy to confuse and have a considerable domain gap. Our proposed methods from SSL also improve the performance in the SSDA setting, as shown in the second part of Tab. 11. For multi-task learning, the gain is +2.37 percentage points with the same performance pattern of the class-wise IoU. For DepthMix, the improvement is +6.22, while it also effectively counters the performance drop (from Baseline to SD) for the classes road, sidewalk, terrain, and bicycle (see Tab. 12). For automatic data selection, the improvement by additional source data is +1.09. When combining the three contributions, the performance gain over the baseline with source supervision is +10.71. This is +2.45 percentage points better than our method for SSL.
For 500 labeled samples from the target domain, additional source domain supervision decreases the performance of the baseline by -0.67 percentage points (see Tab. 11). This shows that additional source supervision is not helpful in this case, probably because there is already decent supervision on the target domain and naively adding the source domain loss cannot close the domain gap. Even in this setting, multi-task learning / DepthMix / data selection can still improve the performance by +1.47 / +5.2 / +0.98 over the baseline with source supervision. When combined, their performance gain is +7.64. This is +0.88 percentage points better than our method for SSL.
Next, we analyze our contributions tailored to overcome the domain gap of SSDA: Cross-Domain DepthMix (see Sec. 3.5.2) and Matching Geometry Sampling (see Sec. 3.5.3). Tab. 13 shows that both Cross-Domain DepthMix (CDM) and Target-Domain DepthMix (TDM) significantly outperform the baseline. As shown in Tab. 14, this is due to an improved performance for difficult classes such as sidewalk, wall, traffic sign, terrain, rider, truck, train, and motorcycle. Because DepthMix presents these objects with different backgrounds and occlusions, the network learns to generalize better within the target domain (for TDM) or across domains (for CDM). When comparing the performance of CDM and TDM (see Tab. 13), it can be seen that CDM works better for 100 labeled target samples and TDM works better for 500. On the one hand, CDM can exploit the labeled source data to propagate its knowledge to the target data through mixing. This is especially useful if there are only a few labeled target samples available and most supervision comes from the source domain. On the other hand, TDM can use the already labeled target samples to propagate their knowledge to the unlabeled target samples through mixing, without being impeded by a domain gap. This is most effective when there are sufficient labels from the target domain available.
Based on this observation, we conclude that it might be useful to combine CDM and TDM to align labeled source and target samples as well as labeled target and unlabeled target samples. Tab. 13 shows that CDM+TDM indeed improves the performance over only CDM and only TDM by +0.70 (+0.79) for 100 (500) labeled target samples due to an improved performance for the classes sidewalk, wall, fence, traffic sign, terrain, and train.
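The mixing step shared by CDM and TDM can be illustrated with a minimal sketch. It assumes that a pixel is taken from the first sample wherever its estimated depth is smaller than that of the second sample, so that closer surfaces occlude farther ones. The function name `depthmix`, the tolerance `eps`, and the NumPy array layout are illustrative assumptions and not the authors' implementation. The exact mask definition is given in the method section.

```python
import numpy as np


def depthmix(img_a, img_b, lbl_a, lbl_b, depth_a, depth_b, eps=0.0):
    """Mix two samples so that the closer surface stays visible at each pixel.

    img_*:   (H, W, 3) images
    lbl_*:   (H, W) label or pseudo-label maps
    depth_*: (H, W) estimated depths (smaller value = closer to the camera)
    eps:     hypothetical tolerance on the depth comparison (assumption)
    """
    # True where sample a is closer to the camera than sample b.
    mask = depth_a < depth_b + eps
    # Take image pixels and labels from a inside the mask, from b elsewhere.
    mixed_img = np.where(mask[..., None], img_a, img_b)
    mixed_lbl = np.where(mask, lbl_a, lbl_b)
    return mixed_img, mixed_lbl, mask
```

In this sketch, the function itself is agnostic to the domain of its inputs. Under the description above, CDM would pass a labeled source sample together with a target sample, while TDM would pass a labeled target sample together with an unlabeled (pseudo-labeled) target sample.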
To further improve the Cross-Domain DepthMix, we apply the proposed Matching Geometry Sampling to overcome the geometric domain gap of source and target domain and to better align the geometric distribution of the mixed samples to the geometric target distribution as discussed in Sec. 3.5.3. Tab. 13 shows that it improves the mIoU by +1.65 (+0.16) percentage points for 100 (500) labeled target samples. The geometry and view alignment is probably more important for fewer labeled target samples because it is more difficult to bridge the geometric domain gap. For 100 labeled samples, the improvement mainly originates from difficult vehicles such as truck, bus, and motorcycle (see Tab. 14).
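The idea of pairing each target image with a geometrically similar source image can be sketched as a nearest-neighbor lookup over depth maps. This is only an illustration: the mean absolute depth difference used here as the similarity measure, the function name `match_geometry`, and the assumption that all depth maps share one resolution are not taken from the paper, whose actual criterion is defined in Sec. 3.5.3.

```python
import numpy as np


def match_geometry(target_depth, source_depths):
    """Return the index of the source depth map closest to the target.

    target_depth:  (H, W) depth estimate of one target image
    source_depths: (N, H, W) depth maps of N candidate source images
    The mean absolute difference is an assumed, illustrative distance.
    """
    target = np.asarray(target_depth, dtype=float)
    sources = np.asarray(source_depths, dtype=float)
    # One distance value per source sample.
    dist = np.abs(sources - target).reshape(len(sources), -1).mean(axis=1)
    return int(np.argmin(dist)), dist
```

A stochastic variant could instead sample a source index with a probability that decreases with `dist`, which would keep some diversity in the mixed pairs.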
When combining the domain-adaptive strategies (combined CDM+TDM and Matching Geometry Sampling) with the previous contributions from SSL, the SSDA performance can be further improved by +3.01 (+2.74) percentage points for 100 (500) labeled target samples (see Tab. 13). Overall, our contributions sum up to +17.26 (+8.22) percentage points improvement over the baseline using only target supervision and +12.18 (+8.89) percentage points improvement over the baseline with target and source supervision. In particular, the performance of truck, bus, and train is increased by more than 50% as shown in Tab. 14.

Comparison with State-of-the-Art SSDA Methods
Finally, we compare our framework with other state-of-the-art SSDA methods on the benchmarks Synthia → Cityscapes (Tab. 15) and GTA → Cityscapes (Tab. 16). For each method, its baseline performance is provided because the methods differ in their architecture and labeled subset. For better comparability between the architectures, we show the relative performance in % with respect to the fully-supervised baseline. As the previous SSDA methods did not publish their implementation, labeled subset, or variance over the subset selection, we adapted the state-of-the-art UDA method DACS to our framework for a fair comparison with a competitive method.
Considering the mIoU and the relative performance with respect to the fully-supervised baseline, our method noticeably outperforms the competitors for 100, 200, and 500 labeled target samples on both benchmarks. Only in the fully-supervised case, Chen et al. (2021a) achieve slightly better results. Moreover, these statements still hold even if we remove the data selection for annotation from our method.
We would like to highlight that our method achieves 97.4% (GTA → Cityscapes) and 98.7% (Synthia → Cityscapes) of the fully-supervised baseline performance with only about 1/30 (100) of the target labels. With about 1/15 of the target labels, it even reaches the fully-supervised baseline performance. The improved performance for 100 labeled target samples can also be observed in Fig. 10, where our method better distinguishes difficult classes such as truck, bus, and train and produces more detailed segmentation contours for classes such as pole, traffic sign, and rider.

Conclusions
In this work, we have studied how self-supervised depth estimation (SDE) can be utilized to improve semantic segmentation in the single-domain semi-supervised and the domain-adaptive semi-supervised setting.
We introduce four effective strategies capable of leveraging the knowledge learned from SDE. First, we present an automatic algorithm for selecting data for annotation based on SDE, which does not require human-in-the-loop annotations and, therefore, increases flexibility, efficiency, and scalability. By combining diversity sampling based on features from self-supervised depth estimation and uncertainty sampling based on the depth student error, our method significantly outperforms random data selection and even entropy-based active learning, which requires a human in the loop. We show that without knowledge of the class labels, our data selection for annotation prefers samples that contain difficult or rare classes (e.g. rider, truck, bus, and train). This results in a significantly higher semantic segmentation performance for these classes.
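The diversity-sampling part of the data selection can be illustrated with a greedy farthest-point selection over per-image feature vectors. This is a minimal sketch under assumptions: the function name, the use of the Euclidean distance, and the choice of the first sample are hypothetical, and the uncertainty sampling based on the depth student error is not included. It is not the authors' implementation.

```python
import numpy as np


def farthest_point_selection(features, n_select, first=0):
    """Greedily pick n_select samples whose features are far apart.

    features: (N, D) array, one feature vector per unlabeled image
              (e.g. pooled SDE features; the exact features are an assumption)
    n_select: number of samples to select, expected to be <= N
    first:    index of the initial sample (hypothetical default)
    """
    features = np.asarray(features, dtype=float)
    selected = [first]
    # Distance of every sample to its nearest already-selected sample.
    min_dist = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < n_select:
        # The next pick is the sample farthest from the current selection.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected
```

Because each pick maximizes the distance to the samples already chosen, near-duplicate frames from the same sequence tend to be skipped, which matches the motivation of selecting diverse samples without access to class labels.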
Second, we demonstrate that the proposed DepthMix strategy outperforms related mixing strategies by avoiding an inconsistent geometry of the generated images. We show that DepthMix effectively improves the performance for classes with a low baseline performance such as wall, fence, traffic light, rider, truck, bus, and train. We assume that DepthMix improves generalization by presenting labeled and pseudo-labeled instances with different backgrounds and occlusions.
Third, we show that the feature representation from self-supervised depth estimation can be transferred to semantic segmentation, by means of SDE pretraining and multi-task learning of semantic segmentation and SDE. This is particularly effective for difficult classes surrounded by depth discontinuities such as wall, fence, pole, traffic light, traffic sign, rider, truck, and motorcycle. By using an ImageNet feature distance loss during the SDE pretraining, we mitigate forgetting useful semantic features from ImageNet pretraining and avoid the resulting performance drop for semantically similar classes such as truck, bus, train, and motorcycle.
And fourth, we show the effectiveness of combined Cross-Domain and Target-Domain DepthMix as well as Matching Geometry Sampling in a semi-supervised domain adaptation setting. The former effectively aligns source and target data as well as labeled target and unlabeled data to generate high-quality pseudo-labels for unlabeled target data. The latter samples source images with a similar scene geometry and camera pose with respect to target images to produce more realistic Cross-Domain DepthMix images.
A combination of the first three contributions in a single-domain semi-supervised framework can achieve even higher performance gains than the single components as the approaches address different aspects of the learning process. By using these SDE-based contributions, our approach results in state-of-the-art performance for semi-supervised semantic segmentation. Our method achieves 92% of the fully-supervised baseline performance with only 1/30 of the available labels and even slightly outperforms it with only 1/8 of the labels.
A combination of all four contributions in a semi-supervised domain adaptation framework improves the performance even further and outperforms previous state-of-the-art semi-supervised domain adaptation methods. On GTA → Cityscapes, our method achieves even 97% of the fully-supervised baseline performance with only 1/30 of the target labels. This roughly corresponds to only 150 working hours for data annotation for the target domain instead of 4460 working hours.
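The annotation-effort figures can be checked with a short calculation. It uses the average annotation time of 1.5 hours per Cityscapes image (Cordts et al., 2016), the 2975 labeled training images, and the 100 labeled target samples that correspond to about 1/30 of the labels. The helper name `annotation_hours` is only illustrative.

```python
HOURS_PER_IMAGE = 1.5   # average Cityscapes annotation time per image
FULL_TRAIN_SET = 2975   # labeled Cityscapes training images
LABELED_SUBSET = 100    # about 1/30 of the training labels


def annotation_hours(n_images, hours_per_image=HOURS_PER_IMAGE):
    """Total annotation time in hours for n_images."""
    return n_images * hours_per_image


full_hours = annotation_hours(FULL_TRAIN_SET)      # 4462.5, i.e. roughly 4460
subset_hours = annotation_hours(LABELED_SUBSET)    # 150.0
label_fraction = LABELED_SUBSET / FULL_TRAIN_SET   # roughly 1/30
```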
All in all, our findings suggest that SDE can be a valuable source of self-supervision for semantic segmentation, improving the semantic segmentation performance and reducing the number of necessary annotations.
Table 15 Comparison with other SSDA methods for GTA → Cityscapes. The mIoU in % on the Cityscapes validation set is shown for a different number of labeled target samples. Mean and standard deviation are aggregated over 3 random seeds. Additionally, the relative performance (Rel.) in % with respect to the fully-supervised baseline is shown. The best results are shown in bold font and the second-best results are underlined.