Class-conditional domain adaptation for semantic segmentation

Semantic segmentation is an important sub-task for many applications. However, pixel-level ground-truth labeling is costly, and there is a tendency to overfit to training data, thereby limiting the generalization ability. Unsupervised domain adaptation can potentially address these problems by allowing systems trained on labeled datasets from a source domain (including less expensive synthetic domains) to be adapted to a novel target domain. The conventional approach involves automatic extraction and alignment of the representations of the source and target domains globally. One limitation of this approach is that it tends to neglect the differences between classes: representations of certain classes can be more easily extracted and aligned between the source and target domains than others, limiting adaptation over all classes. Here, we address this problem by introducing a Class-Conditional Domain Adaptation (CCDA) method. It incorporates a class-conditional multi-scale discriminator and class-conditional losses for both segmentation and adaptation. Together, they measure the segmentation, shift the domain in a class-conditional manner, and equalize the loss over classes. Experimental results demonstrate that the performance of our CCDA method matches, and in some cases surpasses, that of state-of-the-art methods.


Introduction
Semantic segmentation is an important visual scene-understanding task with a wide range of applications, particularly in autonomous and assisted vehicle systems [1]. Recent deep network approaches (e.g., Refs. [2-4]) have achieved impressive results, but they require large training datasets with precise pixel-level ground-truth annotations. This may also lead to poor generalization ability due to the large domain shifts in appearance, viewpoint, and lighting between the source training and target testing domains [5].
These issues can potentially be addressed using unsupervised domain adaptation, which attempts to identify and correct for a shift in the appearance of visual input between different domains. This is achieved by training a semantic segmentation model with a large number of synthetic source domain images that do not perfectly represent the appearance of real scenes but have easily obtainable ground-truth labels, as well as real-world target domain images whose ground-truth labels remain unknown. Therefore, a successful domain adaptation method will not only improve generalization but also avoid time-consuming pixel-level multi-class annotation of real-world scenes.
A common approach to solve the "domain shift" problem for deep network systems is to modify the weights of the network to render representations of target domain images more similar to the representations of source domain images. By minimizing the distance between the distributions of certain representations in both domains, a well-generalized model can be obtained. Some existing works have focused on representations in the prediction space [6,7], while others have focused on representations in feature (latent) space [8,9]. Representational dissimilarity can be assessed using correlation distances [10] or maximum mean discrepancy [11]. However, recent studies have focused on generative adversarial methods [12] for unsupervised domain adaptation. This adversarial principle has become prominent since it achieved promising results in pixel-level prediction tasks [6,13].
One limitation of previous studies on unsupervised domain adaptation for semantic segmentation is that they tend to measure feature extraction and alignment globally while ignoring the influence of different classes [14,15]. The ease with which representations of a class can be extracted and aligned is affected by its occurrence frequency and by its appearance similarity between domains. In practice, representations of higher-frequency classes are more easily extracted for segmentation, and representations of classes with greater cross-domain appearance similarity are more easily adapted. Therefore, with a global segmentation prediction measurement, the network may fail to extract meaningful feature representations for some classes. In addition, global alignment of representations may leave the representations of some classes incompletely adapted during training, or cause the representations of classes that are already well aligned to be mapped to incorrect classes.
To address the above issues, we propose a novel Class-Conditional Domain Adaptation (CCDA) method, which considers both adaptation and segmentation in a class-conditional manner. It comprises a class-conditional multi-scale discriminator and class-conditional loss functions for both segmentation and adaptation. Our class-conditional multi-scale discriminator encourages the network to align feature-level representations in a class-wise manner on both fine (pixel-level) and coarse (patch-level) spatial scales. For the coarse-scale branch, class-conditional adaptation is considered flexibly by requiring the discriminator to retain semantic information within each patch. It allows the adaptation of each class to be measured separately without neglecting any class. For the fine-scale branch, the class-conditional adaptation loss is equalized over classes to ensure that equal attention is paid to the alignment of each class. Moreover, the design of the class-conditional segmentation loss function helps the network fairly evaluate the segmentation performance on each class.
In summary, our proposed CCDA approach comprises three novel contributions:
• We propose a novel class-conditional multi-scale discriminator, which allows adaptation to be learnt in a class-wise manner.
• By equalizing class-conditional losses over classes for both segmentation and adaptation, the CCDA system pays equal attention to different classes.
• Experimental results demonstrate that the observed performance matches, and in some cases surpasses, that of state-of-the-art algorithms on several domain adaptation scenarios.

Domain adaptation:
Research on domain adaptation for image classification has been conducted for many years, with a focus on solving the "domain shift" problem between different datasets at the level of image representations. In the early stages, traditional distance-minimization methods were proposed to reduce the distance between image representations from the source and target domains. For example, Ref. [16] used the maximum mean discrepancy (MMD) loss, and Ref. [17] applied coral loss. With the development of generative adversarial networks, many recent studies have achieved domain adaptation by minimizing the distance between representations using generative adversarial methods, which achieve better performance [18,19].

Domain adaptation for semantic segmentation:
Although substantial progress has been made in domain adaptation for image classification, pixel-level tasks are more challenging because of their direct dependence on local appearance. Nevertheless, increasing activity in autonomous vehicle applications has driven interest in domain adaptation for pixel-level segmentation of road scenes [8,20]. Currently, the most popular approach to domain adaptation for pixel-level segmentation relies on adversarial learning, which is widely used for image generation [12,21] and translation [22-24].
For domain adaptation, adversarial learning employs a discriminator on the segmentation network to align the source and target representations at either the prediction level [6,14] or the feature level [8,9,20]. Tsai et al. [6] employed an adversarial network to align pixel-level representations for adaptation. Vu et al. [14] then employed an indirect entropy minimization technique to improve the prediction-level adaptation. Luo et al. [9] used an information bottleneck to help remove task-independent information from feature-level representations during adaptation. Shan et al. [15] fused multi-level features for both segmentation and adaptation to allow gradients to flow into low-level CNN layers along a shorter path. Zhou et al. [13] performed domain adaptation on the affinity relationship between adjacent pixels to leverage co-occurring patterns during adaptation.
In addition, the self-training approach can select pseudo ground-truth labels for target domain images to help supervise adaptation and improve performance [25-28]. The source-free method [29,30] focuses on adaptation using only a well-trained source model and unlabeled target domain data. Additional techniques, such as image translation, can be combined with representation adaptation methods. Image translation focuses on narrowing the domain shift between the source and target domains at the input image level by generating translated source domain images with target styles [27,31]. However, most existing domain adaptation methods for semantic segmentation tend to measure segmentation and adaptation globally, while ignoring the influence of different classes, which may affect their performance.
Region-wise/class-wise domain adaptation: Adversarial approaches such as Refs. [32,33] aim to boost the domain adaptation performance for different classes or regions of the image. This suggests that a region- or class-wise domain adaptation approach is required to achieve good adaptation across all classes. Luo et al. [32] applied a co-training strategy to increase the weight of adaptation for poorly-aligned regions with inconsistent semantic predictions. Yang et al. [34] iteratively perturbed the intermediate feature maps with several attack objectives, which helps treat the information at each position evenly during adaptation. Tsai et al. [7] clustered patches based on spatial patterns and used cluster information as a guide to achieve better adaptation for each patch. However, these methods are still unable to equally and separately measure the segmentation and adaptation of each class, which may still limit their performance.
To achieve explicit class-wise adaptation, Chen et al. [8] applied 19 sub-discriminators during training, where each sub-discriminator is specially trained to measure the alignment of one class. Du et al. [33] further improved this framework by separating an entire feature into 19 sub-features based on the pseudo class label and inputting each sub-feature into the corresponding sub-discriminator for independent class-wise adaptation. Because the memory of the sub-discriminators varies linearly with the number of classes, it is less efficient to apply 19 sub-discriminators during training, and this may not be flexible when applied to datasets with more classes.
Here, we propose our CCDA method, which is a more holistic solution that entails one class-conditional multi-scale discriminator and class-conditional loss functions for both segmentation and adaptation. Using the class-conditional multi-scale discriminator, we allow the adaptation to be learnt in a class-wise manner. Equalizing the loss over classes for both segmentation and adaptation also helps pay equal attention to all classes. Meanwhile, by forcing the discriminator to maintain the semantic information while adversarially aligning the distributions between domains for each class, we avoid using multiple sub-discriminators that dedicate one sub-discriminator to each class. Compared with the framework proposed by Du et al., our method is more efficient and flexible because we require only a single discriminator and still manage to separately and equally measure the adaptation of each class.

Methods
First, we describe the basic domain adaptation framework for pixel-level semantic segmentation. Next, we explain the innovations of our CCDA system in detail. It comprises two major components: class-conditional domain adaptation and class-conditional segmentation. In Section 3.2, we describe our class-conditional multi-scale discriminator, which contains both fine- and coarse-scale branches. In Section 3.3, we describe our class-conditional segmentation part. Figure 1 shows an overview of our CCDA system.

Basic domain adaptation architecture
We employ an adversarial learning approach to achieve unsupervised domain adaptation for semantic segmentation. The basic structure comprises a feature encoder E, a segmentation decoder S, and a discriminator D. Our goal is to train the feature encoder E and segmentation decoder S to output a good pixel-level segmentation prediction P_t ∈ R^{C×H×W} for the target domain image. This is achieved through two processes: training E and S to output a good segmentation prediction P_s ∈ R^{C×H×W} for the source image I_s with the associated label Y_s, and using the discriminator D to align the feature-level representations F_s and F_t output by the feature encoder E for the two domains.
The first process (segmentation) is trained by minimizing the segmentation cross-entropy loss in Eq. (1):

L_seg(P_s, Y_s) = -Σ_{h,w} Σ_{c=1}^{C} Y_s[c, h, w] log(P_s[c, h, w])    (1)

where (h, w) denotes the pixel position and c ∈ {1, 2, ..., C} indexes the classes. The second process (adaptation) is trained by minimizing the discriminator loss in Eq. (2):

L_d = L_bce(D(F_s), 0) + L_bce(D(F_t), 1)    (2)

where L_bce is the binary cross-entropy domain classification loss. The output channel of this basic discriminator D is 1 because it discriminates two classes (source and target domains). The source and target domain samples are assigned labels of 0 and 1, respectively. Normally, D(F) outputs a prediction that retains the resolution of the input feature representation instead of a single value at the global image level. Therefore, adaptation can be measured more precisely by averaging the loss over all positions in the input feature.
Concurrently, the feature encoder E attempts to confuse D by minimizing the adversarial loss in Eq. (3):

L_adv = L_bce(D(F_t), 0)    (3)

This basic structure extracts the semantic representations and aligns the features globally among all classes. It does not consider that different classes may have different influences on the segmentation and adaptation. This tends to cause predictions on some lower-frequency classes to contribute little to the cross-entropy loss, or representations of some classes to remain incompletely adapted owing to dissimilar appearance between domains. It could also disrupt existing alignment and cause regions belonging to classes that have already been well adapted to be mistakenly adapted to other classes. Meanwhile, because the feature map computed by the feature encoder E is spatiotopic but its resolution is reduced relative to the input image, the alignment achieved by this process is at a specific intermediate scale of the feature map, which may not capture the domain shift at smaller or larger scales. These observations motivate our class-conditional multi-scale discriminator and class-conditional segmentation.
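As a concrete illustration, the two training signals of this basic structure can be sketched as follows. This is a minimal NumPy sketch of the loss logic only (the actual system is implemented in PyTorch with learned networks E, S, and D, and the function names here are ours, not the authors'):

```python
import numpy as np

def bce(pred, label):
    # Binary cross-entropy between a sigmoid prediction map and a constant
    # domain label (0 = source, 1 = target), averaged over all positions.
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(label * np.log(pred) + (1 - label) * np.log(1 - pred))

def discriminator_loss(d_src, d_tgt):
    # Eq. (2): the discriminator learns to label source features 0 and
    # target features 1.
    return bce(d_src, 0.0) + bce(d_tgt, 1.0)

def adversarial_loss(d_tgt):
    # Eq. (3): the encoder tries to make target features look like source
    # (i.e., drive the discriminator's output toward the source label 0).
    return bce(d_tgt, 0.0)
```

The loss decreases for the encoder exactly when the discriminator's target-domain predictions approach the source label, which is the adversarial confusion described above.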

Class-conditional multi-scale discriminator
Our proposed class-conditional multi-scale discriminator is composed of fine- and coarse-scale branches (Fig. 1). The fine-scale branch measures alignment at the pixel level based on the basic architecture with modified loss functions. It captures spatially detailed domain shift phenomena in a class-wise manner. The coarse-scale branch measures the class-conditional alignment at the patch level, which is coarser than the feature scale, with equalized class information. First, we describe how to perform this class conditioning by explaining the design of the coarse-scale class label. Then, we elaborate on the structure of the class-conditional coarse-scale discriminator branch as well as the fine-scale branch.

Coarse-scale class label
We define a coarse-scale binary class label W with length C that indicates the presence or absence of each class within a rectangular patch of the image. It should be noted that a patch may contain multiple classes. For the source image, W_s is computed by analyzing the pixel-level ground-truth label Y_s within the image back-projection of a patch. For a patch at position (j, k), if any pixel within the back-projected region of the image has class c, we set W_s[c, j, k] = 1; otherwise, W_s[c, j, k] = 0. For the target domain image, we do not have a ground-truth label. Instead, we assign the coarse-scale class label based on the projected pixel-level prediction P_t of our segmentation module S for the patch. In particular, for the patch at position (j, k), given a confidence threshold th_w, if any pixel (h, w) within the back-projected region of the image satisfies P_t[c, h, w] > th_w, we set W_t[c, j, k] = 1; otherwise, W_t[c, j, k] = 0. Note that binarizing the patch-based class label W equalizes the class information at the patch level: W[c, j, k] = 1 if the patch (j, k) contains any pixels of class c, regardless of the number. This has the benefit of maintaining semantic information in our discriminator without neglecting any classes in a patch. It also applies equal attention to all the classes that a patch contains and boosts the adaptation performance in a class-wise manner.
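The patch-level label construction described above can be sketched as follows, assuming non-overlapping patches whose back-projected regions tile the image (the 64-pixel patch size matches the M = H/64 × W/64 layout reported in the implementation details). This is an illustrative NumPy sketch, not the authors' code:

```python
import numpy as np

def coarse_labels_source(y, num_classes, patch=64):
    # y: (H, W) integer ground-truth map -> W_s: (C, H/patch, W/patch) binary.
    H, W = y.shape
    J, K = H // patch, W // patch
    w = np.zeros((num_classes, J, K), dtype=np.float32)
    for j in range(J):
        for k in range(K):
            region = y[j * patch:(j + 1) * patch, k * patch:(k + 1) * patch]
            # Mark every class present in the patch, regardless of pixel count.
            w[np.unique(region), j, k] = 1.0
    return w

def coarse_labels_target(p, th_w=0.4, patch=64):
    # p: (C, H, W) softmax prediction -> binary presence label: class c is
    # marked present if any pixel in the patch exceeds the confidence th_w.
    C, H, W = p.shape
    blocks = p.reshape(C, H // patch, patch, W // patch, patch)
    return (blocks.max(axis=(2, 4)) > th_w).astype(np.float32)
```

The binarization step is what equalizes the class information: a class occupying one pixel of a patch receives the same label value as a class occupying the whole patch.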

Class-conditional coarse-scale branch
In standard feature-level domain adaptation, the discriminator output for each patch indicates the domain of the entire patch (in our case, 0 for the source domain and 1 for the target domain). To condition the adaptation on class, our coarse-scale branch instead outputs two C-dimensional vectors, O_s and O_t, for each patch. The advantage of this two-vector representation is that it allows us to multiplex both domain and class information, informing both an adversarial adaptation loss based on the class and a non-adversarial classification loss (Fig. 2). In particular, to determine the non-adversarial classification loss, we form the vector O_c = σ(O_s + O_t), where the sigmoid function σ(·) is applied separately for each class. Therefore, O_c[c, j, k] estimates the probability that the patch (j, k) contains one or more pixels drawn from class c. We calculate the classification loss for both domains using a binary cross-entropy loss L_bce(O_c, W) averaged over all classes, because each patch can contain multiple classes. Including this classification loss in the discriminator encourages the feature-level domain alignment to preserve the segmentation class information for patches from both the source and target domain images. Because of the binary nature of the coarse-scale class label vector W, we prevent the discriminator from neglecting any classes with a small number of pixels in a patch.
To obtain the adversarial adaptation loss, we form a C × 2 matrix O_st = f([O_s, O_t]) for each patch, where f(·) is the softmax operation over the two channels (the first C × 1 column is for the source, and the second is for the target). Therefore, for patch (j, k), O_st[c, 1, j, k] represents the probability that the pixels of class c in this patch belong to the source domain, and O_st[c, 2, j, k] represents the probability that they belong to the target domain. In other words, O_st indicates, for each class, the probability that any pixels of this class present in a patch are drawn from the source versus the target domain. Therefore, it allows the adaptation of each class that occurs within a patch to be measured separately with only one discriminator.
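A sketch of how the two per-patch vectors O_s and O_t are multiplexed into the classification head O_c and the adaptation head O_st, following the definitions above (NumPy, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiplex_outputs(o_s, o_t):
    # o_s, o_t: (C, J, K) raw discriminator logits for each patch.
    # Classification head: probability that class c occurs in patch (j, k).
    o_c = sigmoid(o_s + o_t)
    # Adaptation head: per-class source-vs-target probability, obtained by
    # a softmax over the 2 domain channels -> shape (C, 2, J, K).
    o_st = softmax(np.stack([o_s, o_t], axis=1), axis=1)
    return o_c, o_st
```

Because the softmax is taken over the two domain channels independently for each class, the domain decision for one class in a patch does not suppress the decision for another class in the same patch.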
To form the final class-wise discriminator domain adaptation loss for the coarse-scale branch, we average the class-conditional loss over the classes present in a patch. This means weighting the sum of the losses using the ground-truth patch-level class label W_s for the source domain and the predicted patch-level class prediction O_c_t (O_c for the target sample) for the target domain, and then dividing by the sum of the weights. The total patch-level discriminator loss combines this weighted adversarial term with the non-adversarial classification loss. When training the segmentation network, we instead expect it to predict O_st_s[c, 2, j, k] = 1 for the source domain and O_st_t[c, 1, j, k] = 1 for the target domain, to confuse the discriminator.

Class-conditional fine-scale discriminator
Coarse-scale class-conditional adaptation can capture larger-scale domain shift effects but may not capture shifts in finer detail. Thus, we employ a class-conditional fine-scale discriminator operating at the pixel level. This fine-scale branch operates at the resolution of the feature representations, and its output is upsampled to produce fine-scale domain classifications U_s ∈ R^{1×H×W} for I_s and U_t ∈ R^{1×H×W} for I_t that match the original input size. For this fine-scale discriminator branch, we do not need to retain semantic information, but we still evaluate the performance of the adaptation in a class-wise manner using a designed class-conditional loss that equalizes the performance among all classes.
For the source domain, we employ the ground-truth class label Y_s to calculate the loss for each class and average equally over the classes to form a class-conditional binary cross-entropy loss, where C* is the number of classes present in the source image. The ground-truth domain label l_d is set to l_d = 0 when training the discriminator D and l_d = 1 when training the encoder E and segmentation decoder S, to confuse the discriminator. A small constant ε prevents division by zero for classes that do not appear in the ground truth within an image.
For the target domain image, we do not have a ground-truth class label; therefore, we employ the pixel-level class prediction P_t instead to form a pseudo label Ŷ_t by selecting the class with the highest prediction value: Ŷ_t[h, w] = argmax_c P_t[c, h, w]. For some pixels, P_t may have low entropy, which means that the network is confident and the pseudo labels on these pixels may be a good estimate of the ground-truth class. For other pixels, P_t may have high entropy, which can be considered a sign that the domain shift may be interfering with classification. Thus, the adaptation loss for pixels with uncertain predictions can be upweighted to improve adaptation to the domain shift. In particular, we designate these ambiguous pixels using the label A_t ∈ R^{1×H×W}, setting A_t[h, w] = 1 when the highest prediction value at pixel (h, w) falls below the threshold constant th_a for selecting the uncertain pixels, and A_t[h, w] = 0 otherwise. We then add a term to the fine-scale domain adaptation loss that serves to upweight these regions during feature alignment.
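The pseudo-label and ambiguity-mask construction can be sketched directly (NumPy; we interpret "uncertain" as the top prediction value falling below th_a, consistent with th_a = 0.95 in the implementation details, though the paper's exact criterion may be phrased in terms of entropy):

```python
import numpy as np

def pseudo_label_and_ambiguity(p_t, th_a=0.95):
    # p_t: (C, H, W) softmax prediction for a target image.
    y_hat = p_t.argmax(axis=0)               # pseudo label: most likely class
    conf = p_t.max(axis=0)                   # per-pixel confidence
    a_t = (conf < th_a).astype(np.float32)   # 1 marks uncertain pixels to upweight
    return y_hat, a_t
```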
Thus, the final class-conditional binary cross-entropy loss for the target domain images takes the analogous form, where C* denotes the number of classes present in the target domain image, as predicted by Ŷ_t. It uses l_d = 1 to train D and l_d = 0 to train E and S, to confuse the discriminator.
The fine-scale class-conditional discriminator loss for both the source and target domain images is then the sum of these source and target terms. The generative component of the adversarial fine-scale loss, trained on the feature encoder E and segmentation decoder S, is defined symmetrically. For stability, we blend these class-conditional fine-scale losses with the conventional losses defined in Eqs. (2) and (3) to obtain the adaptation loss for the fine-scale branch, where β is a weight to combine the losses. Therefore, the overall discriminator and adversarial losses of our class-conditional multi-scale discriminator combine the losses from both the fine-scale and coarse-scale branches:

L_adv_all = L_adv_fine + L_adv_coarse    (15)

Class-conditional segmentation loss
The conventional loss employed for pixel-level semantic segmentation is pixel-level cross-entropy loss.
This means that the segmentation predictions are measured globally over the entire image, regardless of the class information. However, this method has a drawback: some classes tend to be less frequent in the datasets or have objects of smaller size at the pixel level, and therefore do not contribute substantially to the loss function. This conventional global segmentation loss has an additional consequence for the domain adaptation system: the system may never learn how to align representations across domains for these classes.
Here, we introduce a modified class-conditional loss for segmentation that serves to distribute the loss more evenly across classes. Specifically, we employ the concept of dice loss [35] to train the segmentation network. Dice loss is widely used in medical image segmentation [36,37] and has the form in Eq. (16):

L_dice = 1 - (1/C) Σ_c (2 Σ_{h,w} P[c, h, w] Y[c, h, w] + ε) / (Σ_{h,w} P[c, h, w] + Σ_{h,w} Y[c, h, w] + ε)    (16)

Note that the loss is similar in spirit to intersection-over-union and can equalize the contribution of each class in measuring segmentation performance. Here, ε is a small constant that prevents division by zero for classes that appear in neither the ground truth nor the prediction.
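A minimal NumPy sketch of this class-equalized dice loss (illustrative; the actual training code would operate on PyTorch tensors with gradients):

```python
import numpy as np

def dice_loss(p, y_onehot, eps=1e-5):
    # p: (C, H, W) softmax prediction; y_onehot: (C, H, W) one-hot ground truth.
    inter = (p * y_onehot).sum(axis=(1, 2))
    denom = p.sum(axis=(1, 2)) + y_onehot.sum(axis=(1, 2))
    dice = (2 * inter + eps) / (denom + eps)  # per-class dice score in [0, 1]
    return 1.0 - dice.mean()                  # equal weight for every class
```

Because each class's dice score is normalized by that class's own pixel mass before averaging, a rare class contributes as much to the loss as a frequent one.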
Dice loss measures segmentation class by class, which tends to increase the weight of the loss on rare classes. However, this may also introduce instability during training. Therefore, we employ a combination of the dice loss and the cross-entropy loss (Eq. (1)) to form our segmentation prediction loss, where α is a weight to combine the losses.

Complete training loss
Following Refs. [14,34,38], we also add a regular entropy minimization loss L_ent on the segmentation prediction of the target domain image, which encourages our model to produce predictions with high confidence; λ_ent is a weight that balances this term against the other losses in the complete training objective.
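The entropy minimization term can be sketched as the average per-pixel Shannon entropy of the target prediction (NumPy sketch; the weighting by λ_ent is applied when summing the total loss):

```python
import numpy as np

def entropy_loss(p_t, eps=1e-7):
    # p_t: (C, H, W) softmax prediction for a target image.
    # Shannon entropy of the per-pixel class distribution, averaged over
    # pixels; minimizing it pushes predictions toward high confidence.
    ent = -(p_t * np.log(p_t + eps)).sum(axis=0)  # (H, W)
    return ent.mean()
```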

Datasets and implementation details
Following most domain adaptation methods for segmentation, we evaluated our class-conditional domain adaptation method on semantic segmentation using two synthetic source domain datasets (GTA5 [39] and SYNTHIA [40]) and a real-world target domain dataset (Cityscapes [41]). This defines two adaptation tasks: GTA5 → Cityscapes and SYNTHIA → Cityscapes. The GTA5 dataset comprises 24,966 images, while the SYNTHIA dataset comprises 9400 images. Both synthetic datasets include pixel-level ground-truth semantic segmentation labels. The Cityscapes dataset contains 2975 training images and 500 validation images. To train the proposed domain adaptation model, we employed both images and ground-truth labels from either the GTA5 or SYNTHIA dataset as the source domain, and only the images (not the labels) from the Cityscapes training set as the target domain. We evaluated our model on the Cityscapes validation set over 19 classes for the GTA5 → Cityscapes task and over 13 and 16 classes for the SYNTHIA → Cityscapes task, as per convention [6,26]. We implemented our training and evaluation in PyTorch on a single GeForce RTX 2080 Ti GPU with 11 GB of memory. We used the DeepLab-v2 [42] framework with a small pre-trained VGG16 [43] model as the backbone for our feature encoder E and segmentation decoder S. For the discriminator module D, the fine-scale branch has a structure similar to that in Ref. [6]. For the coarse-scale branch, we share the first two convolution layers with the fine-scale branch and then apply three convolution layers with channel numbers {256, 512, C × 2}, a kernel size of 3, and a stride of 2 for downsampling. Except for the last convolution layer in each branch, every convolution layer in our discriminator module is followed by a Leaky-ReLU [44] layer with a slope of 0.2 for negative inputs. We also applied a two-stage training strategy using a self-training process, as in Refs. [7,25]. In stage 1, we trained an initial CCDA model for 100k iterations; in stage 2, we fine-tuned the entire CCDA model for another 100k iterations while adding self-training on the target domain.
To train our feature encoder E and segmentation decoder S, we used a Stochastic Gradient Descent (SGD) optimizer [45] with a momentum of 0.9 and a weight decay of 5×10^-4. The initial learning rate was set to 2.5×10^-4 and decayed during training. For the discriminator module D, we applied the ADAM [46] optimizer with β1 = 0.9 and β2 = 0.99. The initial learning rate was set to 1×10^-4 and decayed using the same policy as SGD. Our model was trained with a batch size of two, comprising one source domain image and one target domain image, and we resized the input images to H × W = 512 × 1024, as in Ref. [9]. Therefore, the number of patches for the coarse-scale discriminator branch is M = H/64 × W/64 = 8 × 16. We set the thresholds th_w = 0.4 and th_a = 0.95. We also set λ_sa = λ_ta = 0.0003 and λ_sd = λ_td = 0.5 across all loss functions, and α = 0.5, β = 0.4, λ_n = 3, λ_ent = 0.05 for the blended losses.
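The exact decay policy is not spelled out here; DeepLab-v2 training commonly uses a polynomial ("poly") schedule, so the following sketch assumes that policy (power 0.9) purely for illustration:

```python
def poly_lr(base_lr, it, max_it, power=0.9):
    # Polynomial decay schedule commonly used with DeepLab-v2. This is an
    # assumption on our part; the paper only states that the learning rate
    # "decayed during training".
    return base_lr * (1 - it / max_it) ** power
```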

Comparison with state-of-the-art methods
Tables 1 and 2 summarize the performance of our overall CCDA method compared with state-of-the-art methods on the two transfer tasks GTA5 → Cityscapes and SYNTHIA → Cityscapes, respectively. For a fair comparison, we compared our method with state-of-the-art methods using the same VGG16 backbone. These methods include adaptation on prediction-level representations (A-P): AdaptSeg [6], ADVENT [14], DPR [7], SSP [15], APO [34], ASA [13], TTDA [47]; adaptation on feature-level representations (A-F): FCNsW [20], Cross-city [8], SIBAN [9], OCE [38], SSF-DAN [33]; adaptation on both prediction- and feature-level representations (A-PF): CLAN [32]; self-training (ST) methods: CBST-SP [25], CDA [26]; and source-free (SF) methods: SFDA [29], UBNA [30]. Almost all of the representation-adaptation methods apply adversarial learning, as our method does, except for OCE. We also present visual comparisons in Figs. 3 and 4. However, because only a few state-of-the-art methods provide code and pretrained models, we could not obtain the predicted segmentation maps of all methods. Therefore, in these two figures we only present visual comparisons with methods that have available code and pretrained models, as well as better performance.
Fig. 3 Qualitative results of semantic segmentation on the GTA5 → Cityscapes task. For each target image, we show the corresponding ground-truth map and the results of AdaptSeg, SSF-DAN, and our proposed CCDA method. We highlight some improved predictions with white dashed boxes.
GTA5 → Cityscapes: Table 1 shows that our proposed CCDA method performs much better on average than all state-of-the-art methods on the GTA5 → Cityscapes task. This advantage derives from improvements over a wide range of classes. We achieved the best or second-best performance for nine classes, while for the other classes, we still reached results comparable to those of other methods. We also present visual comparisons of the proposed method with two other methods on this task in Fig. 3. These show that, compared with the other methods, our method can not only provide cleaner predictions of more frequent classes, such as sidewalks and roads, but also improve the detection of less frequent classes, such as signs and lights. This demonstrates the effectiveness of the proposed CCDA method with a class-conditional multi-scale discriminator and class-conditional segmentation loss.
Fig. 4 Qualitative results of semantic segmentation on the SYNTHIA → Cityscapes task. For each target image, we show the corresponding ground-truth map and the results of OCE and our proposed CCDA method. We highlight some improved predictions with white dashed boxes.
SYNTHIA → Cityscapes: Table 2 shows that our proposed CCDA method outperformed all methods in terms of mIoU over 16 classes on the SYNTHIA → Cityscapes transfer task. However, it achieved a slightly lower performance than SSF-DAN and OCE in terms of mIoU over 13 classes. Compared with the GTA5 dataset, the SYNTHIA dataset has fewer training images and may be more divergent from the Cityscapes dataset. Therefore, the SYNTHIA → Cityscapes transfer task is more difficult than the GTA5 → Cityscapes transfer task. To achieve explicit class-wise adaptation, the SSF-DAN method first separates the features into 19 parts for the 19 classes and then applies 19 sub-discriminators, each of which separately aligns the features of one class. It is possible that the independent training of separate discriminators for each class makes SSF-DAN slightly better than our method (by 0.2% in mIoU* over 13 classes) on this more difficult task with a larger shift. However, this strategy requires training 19 sub-discriminators for the 19 classes of the Cityscapes dataset. Because each sub-discriminator is normally approximately 9 MB, the overall model size of the 19 sub-discriminators in SSF-DAN could be approximately 170 MB. By contrast, our proposed class-conditional discriminator with two branches employs a single discriminator for all classes and occupies a much more economical 15 MB, which facilitates training and expansion to more categories. Therefore, the proposed CCDA method can be more efficient during training. It may also be more flexible than the sub-discriminator-based system of SSF-DAN for datasets with more categories, because we need only change the number of output channels in the coarse-scale branch of the proposed class-conditional discriminator.
OCE also slightly outperformed our method in terms of mIoU* over 13 classes on the SYNTHIA → Cityscapes transfer task. It achieves feature alignment using contrastive learning, which aims to maximize the distance between different classes in the feature space. This strategy can achieve promising results on this more challenging transfer task by better distinguishing lower-frequency classes and smaller objects, such as motorbikes and riders. However, OCE outperformed our method by only 0.5% in mIoU* over 13 classes, while in mIoU over 16 classes it was 0.1% worse than our proposed method. This suggests that consistently superior results for both mIoU* and mIoU on this transfer task could be achieved by incorporating contrastive learning into our class-conditional adaptation system, which is a promising topic for future work.
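As an illustration of the contrastive idea attributed to OCE above, the toy loss below pulls each feature toward its own class prototype and pushes it away from the others. This is a minimal InfoNCE-style sketch of the general principle, not OCE's actual formulation, and all names in it are our own.

```python
import math

def class_contrastive_loss(feats, labels, tau=0.1):
    """Toy prototype-based contrastive loss (a sketch, not OCE's loss).

    feats: list of feature vectors; labels: parallel list of class ids.
    Each feature is attracted to its class-mean prototype and repelled
    from the prototypes of other classes via a softmax over similarities.
    """
    classes = sorted(set(labels))
    protos = {}
    for c in classes:
        members = [f for f, l in zip(feats, labels) if l == c]
        dim = len(members[0])
        protos[c] = [sum(m[d] for m in members) / len(members) for d in range(dim)]

    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den

    loss = 0.0
    for f, l in zip(feats, labels):
        sims = {c: math.exp(cos(f, p) / tau) for c, p in protos.items()}
        loss += -math.log(sims[l] / sum(sims.values()))
    return loss / len(feats)
```

Well-separated class features yield a lower loss than overlapping ones, which is the sense in which such a loss "maximizes the distance between different classes" in the feature space.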
Compared to OCE, our method performed better on 9 out of 16 classes for the SYNTHIA → Cityscapes task. For the GTA5 → Cityscapes transfer task, OCE was 5.8% worse than our method in mIoU over 19 classes and achieved worse results on almost all classes. We believe that this large improvement arises primarily because our method applies the class-conditional approach to both feature extraction and feature alignment, which generally improves overall performance. In addition, our proposed CCDA method may generalize better across transfer tasks and classes than OCE. On the SYNTHIA → Cityscapes transfer task, the proposed method achieved the best or second-best performance for two classes compared with other state-of-the-art methods. Our method also achieved results comparable to those of the other methods over a large range of classes on this task, which led to better overall performance. We also present a visual comparison of our method with OCE on this task in Fig. 4, which indicates that our method provides cleaner predictions on more frequent classes, such as sidewalks and roads, and improved detection of less frequent classes, such as bicycles.
In general, we observe that, compared with other methods that boost the performance of a few classes while sacrificing that of others, our class-conditional method often boosts the performance of almost all classes. Thus, although our method may not achieve the best performance on every class, it ultimately achieves a higher mean IoU overall.
Complementarity with image translation: We also note that image translation, which reduces the style differences between the images of the two domains, is complementary to methods that adapt the representation. It has been applied in several recent state-of-the-art domain adaptation methods [27, 48] to generate translated source domain images in the target style, further alleviating the shift between the two domains at the input image level.
Although our method focuses primarily on domain adaptation of the representation, we also conducted experiments on the GTA5 → Cityscapes task to demonstrate the complementarity of the proposed CCDA method with an image translation technique. For simplicity, we adopted the translated GTA5 images generated by Ref. [27], together with the original GTA5 images, as the source domain inputs to train our model. We also applied two rounds of self-supervised learning following Refs. [27, 28, 49] to enhance the performance. All other settings in this experiment were the same as those described in Section 4.1.
We compared our method with six state-of-the-art methods using image translation techniques: BDL [27], SEDA [50], LTIR [28], ITRA [48], CRA [31], and DPL [49]. Here, LTIR and CRA combine their proposed structures with existing image translation techniques (e.g., BDL), whereas the others train their own image translation modules. The quantitative results of our method ("Ours+IT") and these state-of-the-art methods, all using VGG16 as the backbone, are listed in Table 3. Note that we used image translation in a very simple manner: we did not use two types of style-transferred source domain images as LTIR does, nor a teacher-student network that requires larger memory and computational costs during training, as SEDA does. Our proposed CCDA with image translation still outperformed most state-of-the-art methods, which demonstrates the complementarity of the proposed CCDA method with image translation.
Our performance was worse only than that of DPL, because DPL applies a dual-path learning strategy that trains two image translation networks together with two corresponding segmentation networks. By jointly training the image translation and segmentation networks, instead of directly using translated images from an existing method as we do, DPL generates translated images that are better suited to its segmentation networks. Moreover, DPL's two image translation networks not only generate translated source domain images in the target style, as BDL and ITRA do, but also produce translated target domain images in the source style. Using translated images from both domains helps to better alleviate the shift between the two domains; the translated target domain images are then used as inputs to obtain the segmentation predictions. This further improves performance; however, it also reduces the efficiency of DPL during testing.

Ablation studies
Ablation study on different components: To better understand the impact of each component of our adaptation model, we conducted an ablation study by selectively deactivating each component and measuring the effect on the performance of the GTA5 → Cityscapes transfer task. Specifically, we defined four nested subset models: (1) B: the basic domain adaptation architecture in Section 3.1 with the segmentation loss (Eq. (1)) and a fine-scale basic discriminator for adaptation (Eqs. (2) and (3)); this corresponds to setting α = 1 and β = 1 in the blended losses. (2) B + S_c: B with the segmentation loss replaced by our class-conditional segmentation loss. (3) B + S_c + D_c: model (2) with the class-conditional multi-scale discriminator added.
(4) Ours: our final model with the extra entropy minimization loss in Eq. (18).
The results are presented in Table 4, showing that our overall CCDA system yielded a 3.1% mIoU gain over the basic domain adaptation architecture. The class-conditional segmentation loss alone was responsible for a 1.5% improvement, the class-conditional discriminator produced an additional 1.3% improvement, and the entropy minimization loss provided a further slight 0.3% improvement. This verifies the importance of both the class-conditional segmentation and discriminator components of our CCDA approach. Qualitative segmentation examples are shown in Fig. 5.
To analyze the impact of each component on the different classes more effectively, we also present the frequency of each class in the Cityscapes dataset, in descending order, in Table 4. Measuring the segmentation prediction in a class-wise manner in "B + S_c" tends to improve the performance on some less frequent classes, such as lights and bikes. This is reasonable because the class-conditional segmentation loss may increase the weight of the loss on classes with lower frequencies or objects with smaller sizes, which improves the overall performance and prevents the model from neglecting these classes during adaptation. However, it may also sacrifice performance on more frequent classes, such as roads and skies. Further adding our class-conditional discriminator, as in "B + S_c + D_c", promotes class-wise adaptation through the design of our coarse- and fine-scale branches. This achieves improvements on both more frequent classes, such as roads and sidewalks, and less frequent classes, such as riders and buses. "Ours", with the extra entropy minimization loss, further helps to balance the performance across classes and slightly improves the overall performance. Compared to the basic model "B", our overall CCDA system ("Ours") achieved improvements on almost all classes. Figure 5 also demonstrates the effectiveness of each component of our CCDA system. Compared to "B", "B + S_c" performed better on some smaller objects and less frequent classes, such as signs and lights. However, for some larger objects and more frequent classes, such as sidewalks and roads, the performance of "B + S_c" may not improve or may even worsen. "B + S_c + D_c" further improves the performance on some less frequent classes, such as terrain, while also achieving improvements on some larger objects and more frequent classes compared to "B + S_c".
"Ours" further slightly improves the results by balancing the performance among different classes. Therefore, our overall CCDA system enhances performance through general improvements across various classes.
Ablation study on thresholds: We apply two thresholds in our method: th_w and th_a. To avoid neglecting any class that occurs in a target domain patch when deriving the coarse-scale class labels used in the class-conditional coarse-scale discriminator branch, we set th_w = 0.4. To avoid ignoring any ambiguous pixels in the target domain images in the class-conditional fine-scale discriminator branch, we set th_a = 0.95. To explore the sensitivity of the two thresholds, we conducted experiments on the GTA5 → Cityscapes task with different values of th_w and th_a. The results are presented in Tables 5 and 6, which show that, within reasonable ranges of these thresholds, the overall performance of the proposed method is not significantly affected.
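To make the role of the two thresholds concrete, the sketch below shows one plausible reading of how they could be applied to a patch of softmax outputs: th_w decides which classes are marked as present in the patch, and th_a decides which pixels count as confident. The exact rules in our implementation may differ, and both function names are ours.

```python
def coarse_class_labels(probs, th_w=0.4):
    """Per-patch class-presence labels (a sketch; the exact rule may differ).

    probs: [H][W][C] softmax outputs for one patch. A class is marked
    present if any pixel in the patch assigns it probability > th_w.
    """
    num_classes = len(probs[0][0])
    present = [0] * num_classes
    for row in probs:
        for pixel in row:
            for c, p in enumerate(pixel):
                if p > th_w:
                    present[c] = 1
    return present

def confident_mask(probs, th_a=0.95):
    """Mark pixels whose top softmax score reaches th_a as confident;
    the remaining pixels are treated as ambiguous (an assumed reading)."""
    return [[max(pixel) >= th_a for pixel in row] for row in probs]
```

With th_w = 0.4, a class can be flagged as present even when no pixel is fully confident about it, which is consistent with the stated goal of not neglecting classes that occur in a patch.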

Limitations and future work
As illustrated by the previous experimental results (Figs. 3 and 4), the proposed CCDA method performs well in most situations. However, it still has some limitations: for example, it may fail to detect some objects of very small size, or it may be unable to predict clear edges for large objects in complicated scenarios. Therefore, in future work, we will investigate how to improve the performance by training our model with the help of multiple related tasks, such as depth estimation, object detection, and boundary detection. These related tasks can benefit the segmentation model by providing additional information. For example, an object detection task [51, 52] may assist the network in maintaining semantic information for smaller objects, while boundary detection [53] can help to learn a clearer contour for each object. Moreover, video segmentation [54, 55] is a challenging task, and we plan to investigate how our proposed class-conditional method can be extended to video semantic segmentation in an unsupervised manner, as well as to more applications with various datasets.

Conclusions
We developed a novel approach to an important problem in domain adaptation for semantic segmentation: representations of different classes differ in how easily they can be extracted and aligned, which may affect the adaptation performance.
The solution hinges on the introduction of class-conditioning at multiple points in the model, including a class-conditional segmentation loss and a class-conditional multi-scale discriminator, which measure the segmentation prediction and the adaptation in a class-wise manner. The ablation study demonstrates that our overall CCDA method improves performance on almost all classes and boosts overall performance. Extensive experimental results demonstrate the effectiveness of our method, which reaches results comparable to, and in some cases surpassing, those of state-of-the-art methods.

Fig. 1
Fig. 1 Overview of our proposed Class-Conditional Domain Adaptation system. It consists of three parts: Feature Encoder E, Segmentation Decoder S, and Discriminator D. Orange arrows indicate the flow for the source domain, green arrows indicate the flow for the target domain, and grey arrows represent the flow for both domains. Given a source image I_s and a target image I_t, we first pass them through E and S to obtain their feature-level representations F_s, F_t and pixel-level segmentation predictions P_s, P_t. Then, F_s and F_t from the two domains are input into D for feature-level representation alignment. To fairly measure the segmentation prediction for each class, we propose a class-conditional segmentation loss to supervise P_s of I_s based on its ground-truth label Y_s. To measure the feature alignment in a class-wise manner, we design a class-conditional multi-scale discriminator. The fine-scale branch of D uses a class-conditional adaptation loss to pay equal attention to the pixel-level alignment of each class, while the coarse-scale branch allows the patch-level adaptation of each class to be separately and flexibly measured. More details of the coarse-scale branch of D are shown in Fig. 2.
To apply class-conditional adaptation, Du et al. designed a sub-discriminator system with 19 sub-discriminators, each specially trained for one corresponding class, and achieved good performance. By contrast, the class-conditional discriminator we designed performs both semantic classification and adaptation. It maintains semantic information while measuring the class-wise adaptation adversarially, which avoids the use of sub-discriminators. The output of our class-conditional coarse-scale discriminator branch consists of two tensors, O_s and O_t, each with C channels. O_s[c, j, k] estimates the probability that patch (j, k) contains one or more pixels drawn from class c of the source domain, while O_t[c, j, k] estimates the probability that patch (j, k) contains one or more pixels drawn from class c of the target domain.

Fig. 2
Fig. 2 Details of our class-conditional discriminator on the coarse-scale branch. (a) Overview. (b) Details for one patch (j, k) from the source domain as an example. For better understanding, p_c = O^c_s[c, j, k] is the estimated probability that patch (j, k) from the source domain contains one or more pixels belonging to class c. q_c = O^st_s[c, 1, j, k] represents the estimated probability that pixels of class c in patch (j, k) belong to the source domain. O^st_s[c, 2, j, k] = 1 − q_c indicates the probability that pixels of class c in patch (j, k) belong to the target domain. Here, the use of the class-conditional domain vectors O^s_s and O^t_s (O_s and O_t for the source domain) allows multiplexing of both domain and class information, informing an adversarial adaptation loss based on class and a non-adversarial classification loss. O^c_s is supervised by the corresponding class label W_s to maintain semantic information, while O^st_s is supervised by the domain label for class-wise adaptation, with W_s as weights. This allows flexible and separate feature alignment for each class.

Fig. 5
Fig. 5 Qualitative results of the ablation study on the GTA5 → Cityscapes task. For each target image, we show the corresponding ground-truth map and the results of each subset model in the ablation study. We highlight some improved predictions with white dashed boxes.
where O^st_s is the output O^st for the source domain image, O^st_t is the output O^st for the target domain image, and M is the number of patches. Here, the discriminator is trained to precisely distinguish the features from the two domains for each class.
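A minimal sketch of a loss consistent with this description could weight a per-patch domain cross-entropy by the class-presence weights. This is an assumed form, not the paper's exact equation, and the indices and names are ours.

```python
import math

def coarse_adaptation_loss(O_st, W, source=True):
    """Class-weighted patch-level discriminator loss (illustrative sketch).

    O_st[c][d][m]: probability that class-c pixels in patch m come from
    the source (d = 0) or target (d = 1) domain.
    W[c][m]: class-presence weight of class c in patch m, so that only
    classes actually occurring in a patch contribute to the loss.
    """
    d = 0 if source else 1
    total, norm = 0.0, 0.0
    for class_probs, w_row in zip(O_st, W):
        for m, w in enumerate(w_row):
            total += -w * math.log(max(class_probs[d][m], 1e-8))
            norm += w
    return total / max(norm, 1e-8)
```

Training the discriminator with such a loss on images from both domains encourages it to distinguish source from target features separately for each class, which is what the class-presence weights enable.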

Table 1
Adaptation from GTA5 to Cityscapes. We present the per-class and mean IoU. Here, "A" represents adaptation-on-representation methods, "-P" represents adaptation on prediction-level representations, and "-F" represents adaptation on feature-level representations. "ST" and "SF" represent self-training and source-free methods, respectively. We highlight the best and second-best results in each column.

Table 3
Adaptation from GTA5 to Cityscapes with an extra image translation (IT) technique. We present the per-class and mean IoU and highlight the best result in each column in bold font.

Table 5
Sensitivity analysis of threshold th_w

Table 6
Sensitivity analysis of threshold th_a