1 Introduction

Semantic segmentation refers to the task of classifying each pixel in a given image into its semantic category, providing pixel-level masks of images. This task is especially relevant for urban scene scenarios, where accurate pixel-wise understanding of the image can help in many applications, such as autonomous driving and robot vision.

Deep Neural Networks (DNNs) have proven their efficacy for segmentation, being the current state of the art in different tasks, such as semantic segmentation [41], X-ray lung segmentation [53], brain MR image segmentation [54] and video object segmentation [72]. However, DNN training depends on extensive amounts of labeled images, which are expensive to produce. To this end, the usage of synthetic data for training neural networks becomes an alternative to obtain large corpora at a reduced cost. Therefore, many efforts have shifted the focus onto synthetic data as a plausible solution [47, 65], using 3D environments with models of the semantic objects to classify [18, 21, 46]. Although promising, synthetic images often present a different visual appearance than real images: light reflection, color saturation and shadows make synthetic images distinguishable from real ones, especially in urban scenes, where synthetic images are not able to capture the wide variability of real images (e.g., light conditions). While many approaches have been proposed to obtain synthetic data [6, 7, 20, 25, 50], no methodological generation of synthetic data has been defined for the specific task of semantic segmentation. Thus, a methodology for obtaining synthetic data at different complexity levels would be desirable.

Regarding the visual appearance discrepancy between real and synthetic images, Domain Adaptation (DA) encompasses the class of techniques that extrapolate generalities obtained from synthetic data to the real domain. Two main research lines can be differentiated depending on whether or not real-image ground truth is available during training. Weakly Supervised Domain Adaptation focuses on abstracting knowledge from both domains [57, 69, 81], using a combination of extensive amounts of labeled data from synthetic images and a small amount of labeled real images to refine the segmentation DNN. Differently, Unsupervised Domain Adaptation (UDA) tries to generalize to the real data without relying on labeled real images by aligning features from both domains [16, 22, 63, 64, 74]. The UDA literature commonly focuses on generating domain-agnostic features by aligning outputs from different levels of the model. However, such alignment is defined for specific layers of the model, hence impeding the straightforward extrapolation of successful proposals to different architectures. Furthermore, most works follow a random image presentation during training due to the absence of a definition of sample-based complexity in semantic segmentation, even though smart image presentation protocols have proven effective in other fields [2]. Therefore, comparing the existing approaches and defining a curriculum (applicable to semantic segmentation employing different sources of synthetic and real data) would be beneficial to understand and improve performance on real data.

In this paper, we address the above-mentioned limitations for semantic segmentation (1) by proposing a protocol for synthetic data generation; (2) by analyzing popular strategies for training and transfer learning using real and synthetic data; and (3) by defining a curriculum-based strategy to effectively combine multiple sources of synthetic data. We employ the simulation tool Multi-camera System Simulator (MSS) [24] to generate synthetic sets with several configuration options (e.g., number of classes, viewpoints). The synthetic data generated by our protocol is compared against widely used datasets [46, 47] for different training strategies, such as combined training [43] and fine-tuning [57], and also for different proportions of real and synthetic data. Moreover, we propose a curriculum-based learning strategy relying on the hypothesis that an increasing-complexity data feeding strategy generalizes better to the target real data than a standard-paced (i.e., random) strategy using the same data. We take advantage of our protocol to generate datasets with increasing complexity (defined as the number of instances in the dataset) and use these datasets for curriculum learning. The experimental results show not only that the proposed generation protocol outperforms existing synthetic datasets, but also that combining different sources and structuring the training process in an incremental-complexity manner can improve state-of-the-art performance. It is noteworthy that the proposed data generation protocol and training strategies can be applied to any architecture for aligning different synthetic domains to real data, without relying on specific alignment terms.

The contributions of the proposed approach are:

  • A new design protocol for synthetic data generation based on virtual scenario simulators.

  • Identifying and comparing training strategies for weakly supervised domain adaptation in semantic segmentation, measuring the impact of different synthetic sources and different proportions of real data.

  • Proposing a new strategy based on curriculum learning for employing different sources of data, applicable to DNN-based approaches for semantic segmentation.

The paper is organized as follows. Section 2 reviews the state of the art on domain adaptation and synthetic data usage for semantic segmentation. Section 3 introduces the criteria and design protocol for synthetic data generation. Section 4 describes the selected strategies for weakly supervised domain adaptation. Section 5 presents the experimental results, including a comparison with the state of the art. Finally, concluding remarks are presented in Section 6.

2 Related work

2.1 Domain adaptation

A basic assumption in machine learning is that training and test data are sampled independently from an identical distribution. In the context of domain adaptation this assumption is not fulfilled [28]: there are two domains with clear visual discrepancies, the source data, used for training, and the target data, used for fine-grained training and testing. Therefore, direct training on the source data leads to a significant performance drop on the target test set. This hindrance is commonly known as domain shift.

Single-source domain adaptation

To alleviate the domain shift, one may extrapolate knowledge from synthetic to real images in the training process of the model. Depending on whether the ground truth of real images is available during training, this can be further classified into Unsupervised Domain Adaptation (UDA), if no labeled real images are available during training, and Weakly-Supervised Domain Adaptation (WSDA), if a small set of labeled real images is available during training. UDA frameworks employ target RGB images during training to align features from both domains [36, 63, 64, 74, 83]. However, other works show that effective extrapolation to the real domain can be obtained even without including any real image during training, by gradually increasing the complexity of the sample images in a curriculum manner [17, 27, 37]. Following this idea, we propose a synthetic dataset generation protocol to aid the straightforward implementation of an easy-to-hard image presentation for semantic segmentation. WSDA approaches [9, 57, 69, 81] deploy high-performance models for real data by adapting from abundant labeled synthetic images and scarce, insufficient labeled real data. Other approaches follow some sort of adversarial learning strategy [25, 62], characterized by the inclusion of an additional, usually small, discriminator network that tries to determine from the segmentation maps whether the input RGB image is real or synthetic. However, adversarial training is generally known to be difficult due to its instability [63]; hence, we do not adopt these approaches in this work and use them only for comparative purposes.

In different computer vision fields such as object localization, some works obtain state-of-the-art performance by defining an easy-to-hard presentation commonly known as curriculum learning [2]. Nowruzi et al. [43] studied the impact of the real data size in weakly supervised object localization. Similarly, Zheng et al. [57] studied the impact of different pacing strategies when using different ratios of real images. Following this line of work, our proposal aims at further structuring the paced presentation of different sources of data when compared to the typical fine-tuning and combined training strategies. To the best of our knowledge, there is no other similar study for semantic segmentation.

Multiple-source domain adaptation

In practice, the source labeled data may come from different domains, such as different simulators for synthetic data or day and night images for real domains; this motivates the research on Multiple Source Domain Adaptation (MSDA) techniques. However, multiple-source combined training usually leads to worse-performing models than employing one single source for training [19, 77]. In order to overcome this limitation, many authors focus on aligning features from all source domains with target domain features [3, 11, 17, 28, 37, 50, 51, 62, 63, 66, 74, 81]. Three common methods for aligning features in single and multiple source domain adaptation are:

  • Discrepancy based: These alignment frameworks focus on minimizing an explicit distance measure between features obtained in the target and source domains [60]. Various distribution discrepancy metrics have been introduced, including Maximal Mean Discrepancy (MMD) [26], Correlation Alignment [56] and Wasserstein distance [1]. MMD is currently the most widely used metric to measure the distance between two feature distributions [35].

  • Adversarial based: These proposals rely on the inclusion of a discriminator model which measures how domain-discriminative the features generated by the segmentator are. Following the typical adversarial scheme of GANs [25, 62], this training paradigm becomes a min-max game, where the segmentator model aims at fooling the discriminator. In essence, by minimizing the performance of the discriminator, the segmentator is minimizing the gap between domains in the feature space [3, 28, 50, 62, 63, 64, 74].

    Recently, this idea has been made more explicit by adversarial methods that define strategies to translate image appearance from one domain to another. These proposals tend to combine adversarial training with an additional consistency term, which measures the discrepancy between the output produced from the original image and that from the translated image [76]. Intuitively, these proposals minimize the domain shift by aligning features across the real and synthetic domains, while maximizing the performance in a shared domain.

  • Entropy based: These works minimize the entropy on the target domain. As the labels of the target images are not available during training, minimizing the entropy acts as a self-supervision mechanism. Let C be the number of classes, H and W the height and width of the input image, and Yt the one-hot encoded C-vector label. The prediction entropy is \(E = -{\sum }_{C}{\sum }_{H,W} P(X_{t})\log(P(X_{t}))\), and the classical cross-entropy loss is [11, 63]:

    $$ L_{seg} = -\sum\limits_{C}\sum\limits_{H,W} Y_{t}\log(P(X_{t})) $$
    (1)
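For illustration, the following is a minimal sketch of both losses in PyTorch (our own sketch, not an implementation from the cited works), assuming a segmentation model that outputs raw logits of shape (batch, C, H, W); the function names are illustrative:

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Prediction entropy E = -sum_C sum_{H,W} P log P on unlabeled target images."""
    p = F.softmax(logits, dim=1)            # per-pixel class probabilities P(X_t)
    log_p = F.log_softmax(logits, dim=1)    # numerically stable log P(X_t)
    return -(p * log_p).sum(dim=1).mean()   # sum over classes, mean over pixels/batch

def seg_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy L_seg of (1); labels are (batch, H, W) class indices."""
    return F.cross_entropy(logits, labels)
```

Entropy-based methods typically minimize seg_loss on labeled source images plus a weighted entropy_loss on unlabeled target images.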

Summary of domain adaptation proposals

Table 1 summarizes the explored proposals dealing with Domain Adaptation in computer vision. All the proposals for semantic segmentation tasks [3, 11, 28, 37, 50, 51, 62, 63, 66, 74, 81] employ Deeplab and/or FCN as the segmentator of choice. Hence, for a fair comparison, we employ both architectures in our experiments.

Table 1 Summary of state of art domain adaptation proposals

In particular, we propose to define a protocol to sequentially include different source domains, generalizing in a more stable and reliable manner than adversarial and entropy based frameworks [63, 64, 74]. Furthermore, compared to discrepancy based frameworks [35, 60], our protocol does not impose normality nor homogeneity, hence providing a more relaxed framework extrapolable to any MSDA problem. Alignment-free adaptation has proven useful in other computer vision fields such as object detection. Hinterstoisser et al. [27] show that effective extrapolation to the real domain can be obtained without employing any alignment metric to the real images. Specifically, domain adaptation is obtained by generating increasingly complex synthetic images, modifying the scale and point of view of a 3D model of the target object over random backgrounds, thereby generalizing to the real domain only by structuring the training. In this work, we propose a similar protocol for urban scene segmentation. We argue that by generating increasingly complex images, modifying factors such as the foreground and background scale, the capture point of view and the number and types of objects, we can define an effective curriculum that generalizes better to the real domain.

2.2 Curriculum learning

Bengio et al. [2], inspired by schooling principles, proposed to train machine learning algorithms with basic (easy) samples first and advanced (hard) samples later. In order to define which samples should be included first in the training and which last, Curriculum Learning (CL) needs to define some sort of complexity measure. This complexity can target different hyper-parameters or inputs of the training process, such as the target task and the performance measure [55]. However, complexity is typically measured on a sample basis and, as training continues, the probability of selecting hard samples for training is increased. In the context of sample-based complexity, different approaches to measure complexity have been proposed, with manual annotation [30, 45] and the performance of a teacher model (i.e., generally a model that has been trained in a standard fashion and is used to probe sample complexity) [23] as the most used strategies.

In the context of domain adaptation, sample-based curriculum learning has been effectively employed in different tasks such as object detection [27], sentiment classification [80] and image classification [71]. These sample-based curricula can be further classified into sampling-focused [27] and weighting-focused [71, 80]. Weighting-focused curricula attempt to weight the importance of each sample in a batch depending on the training stage, e.g., giving smaller weights to harder samples at its beginning. Luyu et al. [71] and Sicheng et al. [80] propose to train a Manager function which, given an input batch, outputs a scalar weight for each sample. These frameworks have the drawback of the computational overhead required for training the Manager. On the other hand, sampling-focused curricula use predefined sample complexities and attempt to automatically select the optimal set of samples given the current status of the model, in essence defining binary weights for each sample. Hinterstoisser et al. [27] define a sampling curriculum by generating increasingly complex synthetic images for object localization. By formulating object localization in terms of the scale and rotation angle of a 3D model of a target object, they are capable of generating a sample-based curriculum in which the sample complexity is quantified by these factors. In a similar manner, we propose to generate a dataset structured in different levels of sample complexity to define a sample-based curriculum for semantic segmentation.
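As an illustration of a sampling-focused curriculum (a sketch under our own assumptions, not the exact procedure of [27]), samples can carry a precomputed complexity score and the pool of eligible samples can grow with training progress; the linear pacing and the initial 20% pool size below are arbitrary choices:

```python
def curriculum_pool(samples, complexities, progress):
    """Return the easy-to-hard training pool for a given progress in [0, 1]."""
    # Rank samples from easiest to hardest by their precomputed complexity.
    ranked = [s for _, s in sorted(zip(complexities, samples), key=lambda t: t[0])]
    # Start from the easiest 20% of samples and linearly unlock the rest.
    cutoff = max(1, int(len(ranked) * (0.2 + 0.8 * progress)))
    return ranked[:cutoff]

# Usage: at each epoch, draw batches uniformly from the unlocked pool, e.g.:
# pool = curriculum_pool(train_samples, scores, progress=epoch / num_epochs)
```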

Although employing some sort of curriculum in semantic segmentation has already been attempted, a sample-based curriculum for semantic segmentation is yet unattempted. The works in [74, 75] attempt a curriculum for semantic segmentation by progressively refining the output of the segmentator from the label distribution to a pixel-wise classification; this curriculum is defined on a task level, rather than on a sample basis. Furthermore, they focus on single-source unsupervised domain adaptation, in contrast to our multi-source weakly supervised domain adaptation framework.

2.3 Synthetic data generation for semantic segmentation

Generative-based

Some authors propose to enhance the realism of synthetic images from simulators to alleviate the domain shift [7, 59, 69]. To this aim, a style transfer architecture is trained to generate new images, following the generator-discriminator approach established by [25]. Although promising, one problem remains unsolved: the generator network can hallucinate new objects [6] which, as the ground truth is not modified, will not be present in the ground-truth map. Furthermore, this approach requires additional computational effort to train the generator DNN and currently presents worse performance than training with synthetic data from simulators [76].

Simulation-based

Two different sources of synthetic data are commonly used: GTAV [46] and Synthia [47]. Figure 1 includes visual examples of GTAV, Synthia and the proposed MSS datasets, illustrating distinct light reflection, textures and design in their representations. GTAV is composed of 25K images from the game Grand Theft Auto V. In contrast with the other synthetic datasets, GTAV is composed of individual images rather than video sequences. Synthia is generated with a virtual camera placed on a virtual car driving through the city under different weather conditions. All synthetic datasets present images with an appearance aligned to sequences of the Cityscapes dataset. However, GTAV and Synthia do not provide a predefined structure of complexity, impeding the formalization of an incremental easy-to-hard presentation of images [2, 27]. In this situation, the MSS provides a functionality to address the above-mentioned shortcoming. Moreover, while the code for annotating a GTAV-like dataset is publicly available, one has to buy the original game in order to render and annotate images, hence impeding the universal usage of the engine.

Fig. 1
figure 1

Images of GTAV, Synthia and MSS datasets with different capture points of view and spatial distributions

2.4 Real datasets for semantic segmentation

In semantic segmentation, three real datasets are usually selected as target domain: Kitti [21], Cityscapes [14] and Mapilliary [42]; Fig. 2 includes visual examples for these real datasets. Despite all depicting urban scenarios, these datasets present inherent visual discrepancies attributed to the different geographic location of each dataset, the car models, the street disposition, the traffic lights and the building architecture. Furthermore, each dataset follows distinct design criteria. Kitti provides a smaller dataset with frames captured from the top of a car, unlike the other datasets, which were filmed with a camera inside a car. Kitti visually differs not only because of the camera position, but also due to the lighting conditions: burnt (overexposed) patches can be found on some of the instances due to a higher camera exposure and drastic light changes when the car turns to face the sunlight directly. Cityscapes was generated by filming with a camera inside a Mercedes while driving through different German cities. This design implicitly brings unique biases, such as having the Mercedes logo and the front of the car always at the bottom of the picture. In addition, due to the camera position, little to no sky is present in the frames, in contrast with the other real datasets. Mapilliary, on the other hand, was generated by filming from different points of view. Most of the sequences were filmed inside a car looking straight through the windshield; however, different capturing angles were used, in contrast with Cityscapes, in which the position and angle are consistent throughout the dataset. In addition, some sequences are filmed from a pedestrian, motorcycle or touristic bus point of view. Almost 90% of the Mapilliary dataset was filmed from road/sidewalk views in urban areas; the remaining sequences are from highways, rural areas and off-road. When comparing real datasets, further discrepancies can be found, such as Cityscapes presenting fewer poles than Mapilliary. This is due to Cityscapes depicting cities where the wiring is located underground, unlike Mapilliary, which is obtained from cities where the wiring tends to be supported by utility poles. Due to the small size of Kitti, 500 images, it is typically only used for testing in the literature. In this work we follow this pattern by only using this dataset to assess some hypotheses and not for training.

Fig. 2
figure 2

Images of Kitti, Cityscapes and Mapilliary datasets with different capture points of view and spatial distributions

Figure 3 illustrates how state-of-the-art datasets lack a predefined structure of complexity for semantic segmentation, as all datasets are captured using similar points of view with similar object scales and scene distributions.

Fig. 3
figure 3

Samples of GT labels of existing popular semantic segmentation datasets for urban scenes: Cityscapes, Mapilliary, GTAV, Synthia. Top row presents synthetic images, bottom row presents real images

3 Synthetic dataset based on the MSS simulator

As previously mentioned, current real and synthetic datasets lack a procedure for the generation of images of increasing complexity. Here, we discuss the criteria for such generation and the obtained synthetic dataset.

3.1 Design criteria

Before creating the dataset, we need to understand how the point of view of the camera may impact the extrapolation to the real domain, which helps to define the design criteria for the MSS dataset. The MSS simulator offers a high degree of freedom for camera placement by allowing wearable cameras on moving objects such as helicopters, cars or pedestrians, and also fixed cameras on specific points in the virtual city. We focus on aligning the point of view to those of the target (real) datasets, so that synthetic sequences look similar to the real ones. Table 2 includes the results of an early experiment training a Deeplab V3 architecture [12] with a dataset composed of one synthetic sequence and a small subset of real images. This experiment allows understanding which capturing point of view has the smallest domain gap with respect to the real domain. As seen in Table 2, we find that fixed cameras and wearable cameras from a car's egocentric point of view provide greater improvements than the other alternatives. Meanwhile, wearable cameras from pedestrian, helicopter and bus points of view provide views which are not present in the target sets; therefore, training with these sequences yields worse performance on the target validation set.

Table 2 Preliminary study on the impact of the inclusion of each type of sequence to a 5% random selection of the Cityscapes and Mapilliary train sets for training a Deeplab V3 architecture [12]

Further analyzing the impact of wearable and fixed cameras (see Fig. 4), we find that wearable cameras provide more diversity due to the changing background. However, there is a scarcity of some urban elements that are generally less common than straight road sections in cities, such as turns, roundabouts and intersections. This scarcity results in models with poor performance on unseen spatial allocations, such as intersections where the sidewalk is divided by a road lane without continuation. On the other hand, placing fixed cameras on less common spatial allocations leads to a better representation of them; however, it results in models less accurate on common scenarios compared to the one trained with car wearable-camera sequences.

Fig. 4
figure 4

RGB validation images (left column), results of training with fixed cameras (middle column) and wearable cameras from cars (right column)

The proposed dataset contains sequences of the following types: fixed, car and pedestrian. These sequences are created according to the proposed protocol explained in the next section.

3.2 Protocol

The MSS dataset is structured into subsets by the amount of objects present in the virtual scenario, with the aim of creating subsets of increasing complexity. This protocol is inspired by [27], applied to object detection, which accomplishes an increase in performance by sorting the order in which images are shown to the model during training. We hypothesize that learning the general structure of urban scenes can be facilitated by starting with easy examples containing few hard instances such as cars, poles and pedestrians. Later, the learned urban layout can be refined with more complex examples. This data ordering gives the model a distinct advantage over a protocol where the model needs to understand the whole structure from scratch. We captured sequences in the virtual scenario with different points of view and different amounts of cars and pedestrians present in the virtual scenario, allowing a classification of the sequences by point of view and amount of cars and pedestrians. This categorization of the sequences provides a straightforward implementation of a learning protocol where complexity is periodically increased through the inclusion of more complex training examples.

Complexity is parametrized, in order of importance, as follows. First, the amount of moving agents in the scenario, i.e., the amount of vehicles and pedestrians acting in the virtual city while filming the sequence. This parameter regulates the maximum amount of non-static objects present per image, ranging from 50 to 750. Second, the amount of included points of view. Straight poses are easier to understand for CNNs compared to rotations [27]. Therefore, straight views are considered easier and are predominant in the less complex datasets, while more complex ones include different object scales and different points of view. The harder the sequence is intended to be, the further the camera is placed, in terms of degrees with respect to the road and meters from the agents, ranging from +70 to -70 degrees. This change in the camera angle affects the shape and appearance of objects, hence increasing the complexity [27]. Finally, the predominant spatial distribution of the sequences. The spatial distribution is graded by the type of sequences employed, ranging from 0 to 40% of wearable cameras. Pedestrian wearable cameras predominantly include buildings and sidewalks rather than the predominant centered-road distribution. Furthermore, pedestrian wearable cameras present wider rotation freedom when compared to a car, which can only turn at specific points of the scenario, making the sequences less stable and harder to extrapolate to the general spatial distribution found when driving a car (see Table 2). Figure 5 includes RGB images (Fig. 5a-c) and labels (Fig. 5d-h) generated following the described protocol using the MSS simulator; a configuration sketch of these complexity factors is given after Fig. 5. Figure 5d and e provide examples of the effect of increasing the complexity by modifying the background, i.e., samples with a more diverse background but the same amount of foreground elements. Figure 5e and f exemplify an increase in complexity by foreground; specifically, an increase in complexity is achieved by introducing more foreground elements (cars and pedestrians) at various scales while keeping the same background objects. Figure 5g and h display modifications of the complexity of a scene without modifying its elements; here an increase in complexity is obtained by moving the camera closer to or further away from the scene, evenly changing the scale of all the objects.

Fig. 5
figure 5

Comparison of proposed synthetic GT labels: a-c) RGB images of the samples; d) easy-complexity image GT; e) medium-complexity image GT; f) hard-complexity image GT; g, h) same background with different foreground scale and population
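The three complexity factors and their ranges can be encoded, for instance, as in the following configuration sketch; the class and field names are hypothetical, and the actual MSS simulator interface may differ:

```python
from dataclasses import dataclass

@dataclass
class MSSComplexity:
    """Hypothetical encoding of the complexity factors of Section 3.2."""
    moving_agents: int       # vehicles + pedestrians in the scenario: 50..750
    camera_angle_deg: float  # camera rotation w.r.t. the road: -70..+70 degrees
    wearable_ratio: float    # proportion of wearable-camera sequences: 0.0..0.4

    def __post_init__(self):
        assert 50 <= self.moving_agents <= 750
        assert -70.0 <= self.camera_angle_deg <= 70.0
        assert 0.0 <= self.wearable_ratio <= 0.4

# Illustrative subsets of increasing complexity (values are examples only):
easy = MSSComplexity(moving_agents=50, camera_angle_deg=0.0, wearable_ratio=0.0)
hard = MSSComplexity(moving_agents=750, camera_angle_deg=-65.0, wearable_ratio=0.4)
```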

Table 3 details the criteria determining the complexity levels for each MSS subset. Table 4 provides a comparison with related datasets in terms of the proportion of labeled semantic classes. Table 5 presents a comparison of points of view with related datasets. As can be seen, the proposed dataset has similar proportions of labeled data compared to real/synthetic datasets and also allows ranking sequences/subsets by their complexity, which is not available in existing datasets.

Table 3 Complexity factors present on each of the generated (MSSi) and studied datasets
Table 4 Summary table comparing the generated dataset with current state of the art datasets by representation of each of the MSS classes
Table 5 Comparison of real and synthetic datasets for urban scenes segmentation

4 Weakly-supervised strategies for training

Three different strategies are studied to handle weakly supervised domain adaptation: direct combined training, fine-tuning and curriculum learning. The first two strategies are derived from [43, 59, 61]; with them, we aim at mimicking scenarios where there is little available data from the target domain to train with. Finally, we propose a curriculum learning strategy aiming at understanding how all synthetic sources can be used in conjunction to reduce the domain gap with respect to the real domain.

4.1 Combined training

Inspired by [43, 59, 61], this strategy trains with only a fraction of the real data in combination with all synthetic data. This strategy allows measuring the impact of each synthetic dataset and also which quantity of real training data is sufficient to reach an acceptable performance level. We train from scratch with a percentage of the real data varying from 5% to 100% of the original dataset, mixed with the full synthetic dataset. The results are measured by testing on the corresponding real validation set. When using combined training sets, we expect the model to learn the general concepts from simulated images and use the real samples to adapt. However, there is no scheduling nor structure in the combined training approach: samples from synthetic and real data are presented at a random pace. Therefore, the fulfillment of these expectations is not guaranteed.
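A minimal sketch of how such a combined set can be assembled in PyTorch follows; the dataset objects are placeholders for the actual synthetic and Cityscapes/Mapilliary loaders:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset

def combined_train_set(synthetic_set, real_set, real_fraction=0.05, seed=0):
    """All synthetic data plus a random fraction of the real training set."""
    g = torch.Generator().manual_seed(seed)
    n_real = int(len(real_set) * real_fraction)
    real_idx = torch.randperm(len(real_set), generator=g)[:n_real].tolist()
    return ConcatDataset([synthetic_set, Subset(real_set, real_idx)])

# Shuffling the loader yields the unstructured, random-pace presentation
# that characterizes this strategy, e.g.:
# loader = DataLoader(combined_train_set(mss_set, cityscapes_set, 0.05),
#                     batch_size=8, shuffle=True)
```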

4.2 Fine-tuning

Inspired by [43, 59, 61], this strategy consists of four stages. First, we start with a model whose backbone is randomly initialized. Second, we train the full model with synthetic data until convergence, following the procedure described in [43]. Third, we proceed by freezing the weights of the backbone and training the classifier head with the real images until convergence. Finally, we unfreeze the backbone, reduce the learning rate (lr0) by a factor of 10 and train with the real subset until the validation MIoU stalls. Figure 6 depicts a graphical representation of the fine-tuning strategy. As in the combined training strategy, we also consider different fractions of the full real datasets and perform testing using the real data validation set.

Fig. 6
figure 6

Fine-tuning strategy stages. a) Initial semantic segmentation model with randomly initialized backbone. b) Training of the full model using synthetic data. c) Backbone freezing and classifier head training with real data. d) Training of the full model using real data with a learning rate ten times smaller than in b)
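The four stages can be sketched as follows, assuming a recent torchvision-style segmentation model that exposes a .backbone attribute and a train() helper (not shown) that trains until convergence or until the validation MIoU stalls:

```python
from torchvision.models.segmentation import deeplabv3_resnet101

lr0 = 1e-2
model = deeplabv3_resnet101(weights=None, num_classes=19)  # a) random init

# b) Train the full model on synthetic data until convergence.
# train(model, synthetic_loader, lr=lr0)

# c) Freeze the backbone; train only the classifier head on real data.
for p in model.backbone.parameters():
    p.requires_grad = False
# train(model, real_loader, lr=lr0)

# d) Unfreeze, reduce the learning rate by a factor of 10, train on real data.
for p in model.backbone.parameters():
    p.requires_grad = True
# train(model, real_loader, lr=lr0 / 10)
```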

4.3 Curriculum learning

Inspired by curriculum learning for object detection and image classification [27, 30, 32, 33, 45], we propose a new curriculum strategy based on progressively feeding the different datasets and subsets to the model, sorted by a predefined complexity order. The proposed complexity is defined by the number of semantic classes present in each synthetic set and the complexity of each of the datasets. Formally, let \(X_s\), \(s \in [1,N]\), be each source dataset, with \(N\) the number of synthetic datasets, \(\{x_{s},y_{s}\}^{n_{s}}_{i=1} \in X_{s}\) the input images and their ground-truth tensors (composed of one one-hot encoded C-length vector label per image pixel), and \(n_s\) the number of labeled samples of dataset \(X_s\).

Being Ω the trainable set of parameters of the segmentation architecture and x the input sample image, the prediction probability is obtained by \(G(x;{\Omega}) = P_{x} \in (0,1)^{C}\), so that \({\sum }_{c=1}^{C} P_{x}^{h,b,c} = 1\) for any \((h,b) \in ([0,H] \times [0,B])\), where \(C\) is the number of semantic classes and \(H \times B\) is the image size.

Ω is optimized through stochastic gradient descent by minimizing the cross-entropy loss (1):

$$ {\Omega}_{t+1}={\Omega}_{t}- lr_{step} \sum\limits_{i=1}^{n}\nabla L_{seg}(x_{i}, y_{i};{\Omega}_{t}) $$

The whole training process is controlled by three hyper-parameters: γ, lr0 and β. We define a training subset \(X_{\{\beta _{s}^{step}\}_{s=1}^{N}} = \bigcup \limits _{s=1}^{N}\bigcup \limits _{i=1}^{n_{s}\times \beta _{s}^{step}}\{x_{s},y_{s}\}\) as one containing only the easiest \(\beta _{s}^{step}\) proportion of each dataset s, with β ranging from 0 to 100%; these proportions are modified at each curriculum step τ ∈ [1,N].

At each curriculum step τ ∈ [1,N], G is trained with \(X_{\{\beta _{s}^{\tau }\}_{s=1}^{N}}\) until convergence with learning rate lrτ = lr0 × γτ, being β0, lr0 and γ the method's hyper-parameters. The proportion of samples used at each step is defined by:

$$ \beta_{s}^{\tau} = \left\{\begin{array}{ll} (\beta_{0})^{(\tau-s)} & \text{if } s \leq \tau \\ 0 & \text{otherwise} \end{array}\right. $$
(2)

This dataset addition and re-training is repeated until every dataset has been used for training. Finally, once all the training stages are concluded, we perform a final fine-tuning stage, as depicted previously, with only the target real training dataset until convergence.
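A sketch of the resulting schedule, combining the proportions of (2) with the geometric learning-rate decay, is shown below (our own minimal sketch; hyper-parameter values follow the choices reported in Section 5.5):

```python
def curriculum_schedule(n_sources, beta0=0.75, lr0=1e-2, gamma=0.9):
    """Yield, per curriculum step tau, the per-source proportions of (2) and lr."""
    for tau in range(1, n_sources + 1):
        proportions = [beta0 ** (tau - s) if s <= tau else 0.0
                       for s in range(1, n_sources + 1)]
        yield tau, proportions, lr0 * gamma ** tau

for tau, props, lr in curriculum_schedule(n_sources=3):
    print(tau, [round(p, 3) for p in props], round(lr, 4))
# 1 [1.0, 0.0, 0.0] 0.009     <- newest source used in full
# 2 [0.75, 1.0, 0.0] 0.0081   <- earlier sources shrink geometrically
# 3 [0.562, 0.75, 1.0] 0.0073
```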

Algorithm 1 summarizes the steps of this training strategy.

Algorithm 1
figure a

Curriculum learning procedure for domain adaptation in semantic segmentation.

5 Experimental results

We provide an analysis of the semantic gap between real and synthetic domains (Section 5.2). Then, we validate the utility of the MSS dataset for combined training (Section 5.3) and fine-tuning (Section 5.4). We also compare the results against Synthia, the largest publicly available synthetic dataset similar in size to the proposed MSS dataset. We present the results for the proposed curriculum learning strategy (Section 5.5), which combines several synthetic datasets employed in the literature. Finally, the best achieved results are compared with the state of the art (Section 5.6), followed by a discussion of major findings and qualitative comparisons (Section 5.7).

5.1 Experimental setup

Two performance evaluation metrics are employed: Mean Intersection over Union (MIoU) and Mean Pixel Accuracy (MPA) [44]. MIoU is widely used to measure the similarity between two sets as the area of their overlap over the area of their union. MPA refers to the percentage of correctly classified pixels. For all strategies, the common and specific training parameters are detailed in Table 6. These parameters have been determined considering similar state-of-the-art proposals [10, 12, 40].

Table 6 Training configuration
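For reference, both metrics can be computed from a per-pixel confusion matrix as in the following sketch (our own minimal implementation, ignoring void/unlabeled pixel handling):

```python
import numpy as np

def confusion(pred, label, num_classes):
    """Confusion matrix: rows are ground-truth classes, columns predictions."""
    idx = label.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou(cm):
    inter = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - np.diag(cm)   # |A| + |B| - |A and B|
    return float(np.mean(inter / np.maximum(union, 1)))

def mpa(cm):
    return float(np.mean(np.diag(cm) / np.maximum(cm.sum(1), 1)))
```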

Five datasets are employed: Mapilliary and Cityscapes compose the real sources, whereas MSS, Synthia and GTAV compose the synthetic sources; see Section 2 for further details. In the different experiments, we also make use of proportions of each dataset (see Table 7) to assess the performance impact of varying-size sets of data.

Table 7 Size and proportions of datasets

5.2 Baseline: training with only real or synthetic data

As a baseline for further comparisons, we consider models trained from scratch with only one source of data. In this experiment we analyze the domain gap between pairs of train-test datasets (synthetic-real and real-real). Table 8 includes the results of training a Deeplab V3 architecture [10] using only one source dataset until the MIoU stalls on its own validation set [70]. Then, we validate on each of the real target validation sets. These results show a noticeable drop in performance when testing on a validation set different from the source one. The only source which consistently presents better results on all three test sets is the Mapilliary train set. We believe that the better transfer capabilities of the Mapilliary case are mainly due to the large size gap between the datasets; refer to Table 7 for a size comparison. When comparing synthetic sets, we find that the MSS dataset presents a smaller domain gap than the Synthia dataset (see Table 8). Additionally, it can be seen how the combination of both synthetic datasets, All synthetic, outperforms using any of the synthetic datasets alone. Despite the clear domain shift between real and synthetic data, the performances obtained training with only synthetic data do not differ drastically from the ones obtained when using different real datasets as source and target domains, a situation that has already been observed in object detection [43]. In the context of this paper, aiming to improve semantic segmentation in urban scenarios, these differences are aggravated by factors such as the high diversity in car models, street appearance and lighting conditions between source and target datasets.

Table 8 Results of training Deeplab V3 with a ResNet101 backbone [10] with one source and testing on each of the real validation sets

Impact of the synthetic dataset size As we are proposing a new dataset which includes over 200K new synthetic images, we want to ensure that the amount of generated images is relevant. To that aim, we measure the impact of employing only a subset of samples from the synthetic sets, ranging from 5% up to 100% of samples for each of the studied datasets. Figure 7 represents the impact of the synthetic dataset size on the downstream performance of the model trained on the target real data. It can be seen how, as the number of images is increased, the downstream performance also increases for all the studied synthetic datasets. However, the gain in performance is not linearly correlated with the size. For instance, for both synthetic datasets, employing 5 times more data yields a 13.2% increase in performance (from 17 to 19.5 MIoU), while employing 10 times more samples provides a 21.7% increase in performance (from 17 to 21 MIoU).

Fig. 7
figure 7

Impact of synthetic dataset size for baseline training using a DeeplabV3, tested on the Cityscapes target validation set

5.3 Combined training: concurrent synthetic-real data usage

We apply the combined training strategy defined in Section 4.1 to train a Deeplab V3 with a ResNet101 backbone [10]. We employ proportions of the synthetic datasets to assess their influence on the final performance (see Table 7). As for testing, we use only the real validation sets. Tables 9 and 10 compile the performances obtained by training with different proportions of the Cityscapes and Mapilliary datasets, respectively, in conjunction with the complete synthetic datasets; similar results are obtained when employing a Fully Convolutional Network (FCN) [40]. This initial experiment indicates that the MSS dataset can compete favorably with Synthia, especially in scenarios where less real data is provided (see Table 9), whereas the performances in Table 10 indicate a different behavior for Mapilliary. Furthermore, we can see how the drop in performance is not linear in the amount of real data employed: using 50% of the target data reduces the performance by less than 15% for both studied real datasets.

Table 9 Combined training with Cityscapes using a Deeplab V3 with a ResNet101 backbone [10], tested on the Cityscapes validation set
Table 10 Combined training with Mapilliary using a Deeplab V3 with a ResNet101 backbone [10], tested on the Mapilliary validation set

5.4 Fine-tuning: using pre-training from synthetic data

We apply the fine-tuning strategy defined in Section 4.2 to train two architectures: Deeplab V3 with a ResNet101 backbone [10] and a Fully Convolutional Network (FCN) [40]. We employ proportions of the synthetic datasets to assess their influence on the final performance, as defined in Table 7. As for testing, we use only the real validation sets. Consistently with the previous experiment, Fig. 8a and b illustrate that the proposed MSS-based dataset outperforms the Synthia dataset on both the Cityscapes and Mapilliary datasets when it is used to train a model that is later refined with real data. Furthermore, this experiment shows that the fine-tuned models provide better performance than the combined training strategy (see Tables 9 and 10) and the baseline (see Table 8). All synthetic refers to the combination of Synthia and MSS.

Fig. 8
figure 8

Fine-tuning with portions of the target train set using a Deeplab V3 [10] and a Fully Convolutional Network (FCN) [40], tested on the target validation set. The baseline corresponds to the model trained with the full target set

For both real datasets, using an initial training on synthetic data to transfer knowledge to the real domain proves beneficial when compared to the baseline case (i.e., training only with real data). Furthermore, we observe a non-linear relationship between the percentage of real data introduced and the performance gain. With only 5% of real data we can obtain up to 70% of the performance of training with the full dataset.

Impact of the synthetic dataset size

In order to measure the impact of the synthetic dataset size, Fig. 9 aggregates the impact on real-data validation performance of the number of dataset samples used together with real data, training two models, one for each of the explored synthetic datasets. We can see how employing larger datasets consistently provides better performance, hence motivating the usage of our proposed dataset.

Fig. 9
figure 9

Impact of synthetic dataset size for fine-tuning using a DeeplabV3, tested on the Cityscapes target validation set

5.5 Curriculum learning

We explore the application of a new strategy based on Curriculum Learning (see Section 4.3) to train two architectures: Deeplab V3 with a ResNet101 backbone [10] and an FCN [40]. We employ all synthetic datasets in a sequential manner, periodically including new sources according to the complexity criteria defined in Section 3.2. The defined order of the datasets is (in increasing complexity, see Table 3): MSS50, MSS100, MSS250, MSS750, Synthia and GTAV.

Two hyper-parameters are used as defined in Section 4.3: γ regulates the learning rate and β regulates the amount of images which are kept from the previous training set. Table 11 presents the hyper-parameter study using an FCN architecture trained and validated with Cityscapes as the target dataset; from this analysis we set γ = 0.9 and β = 0.75. Due to the common patterns and tendencies in performance found in previous experiments (see Sections 5.3 and 5.4), we set the same parameters for both architectures.

Table 11 Hyper-parameter study for curriculum learning using a FCN [40] model with Cityscapes as the target set

Tables 12 and 13 show the evolution of the results using an FCN and a Deeplab V3, respectively; we can see a MIoU increase as each new dataset is included, which affects all classes. For comparison with the state of the art, only Cityscapes is included; however, the results extrapolate to Mapilliary. We observe lower improvements for less representative semantic classes (e.g., sign, pedestrian and pole) until increasingly complex synthetic sources are added to the training set. We attribute this issue to the broad appearance gap between synthetic and real images. However, as more synthetic sources are added, the model is forced to look for shape similarities rather than color and texture, producing large jumps in performance once a new synthetic dataset is included. For more representative semantic classes (e.g., road, sidewalk, building and vegetation), new synthetic sets reinforce performance in two ways. Regarding the first, note that these semantic instances are heavily location biased (see Fig. 10) and that this bias is common to all datasets; therefore, including new synthetic datasets seemingly reinforces the model to rely on these location patterns rather than on appearance. Regarding the second, note that as new semantic labels are added, fewer pixels are wrongly labeled as those broader classes.

Table 12 Evolution in the training performance when each dataset is included in the curriculum learning scheme with a FCN
Table 13 Evolution in the training performance when each dataset is included in the curriculum learning scheme with a Deeplabv3
Fig. 10
figure 10

Heat-map of semantic classes location probability of the Cityscapes dataset
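Heat-maps like those of Fig. 10 can be approximated by averaging the binary ground-truth mask of each class over the dataset; a minimal sketch (our own, assuming equally sized integer label maps):

```python
import numpy as np

def location_heatmaps(gt_masks, num_classes):
    """Per-pixel probability of each class across the dataset: (C, H, W)."""
    heat, n = None, 0
    for mask in gt_masks:  # each mask: (H, W) integer label map
        onehot = np.stack([(mask == c) for c in range(num_classes)]).astype(np.float64)
        heat = onehot if heat is None else heat + onehot
        n += 1
    return heat / n
```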

Finally, the experimental results validate the curriculum hypothesis, as the DNNs trained using our curriculum outperform standard-paced random training on synthetic data alone by 36.62% and 20% MIoU when employing only the MSS dataset and both the Synthia and MSS datasets, respectively; see Tables 8 and 13. This also applies to the scenarios where only real data is employed and where real data is used in conjunction with synthetic data for training, as the baselines are surpassed by 6.1% and 7.9%, respectively, for the FCN (see Table 14) and by 3.4% and 11.9%, respectively, for DeeplabV3 (see Tables 8, 9 and 13).

Table 14 Comparison of state of the art weakly-supervised domain adaptation approaches with selected architectures on the Cityscapes validation set

5.6 Comparison with state-of-the-art methods

Table 14 compares the proposed strategies with related works employing two widely popular architectures in semantic segmentation (Deeplab and FCN). The authors of NAE [57] performed an experiment similar to our combined training (CT) and fine-tuning (FT) without the MSS dataset. We can see how our CT leads to worse performance than NAE due to having a greater amount of synthetic images, hence decreasing the ratio of real images in the complete dataset. However, the inclusion of the MSS dataset is advantageous in the FT strategy, providing richer initial weights with a 0.4 MIoU gain in performance after fine-tuning. Our curriculum strategy (CL) achieves competitive performance against state-of-the-art baselines for both the Deeplab V3 and FCN architectures, with a 2.6 MIoU gain over the baseline and a 0.6 gain over the state of the art [5], without relying on the training of a discriminator network, making our method a more stable and reliable approach when compared to the other alternatives.

Table 15 compares the best result of the analyzed strategies (i.e., curriculum learning, CL) against related work in semantic segmentation. Only convolutional-based architectures are considered in this comparison for fairness. Results are provided for the Cityscapes validation set. Compared to the other models, our proposal's main improvements are achieved on static classes. While [12, 13, 58, 79] present performances close to ours, our model has less than half the number of parameters. Finally, [68] is attention-based; hence, we have employed their provided code, publicly available in their selected framework [73], to train a Deeplab + ResNet101.

Table 15 Comparison with different state of the art supervised methods for semantic segmentation on the Cityscapes validation set

We believe the main advantage of the proposed CL strategy is a better learning of the urban scene topology. Our proposal achieves state-of-the-art performance on semantic classes which are persistently located in similar areas of the image, such as road, sidewalk, building, wall and fence. We attribute this improvement to the repeated training on different sources of urban scene images, hence reinforcing, through every iteration of the curriculum strategy, the spatial configuration of the scene. As depicted in Fig. 10, some classes present a high location bias.

5.7 Discussion

Results One of our major findings is how much scheduling the training can impact the final performance (see Table 14). Using the same DNN, training time budget and data, we are able to improve the performance by up to 8%. Secondly, for the fine-tuning strategy, pre-training on synthetic data leads to small fluctuations between different synthetic sets when the full real dataset is used. However, when there is little target data available, the gap in performance between synthetic datasets grows up to 23%. Finally, following the proposed protocol for synthetic data generation (see Section 3), we managed to generate a dataset which has proven consistently useful for training compared to the most similar publicly available synthetic dataset, Synthia (see Tables 8, 9, 10 and Fig. 8a, b).

Qualitative results of our model

Figure 11 presents a qualitative comparison between the semantic segmentations generated by the models trained with only real data, synthetic data and a combination of both. Employing the proposed MSS datasets seems to aid the model with better discrimination of smaller structures such as signs, fences and walls. Furthermore, the proposed approach appears to provide a finer-detailed segmentation when compared with the proposed FT strategies, where one can observe a fuzzier discrimination of classes such as rider (1st row), fence and wall (2nd row), bike (3rd row) and sidewalk (4th row). Figure 12 compares different state-of-the-art alternatives ([68, 82]) with our DeeplabV3 model trained employing CL. While the state-of-the-art models provide finer details on farther and smaller structures, such as gaps between traffic signs, our model predicts structures like buses, sidewalks and fences more reliably.

Fig. 11
figure 11

Qualitative results of semantic segmentation on the Cityscapes validation set. For each target image in the first column we present the GT (a). In the second to fourth columns we present the results of the proposed fine-tuning without pre-training on the MSS dataset (b), the proposed fine-tuning with pre-training on the MSS dataset (c), and the proposed curriculum (d)

Fig. 12
figure 12

Qualitative results of semantic segmentation on the Cityscapes validation set against state-of-the-art semantic segmentation proposals. For each target image in the first and second columns we present the RGB image and its respective ground-truth map (a, b). In the third and fourth columns we present the results of two state-of-the-art segmentation DNNs, ContrastiveSeg and ProtoSeg (c, d), respectively. The last column is a DeeplabV3 model trained employing our CL proposal

6 Conclusion

In this paper our contribution is threefold. First, we propose a new synthetic data generation protocol. Using the MSS simulator, we generate a new synthetic dataset for semantic segmentation which is composed of four different subsets ordered in terms of complexity, defining complexity as the amount of smaller semantic instances present in the virtual scenario. Second, we analyze the impact of introducing synthetic data using different architectures for semantic segmentation in urban scenes. We explore two different strategies to abstract synthetic data knowledge to the real domain: combined training and fine-tuning. Third, we propose a new curriculum learning strategy based on a complexity analysis of the data generated with the proposed protocol. When handling domain adaptation, we find that structuring the learning of the model leads to a significant boost in performance, with combined training as the least optimal approach, sometimes leading to worse models than using solely real data. Pre-training with synthetic data and fine-tuning with limited real images provides better results than training with all sources jointly. Moreover, a structured learning process where images are presented in an increasing-complexity manner leads to a better understanding of the scene. This progressive pacing leads to a better learning of broader structures such as roads and buildings first, allowing later epochs to focus on understanding small-sized semantic instances such as pedestrians, traffic lights and poles. This approach differs from current state-of-the-art approaches by being model agnostic; hence, it can be applied to any architecture. The results of the experiments also suggest that realism is not the only key factor of synthetic data: the content of each image and the pace at which synthetic images are shown to the model during training are also key factors barely analyzed in the literature.

In this work we have studied and validated the benefits of structuring and arranging data in a sample-based curriculum learning paradigm. As potential improvements of this work, we envision the incorporation of an explicit domain adaptation technique to further narrow the real and synthetic domains. Moreover, we use a simple, yet effective, definition of sample complexity (i.e., number of moving object instances) for a single virtual scenario. This can be extended by incorporating additional complexity factors such as multiple viewpoints, the number of semantic labels or even the usage of several virtual scenarios. Finally, it is important to highlight the unpaired number of classes between real and synthetic datasets. This mismatch leads to a performance degradation when evaluating on real data.