1 Introduction

Semantic segmentation refers to the task of classifying each pixel in a given image into its semantic category, providing pixel-level masks of images. This task is especially relevant for urban scene scenarios, where accurate pixel-wise understanding of the image can help in many applications, such as autonomous driving and robot vision.

Deep Neural Networks (DNNs) have proven their efficacy for segmentation, being the current state of the art in different tasks, such as semantic segmentation [41], X-ray lung segmentation [53], brain MR image segmentation [54] and video object segmentation [72]. However, DNN training depends on extensive amounts of labeled images, which are expensive to produce. To this end, the usage of synthetic data for training neural networks becomes an alternative to obtain large corpora at a reduced cost. Therefore, many efforts have shifted the focus onto synthetic data as a plausible solution [47, 65], using 3D environments with models of the semantic objects to classify [18, 21, 46]. Although promising, synthetic images often present a different visual appearance than real images: light reflection, color saturation and shadows make synthetic images distinguishable from real ones, especially in urban scenes, where synthetic images are not able to capture the wide variability of real images (e.g., light conditions). While many approaches have been proposed to obtain synthetic data [6, 7, 20, 25, 50], no methodological generation of synthetic data has been defined for the specific task of semantic segmentation. Thus, a methodology for obtaining synthetic data at different complexity levels would be desirable.

Regarding the visual appearance discrepancy between real and synthetic images, Domain Adaptation (DA) encompasses the class of techniques that extrapolate generalities obtained from synthetic data to the real domain. Two main research lines can be differentiated depending on whether or not real-image ground truth is available during training. Weakly Supervised Domain Adaptation focuses on abstracting knowledge from both domains [57, 69, 81], using a combination of extensive amounts of labeled data from synthetic images and a small amount of labeled real images to refine the segmentation DNN. Differently, Unsupervised Domain Adaptation (UDA) tries to generalize to the real data without relying on labeled real images by aligning features from both domains [16, 22, 63, 64, 74]. The UDA literature commonly focuses on generating domain-agnostic features by aligning outputs from different levels of the model. However, such alignment is defined for specific layers of the model, hence impeding the straightforward extrapolation of successful proposals to different architectures. Furthermore, most works follow a random image presentation during training due to the absence of a definition of sample-based complexity in semantic segmentation, even though smart image presentation protocols have proven effective in other fields [2]. Therefore, comparing the existing approaches and defining a curriculum (applicable to semantic segmentation employing different sources of synthetic and real data) would be beneficial to understand and improve performance on real data.

In this paper, we address the above-mentioned limitations for semantic segmentation (1) by proposing a protocol for synthetic data generation; (2) by analyzing popular strategies for training and transfer learning using real and synthetic data; and (3) by defining a curriculum-based strategy to effectively combine multiple sources of synthetic data. We employ the simulation tool Multi-camera System Simulator (MSS) [24] to generate synthetic sets with several configuration options (e.g., number of classes, viewpoints). The synthetic data generated by our protocol is compared against widely used datasets [46, 47] for different training strategies, such as combined training [43] and fine-tuning [57], and also for different proportions of real and synthetic data. Moreover, we propose a curriculum-based learning strategy relying on the hypothesis that an increasing-complexity data feeding strategy generalizes better to the target real data than a standard-paced (i.e., random) strategy using the same data. We take advantage of our protocol to generate datasets with increasing complexity (defined as the number of instances in the dataset) and use these datasets for curriculum learning. The experimental results show not only that the proposed generation protocol outperforms existing synthetic datasets, but also that combining different sources and structuring the training process in an incremental-complexity manner can improve state-of-the-art performance. It is noteworthy that the proposed data generation protocol and training strategies can be applied to any architecture for aligning different synthetic domains to real data, without relying on specific alignment terms.

The contributions of the proposed approach are:

  • A new design protocol for synthetic data generation based on virtual scenario simulators.

  • Identifying and comparing training strategies for weakly supervised domain adaptation in semantic segmentation, measuring the impact of different synthetic sources and different proportions of real data.

  • Proposing a new strategy based on curriculum learning for employing different sources of data, applicable to DNN-based approaches for semantic segmentation.

The paper is organized as follows. Section 2 reviews the state of the art on domain adaptation and synthetic data usage for semantic segmentation. Section 3 introduces the criteria and design protocol for synthetic data generation. Section 4 describes the selected strategies for weakly supervised domain adaptation. Section 5 presents the experimental results, including a comparison with the state of the art. Finally, concluding remarks are presented in Section 6.

2 Related work

2.1 Domain adaptation

A basic assumption in machine learning is that training and test data are sampled independently from an identical distribution. In the context of domain adaptation this assumption is not fulfilled [28]: there are two domains with clear visual discrepancies, the source data, used for training, and the target data, used for fine-grained training and testing. Therefore, direct training on the source data leads to a significant performance drop on the target test set. This hindrance is commonly known as domain shift.

Single-source domain adaptation

To alleviate the domain shift, one may extrapolate knowledge from synthetic to real images in the training process of the model. Depending on whether the ground truth of real images is available during training, this can be further classified into Unsupervised Domain Adaptation (UDA), if no labeled real images are available during training, and Weakly-Supervised Domain Adaptation (WSDA), if a small set of labeled real images is available during training. UDA frameworks employ target RGB images during training to align features from both domains [36, 63, 64, 74, 83]. However, other works show that effective extrapolation to the real domain can be obtained even without including any real image during training, by gradually increasing the complexity of the sample images in a curriculum manner [17, 27, 37]. Following this idea, we propose a synthetic dataset generation protocol to aid the straightforward implementation of an easy-to-hard image presentation for semantic segmentation. WSDA approaches [9, 57, 69, 81] deploy high-performance models for real data by adapting from abundant labeled synthetic images and scarce, insufficient labeled real data. Other approaches follow some sort of adversarial learning strategy [25, 62], characterized by the inclusion of an additional, usually small, discriminator network that tries to determine from the segmentation maps whether the input RGB image is real or synthetic. However, adversarial training is generally known to be difficult due to its instability [63]; hence, we do not adopt these approaches in this work and use them only for comparative purposes.

In different computer vision fields such as object localization, some works obtain state-of-the-art performance by defining an easy-to-hard presentation commonly known as curriculum learning [2]. Nowruzi et al. [43] studied the impact of the real data size in weakly supervised object localization. Similarly, Zheng et al. [57] studied the impact of different pacing strategies when using different ratios of real images. Following this line of work, our proposal aims at further structuring the paced presentation of different sources of data when compared to the typical fine-tuning and combined training strategies. To the best of our knowledge, there is no other similar study for semantic segmentation.

Multiple-source domain adaptation

In practice, the source labeled data may come from different domains, such as different simulators for synthetic data or day and night images for real domains; this motivates the research on Multiple Source Domain Adaptation (MSDA) techniques. However, multiple-source combined training usually leads to worse-performing models than employing one single source for training [19, 77]. In order to overcome this limitation, many authors focus on aligning features from all source domains with target domain features [3, 11, 17, 28, 37, 50, 51, 62, 63, 66, 74, 81]. Three common methods for aligning features in single and multiple source domain adaptation are:

  • Discrepancy based: These alignment frameworks focus on minimizing an explicit distance measure between features obtained in the target and source domains [60]. Various distribution discrepancy metrics have been introduced, including Maximal Mean Discrepancy (MMD) [26], Correlation Alignment [56] and Wasserstein distance [1]. MMD is currently the most widely used metric to measure the distance between two feature distributions [35].

  • Adversarial based: These proposals rely on the inclusion of a discriminator model which measures how domain-discriminative the features generated by the segmentator are. Following the typical adversarial scheme of GANs [25, 62], this training paradigm becomes a min-max game, where the segmentator model aims at fooling the discriminator. In essence, by minimizing the performance of the discriminator, the segmentator is minimizing the gap between domains in the feature space [3, 28, 50, 62, 63, 64, 74].

    Recently, this idea has been made more explicit by adversarial methods that define strategies to translate image appearance from one domain to another. These proposals tend to combine adversarial training with an additional consistency term, which measures the discrepancy between the output produced from the original image and that from the translated image [76]. Intuitively, these proposals minimize the domain shift by aligning features across the real and synthetic domains, while maximizing the performance in a shared domain.

  • Entropy based: These works minimize the entropy on the target domain. As the labels of the target images are not available during training, minimizing the entropy acts as a self-supervision mechanism. Let C be the number of classes, H and W the height and width of the input image, and Yt the one-hot encoded C-vector label. The prediction entropy is \(E = -{\sum }_{C}{\sum }_{H,W} P(X_{t})\log(P(X_{t}))\), and the classical cross-entropy loss is [11, 63]:

    $$ L_{seg} = -\sum\limits_{C}\sum\limits_{H,W} Y_{t}\log(P(X_{t})) $$
    (1)
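For illustration, the following is a minimal sketch of both losses in PyTorch (our own sketch, not an implementation from the cited works), assuming a segmentation model that outputs raw logits of shape (batch, C, H, W); the function names are illustrative:

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Prediction entropy E = -sum_C sum_{H,W} P log P on unlabeled target images."""
    p = F.softmax(logits, dim=1)            # per-pixel class probabilities P(X_t)
    log_p = F.log_softmax(logits, dim=1)    # numerically stable log P(X_t)
    return -(p * log_p).sum(dim=1).mean()   # sum over classes, mean over pixels/batch

def seg_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy L_seg of (1); labels are (batch, H, W) class indices."""
    return F.cross_entropy(logits, labels)
```

Entropy-based methods typically minimize seg_loss on labeled source images plus a weighted entropy_loss on unlabeled target images.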

Summary of domain adaptation proposals

Table 1 summarizes the explored proposals dealing with Domain Adaptation in computer vision. All the proposals for semantic segmentation tasks [3, 11, 28, 37, 50, 51, 62, 63, 66, 74, 81] employ Deeplab and/or FCN as the segmentator of choice. Hence, for a fair comparison, we employ both architectures in our experiments.

Table 1 Summary of state of art domain adaptation proposals

In particular, we propose to define a protocol to sequentially include different source domains, generalizing in a more stable and reliable manner than adversarial and entropy based frameworks [63, 64, 74]. Furthermore, compared to discrepancy based frameworks [35, 60], our protocol does not impose normality nor homogeneity, hence providing a more relaxed framework extrapolable to any MSDA problem. Alignment-free adaptation has proven useful in other computer vision fields such as object detection. Hinterstoisser et al. [27] show that effective extrapolation to the real domain can be obtained without employing any alignment metric to the real images. Specifically, domain adaptation is obtained by generating increasingly complex synthetic images, modifying the scale and point of view of a 3D model of the target object over random backgrounds, thereby generalizing to the real domain only by structuring the training. In this work, we propose a similar protocol for urban scene segmentation. We argue that by generating increasingly complex images, modifying factors such as the foreground and background scale, the capture point of view and the number and types of objects, we can define an effective curriculum that generalizes better to the real domain.

2.2 Curriculum learning

Bengio et al. [2], inspired by schooling principles, proposed to train machine learning algorithms with basic (easy) samples first and advanced (hard) samples later. In order to define which samples should be included first in the training and which last, Curriculum Learning (CL) needs to define some sort of complexity measure. This complexity can target different hyper-parameters or inputs of the training process, such as the target task and the performance measure [55]. However, complexity is typically measured on a sample basis and, as training continues, the probability of selecting hard samples for training is increased. In the context of sample-based complexity, different approaches to measure complexity have been proposed, with manual annotation [30, 45] and the performance of a teacher model (i.e., generally a model that has been trained in a standard fashion and is used to probe sample complexity) [23] as the most used strategies.

In the context of domain adaptation, sample-based curriculum learning has been effectively employed in different tasks such as object detection [27], sentiment classification [80] and image classification [71]. These sample-based curricula can be further classified into sampling-focused [27] and weighting-focused [71, 80]. Weighting-focused curricula attempt to weight the importance of each sample in a batch depending on the training stage, e.g., giving smaller weights to harder samples at its beginning. Luyu et al. [71] and Sicheng et al. [80] propose to train a Manager function which, given an input batch, outputs a scalar weight for each sample. These frameworks have the drawback of the computational overhead required for training the Manager. On the other hand, sampling-focused curricula use predefined sample complexities and attempt to automatically select the optimal set of samples given the current status of the model, in essence defining binary weights for each sample. Hinterstoisser et al. [27] define a sampling curriculum by generating increasingly complex synthetic images for object localization. By formulating object localization in terms of the scale and rotation angle of a 3D model of a target object, they are capable of generating a sample-based curriculum in which the sample complexity is quantified by these factors. In a similar manner, we propose to generate a dataset structured in different levels of sample complexity to define a sample-based curriculum for semantic segmentation.
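As an illustration of a sampling-focused curriculum (a sketch under our own assumptions, not the exact procedure of [27]), samples can carry a precomputed complexity score and the pool of eligible samples can grow with training progress; the linear pacing and the initial 20% pool size below are arbitrary choices:

```python
def curriculum_pool(samples, complexities, progress):
    """Return the easy-to-hard training pool for a given progress in [0, 1]."""
    # Rank samples from easiest to hardest by their precomputed complexity.
    ranked = [s for _, s in sorted(zip(complexities, samples), key=lambda t: t[0])]
    # Start from the easiest 20% of samples and linearly unlock the rest.
    cutoff = max(1, int(len(ranked) * (0.2 + 0.8 * progress)))
    return ranked[:cutoff]

# Usage: at each epoch, draw batches uniformly from the unlocked pool, e.g.:
# pool = curriculum_pool(train_samples, scores, progress=epoch / num_epochs)
```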

Although employing some sort of curriculum in semantic segmentation has already been attempted, a sample-based curriculum for semantic segmentation is yet unattempted. The works in [74, 75] attempt a curriculum for semantic segmentation by progressively refining the output of the segmentator from the label distribution to a pixel-wise classification; this curriculum is defined on a task level, rather than on a sample basis. Furthermore, they focus on single-source unsupervised domain adaptation, in contrast to our multi-source weakly supervised domain adaptation framework.

2.3 Synthetic data generation for semantic segmentation

Generative-based

Some authors propose to enhance the realism of synthetic images from simulators to alleviate the domain shift [7, 59, 69]. To this aim, a style transfer architecture is trained to generate new images, following the generator-discriminator approach established by [25]. Although promising, one problem remains unsolved: the generator network can hallucinate new objects [6] which, as the ground truth is not modified, will not be present in the ground-truth map. Furthermore, this approach requires additional computational effort to train the generator DNN and currently presents worse performance than training with synthetic data from simulators [76].

Simulation-based

Two different sources of synthetic data are commonly used: GTAV [46] and Synthia [47]. Figure 1 includes visual examples of GTAV, Synthia and the proposed MSS datasets, illustrating distinct light reflection, textures and design in their representations. GTAV is composed of 25K images from the game Grand Theft Auto V. In contrast with the other synthetic datasets, GTAV is composed of individual images rather than video sequences. Synthia is generated with a virtual camera placed on a virtual car driving through the city under different weather conditions. All synthetic datasets present images with an appearance aligned to sequences of the Cityscapes dataset. However, GTAV and Synthia do not provide a predefined structure of complexity, impeding the formalization of an incremental easy-to-hard presentation of images [2, 27]. In this situation, the MSS provides a functionality to address the above-mentioned shortcoming. Moreover, while the code for annotating a GTAV-like dataset is publicly available, one has to buy the original game in order to render and annotate images, hence impeding the universal usage of the engine.

Fig. 1
figure 1

Images of GTAV, Synthia and MSS datasets with different capture points of view and spatial distributions

2.4 Real datasets for semantic segmentation

In semantic segmentation, three real datasets are usually selected as target domain: Kitti [21], Cityscapes [14] and Mapilliary [42]; Fig. 2 includes visual examples for these real datasets. Despite all depicting urban scenarios, these datasets present inherent visual discrepancies attributed to the different geographic location of each dataset, the car models, the street disposition, the traffic lights and the building architecture. Furthermore, each dataset follows distinct design criteria. Kitti provides a smaller dataset with frames captured from the top of a car, unlike the other datasets, which were filmed with a camera inside a car. Kitti visually differs not only because of the camera position, but also due to the lighting conditions: burnt (overexposed) patches can be found on some of the instances due to a higher camera exposure and drastic light changes when the car turns to face the sunlight directly. Cityscapes was generated by filming with a camera inside a Mercedes while driving through different German cities. This design implicitly brings unique biases, such as having the Mercedes logo and the front of the car always at the bottom of the picture. In addition, due to the camera position, little to no sky is present in the frames, in contrast with the other real datasets. Mapilliary, on the other hand, was generated by filming from different points of view. Most of the sequences were filmed inside a car looking straight through the windshield; however, different capturing angles were used, in contrast with Cityscapes, in which the position and angle are consistent throughout the dataset. In addition, some sequences are filmed from a pedestrian, motorcycle or touristic bus point of view. Almost 90% of the Mapilliary dataset was filmed from road/sidewalk views in urban areas; the remaining sequences are from highways, rural areas and off-road. When comparing real datasets, further discrepancies can be found, such as Cityscapes presenting fewer poles than Mapilliary. This is due to Cityscapes depicting cities where the wiring is located underground, unlike Mapilliary, which is obtained from cities where the wiring tends to be supported by utility poles. Due to the small size of Kitti, 500 images, it is typically only used for testing in the literature. In this work we follow this pattern by only using this dataset to assess some hypotheses and not for training.

Fig. 2
figure 2

Images of Kitti, Cityscapes and Mapilliary datasets with different capture points of view and spatial distributions

Figure 3 illustrates how state-of-the-art datasets lack a predefined structure of complexity for semantic segmentation, as all datasets are captured using similar points of view with similar object scales and scene distributions.

Fig. 3
figure 3

Samples of GT labels of existing popular semantic segmentation datasets for urban scenes: Cityscapes, Mapilliary, GTAV, Synthia. Top row presents synthetic images, bottom row presents real images

3 Synthetic dataset based on the MSS simulator

As previously mentioned, current real and synthetic datasets lack a procedure for the generation of images of increasing complexity. Here, we discuss the criteria for such generation and the obtained synthetic dataset.

3.1 Design criteria

Before creating the dataset, we need to understand how the point of view of the camera may impact the extrapolation to the real domain, which helps to define the design criteria for the MSS dataset. The MSS simulator offers a high degree of freedom for camera placement by allowing wearable cameras on moving objects such as helicopters, cars or pedestrians, and also fixed cameras on specific points in the virtual city. We focus on aligning the point of view to those of the target (real) datasets, so that synthetic sequences look similar to the real ones. Table 2 includes the results of an early experiment training a Deeplab V3 architecture [12] with a dataset composed of one synthetic sequence and a small subset of real images. This experiment allows understanding which capturing point of view has the smallest domain gap with respect to the real domain. As seen in Table 2, we find that fixed cameras and wearable cameras from a car's egocentric point of view provide greater improvements than the other alternatives. Meanwhile, wearable cameras from pedestrian, helicopter and bus points of view provide views which are not present in the target sets; therefore, training with these sequences yields worse performance on the target validation set.

Table 2 Preliminary study on the impact of the inclusion of each type of sequence to a 5% random selection of the Cityscapes and Mapilliary train sets for training a Deeplab V3 architecture [12]

Further analyzing the impact of wearable and fixed cameras (see Fig. 4), we find that wearable cameras provide more diversity due to the changing background. However, there is a scarcity of some urban elements that are generally less common than straight road sections in cities, such as turns, roundabouts and intersections. This scarcity results in models with poor performance on unseen spatial allocations, such as intersections where the sidewalk is divided by a road lane without continuation. On the other hand, placing fixed cameras on less common spatial allocations leads to a better representation of them; however, it results in models less accurate on common scenarios compared to the one trained with car wearable-camera sequences.

Fig. 4
figure 4

RGB validation images (left column), results of training with fixed cameras (middle column) and wearable cameras from cars (right column)

The proposed dataset contains sequences of the following types: fixed, car and pedestrian. These sequences are created according to the proposed protocol explained in the next section.

3.2 Protocol

The MSS dataset is structured into subsets by the amount of objects present in the virtual scenario, with the aim of creating subsets of increasing complexity. This protocol is inspired by [27], applied to object detection, which accomplishes an increase in performance by sorting the order in which images are shown to the model during training. We hypothesize that learning the general structure of urban scenes can be facilitated by starting with easy examples containing few hard instances such as cars, poles and pedestrians. Later, the learned urban layout can be refined with more complex examples. This data ordering gives the model a distinct advantage over a protocol where the model needs to understand the whole structure from scratch. We captured sequences in the virtual scenario with different points of view and different amounts of cars and pedestrians present in the virtual scenario, allowing a classification of the sequences by point of view and amount of cars and pedestrians. This categorization of the sequences provides a straightforward implementation of a learning protocol where complexity is periodically increased through the inclusion of more complex training examples.

Complexity is parametrized, in order of importance, as follows. First, the amount of moving agents in the scenario, i.e., the amount of vehicles and pedestrians acting in the virtual city while filming the sequence. This parameter regulates the maximum amount of non-static objects present per image, ranging from 50 to 750. Second, the amount of included points of view. Straight poses are easier to understand for CNNs compared to rotations [27]. Therefore, straight views are considered easier and are predominant in the less complex datasets, while more complex ones include different object scales and different points of view. The harder the sequence is intended to be, the further the camera is placed, in terms of degrees with respect to the road and meters from the agents, ranging from +70 to -70 degrees. This change in the camera angle affects the shape and appearance of objects, hence increasing the complexity [27]. Finally, the predominant spatial distribution of the sequences. The spatial distribution is graded by the type of sequences employed, ranging from 0 to 40% of wearable cameras. Pedestrian wearable cameras predominantly include buildings and sidewalks rather than the predominant centered-road distribution. Furthermore, pedestrian wearable cameras present wider rotation freedom when compared to a car, which can only turn at specific points of the scenario, making the sequences less stable and harder to extrapolate to the general spatial distribution found when driving a car (see Table 2). Figure 5 includes RGB images (Fig. 5a-c) and labels (Fig. 5d-h) generated following the described protocol using the MSS simulator; a configuration sketch of these complexity factors is given after Fig. 5. Figure 5d and e provide examples of the effect of increasing the complexity by modifying the background, i.e., samples with a more diverse background but the same amount of foreground elements. Figure 5e and f exemplify an increase in complexity by foreground; specifically, an increase in complexity is achieved by introducing more foreground elements (cars and pedestrians) at various scales while keeping the same background objects. Figure 5g and h display modifications of the complexity of a scene without modifying its elements; here an increase in complexity is obtained by moving the camera closer to or further away from the scene, evenly changing the scale of all the objects.

Fig. 5
figure 5

Comparison of proposed synthetic GT labels: a-c) RGB images of the samples; d) easy-complexity image GT; e) medium-complexity image GT; f) hard-complexity image GT; g, h) same background with different foreground scale and population
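The three complexity factors and their ranges can be encoded, for instance, as in the following configuration sketch; the class and field names are hypothetical, and the actual MSS simulator interface may differ:

```python
from dataclasses import dataclass

@dataclass
class MSSComplexity:
    """Hypothetical encoding of the complexity factors of Section 3.2."""
    moving_agents: int       # vehicles + pedestrians in the scenario: 50..750
    camera_angle_deg: float  # camera rotation w.r.t. the road: -70..+70 degrees
    wearable_ratio: float    # proportion of wearable-camera sequences: 0.0..0.4

    def __post_init__(self):
        assert 50 <= self.moving_agents <= 750
        assert -70.0 <= self.camera_angle_deg <= 70.0
        assert 0.0 <= self.wearable_ratio <= 0.4

# Illustrative subsets of increasing complexity (values are examples only):
easy = MSSComplexity(moving_agents=50, camera_angle_deg=0.0, wearable_ratio=0.0)
hard = MSSComplexity(moving_agents=750, camera_angle_deg=-65.0, wearable_ratio=0.4)
```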

Table 3 details the criteria determining the complexity levels for each MSS subset. Table 4 provides a comparison with related datasets in terms of the proportion of labeled semantic classes. Table 5 presents a comparison of points of view with related datasets. As can be seen, the proposed dataset has similar proportions of labeled data compared to real/synthetic datasets and also allows ranking sequences/subsets by their complexity, which is not available in existing datasets.

Table 3 Complexity factors present on each of the generated (MSSi) and studied datasets
Table 4 Summary table comparing the generated dataset with current state of the art datasets by representation of each of the MSS classes
Table 5 Comparison of real and synthetic datasets for urban scenes segmentation

4 Weakly-supervised strategies for training

Three different strategies are studied to handle weakly supervised domain adaptation: direct combined training, fine-tuning and curriculum learning. The first two strategies are derived from [43, 59, 61]; with them, we aim at mimicking scenarios where there is little available data from the target domain to train with. Finally, we propose a curriculum learning strategy aiming at understanding how all synthetic sources can be used in conjunction to reduce the domain gap with respect to the real domain.

4.1 Combined training

Inspired by [43, 59, 61], this strategy trains with only a fraction of the real data in combination with all synthetic data. This strategy allows measuring the impact of each synthetic dataset and also which quantity of real training data is sufficient to reach an acceptable performance level. We train from scratch with a percentage of the real data varying from 5% to 100% of the original dataset, mixed with the full synthetic dataset. The results are measured by testing on the corresponding real validation set. When using combined training sets, we expect the model to learn the general concepts from simulated images and use the real samples to adapt. However, there is no scheduling nor structure in the combined training approach: samples from synthetic and real data are presented at a random pace. Therefore, the fulfillment of these expectations is not guaranteed.
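A minimal sketch of how such a combined set can be assembled in PyTorch follows; the dataset objects are placeholders for the actual synthetic and Cityscapes/Mapilliary loaders:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset

def combined_train_set(synthetic_set, real_set, real_fraction=0.05, seed=0):
    """All synthetic data plus a random fraction of the real training set."""
    g = torch.Generator().manual_seed(seed)
    n_real = int(len(real_set) * real_fraction)
    real_idx = torch.randperm(len(real_set), generator=g)[:n_real].tolist()
    return ConcatDataset([synthetic_set, Subset(real_set, real_idx)])

# Shuffling the loader yields the unstructured, random-pace presentation
# that characterizes this strategy, e.g.:
# loader = DataLoader(combined_train_set(mss_set, cityscapes_set, 0.05),
#                     batch_size=8, shuffle=True)
```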

4.2 Fine-tuning

Inspired by [43, 59, 61], this strategy consists of four stages. First, we start with a model whose backbone is randomly initialized. Second, we train the full model with synthetic data until convergence, following the procedure described in [43]. Third, we proceed by freezing the weights of the backbone and training the classifier head with the real images until convergence. Finally, we unfreeze the backbone, reduce the learning rate (lr0) by a factor of 10 and train with the real subset until the validation MIoU stalls. Figure 6 depicts a graphical representation of the fine-tuning strategy. As in the combined training strategy, we also consider different fractions of the full real datasets and perform testing using the real data validation set.

Fig. 6
figure 6

Fine-tuning strategy stages. a) Initial semantic segmentation model with randomly initialized backbone. b) Training of the full model using synthetic data. c) Backbone freezing and classifier head training with real data. d) Training of the full model using real data with a learning rate ten times smaller than in b)
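The four stages can be sketched as follows, assuming a recent torchvision-style segmentation model that exposes a .backbone attribute and a train() helper (not shown) that trains until convergence or until the validation MIoU stalls:

```python
from torchvision.models.segmentation import deeplabv3_resnet101

lr0 = 1e-2
model = deeplabv3_resnet101(weights=None, num_classes=19)  # a) random init

# b) Train the full model on synthetic data until convergence.
# train(model, synthetic_loader, lr=lr0)

# c) Freeze the backbone; train only the classifier head on real data.
for p in model.backbone.parameters():
    p.requires_grad = False
# train(model, real_loader, lr=lr0)

# d) Unfreeze, reduce the learning rate by a factor of 10, train on real data.
for p in model.backbone.parameters():
    p.requires_grad = True
# train(model, real_loader, lr=lr0 / 10)
```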

4.3 Curriculum learning

Inspired by curriculum learning for object detection and image classification [27, 30, 32, 33, 45], we propose a new curriculum strategy based on progressively feeding the different datasets and subsets to the model, sorted by a predefined complexity order. The proposed complexity is defined by the number of semantic classes present in each synthetic set and the complexity of each of the datasets. Formally, let \(X_s\), \(s \in [1,N]\), be each source dataset, with \(N\) the number of synthetic datasets, \(\{x_{s},y_{s}\}^{n_{s}}_{i=1} \in X_{s}\) the input images and their ground-truth tensors (composed of one one-hot encoded C-length vector label per image pixel), and \(n_s\) the number of labeled samples of dataset \(X_s\).

Being Ω the trainable set of parameters of the segmentation architecture and x the input sample image, the prediction probability is obtained by \(G(x;{\Omega}) = P_{x} \in (0,1)^{C}\), so that \({\sum }_{c=1}^{C} P_{x}^{h,b,c} = 1\) for any \((h,b) \in ([0,H] \times [0,B])\), where \(C\) is the number of semantic classes and \(H \times B\) is the image size.

Ω is optimized through stochastic gradient descent by minimizing the cross-entropy loss (1):

$$ {\Omega}_{t+1}={\Omega}_{t}- lr_{step} \sum\limits_{i=1}^{n}\nabla L_{seg}(x_{i}, y_{i};{\Omega}_{t}) $$

The whole training process is controlled by three hyper-parameters: γ, lr0 and β. We define a training subset \(X_{\{\beta _{s}^{step}\}_{s=1}^{N}} = \bigcup \limits _{s=1}^{N}\bigcup \limits _{i=1}^{n_{s}\times \beta _{s}^{step}}\{x_{s},y_{s}\}\) as one containing only the easiest \(\beta _{s}^{step}\) proportion of each dataset s, with β ranging from 0 to 100%; these proportions are modified at each curriculum step τ ∈ [1,N].

At each curriculum step τ ∈ [1,N], G is trained with \(X_{\{\beta _{s}^{\tau }\}_{s=1}^{N}}\) until convergence with learning rate lrτ = lr0 × γτ, being β0, lr0 and γ the method's hyper-parameters. The proportion of samples used at each step is defined by:

$$ \beta_{s}^{\tau} = \left\{\begin{array}{ll} (\beta_{0})^{(\tau-s)} & \text{if } s \leq \tau \\ 0 & \text{otherwise} \end{array}\right. $$
(2)

This dataset addition and re-training is repeated until every dataset has been used for training. Finally, once all the training stages are concluded, we perform a final fine-tuning stage, as depicted previously, with only the target real training dataset until convergence.
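A sketch of the resulting schedule, combining the proportions of (2) with the geometric learning-rate decay, is shown below (our own minimal sketch; hyper-parameter values follow the choices reported in Section 5.5):

```python
def curriculum_schedule(n_sources, beta0=0.75, lr0=1e-2, gamma=0.9):
    """Yield, per curriculum step tau, the per-source proportions of (2) and lr."""
    for tau in range(1, n_sources + 1):
        proportions = [beta0 ** (tau - s) if s <= tau else 0.0
                       for s in range(1, n_sources + 1)]
        yield tau, proportions, lr0 * gamma ** tau

for tau, props, lr in curriculum_schedule(n_sources=3):
    print(tau, [round(p, 3) for p in props], round(lr, 4))
# 1 [1.0, 0.0, 0.0] 0.009     <- newest source used in full
# 2 [0.75, 1.0, 0.0] 0.0081   <- earlier sources shrink geometrically
# 3 [0.562, 0.75, 1.0] 0.0073
```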

Algorithm 1 summarizes the steps of this training strategy.

Algorithm 1
figure a

Curriculum learning procedure for domain adaptation in semantic segmentation.

5 Experimental results

We provide an analysis of the semantic gap between real and synthetic domains (Section 5.2). Then, we validate the utility of the MSS dataset for combined training (Section 5.3) and fine-tuning (Section 5.4). We also compare the results against Synthia, the largest publicly available synthetic dataset similar in size to the proposed MSS dataset. We present the results for the proposed curriculum learning strategy (Section 5.5), which combines several synthetic datasets employed in the literature. Finally, the best achieved results are compared with the state of the art (Section 5.6), followed by a discussion of major findings and qualitative comparisons (Section 5.7).

5.1 Experimental setup

Two performance evaluation metrics are employed: Mean Intersection over Union (MIoU) and Mean Pixel Accuracy (MPA) [44]. MIoU is widely used to measure the similarity between two sets as the area of their overlap over the area of their union. MPA refers to the percentage of correctly classified pixels. For all strategies, the common and specific training parameters are detailed in Table 6. These parameters have been determined considering similar state-of-the-art proposals [10, 12, 40].

Table 6 Training configuration
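For reference, both metrics can be computed from a per-pixel confusion matrix as in the following sketch (our own minimal implementation, ignoring void/unlabeled pixel handling):

```python
import numpy as np

def confusion(pred, label, num_classes):
    """Confusion matrix: rows are ground-truth classes, columns predictions."""
    idx = label.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou(cm):
    inter = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - np.diag(cm)   # |A| + |B| - |A and B|
    return float(np.mean(inter / np.maximum(union, 1)))

def mpa(cm):
    return float(np.mean(np.diag(cm) / np.maximum(cm.sum(1), 1)))
```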

Five datasets are employed: Mapilliary and Cityscapes compose the real sources, whereas MSS, Synthia and GTAV compose the synthetic sources; see Section 2 for further details. In the different experiments, we also make use of proportions of each dataset (see Table 7) to assess the performance impact of varying-size sets of data.

Table 7 Size and proportions of datasets

5.2 Baseline: training with only real or synthetic data

As a baseline for further comparisons, we consider models trained from scratch with only one source of data. In this experiment we analyze the domain gap between pairs of train-test datasets (synthetic-real and real-real). Table 8 includes the results of training a Deeplab V3 architecture [10] using only one source dataset until the MIoU stalls on its own validation set [70]. Then, we validate on each of the real target validation sets. These results show a noticeable drop in performance when testing on a validation set different from the source one. The only source which consistently presents better results on all three test sets is the Mapilliary train set. We believe that the better transfer capabilities of the Mapilliary case are mainly due to the large size gap between the datasets; refer to Table 7 for a size comparison. When comparing synthetic sets, we find that the MSS dataset presents a smaller domain gap than the Synthia dataset (see Table 8). Additionally, it can be seen how the combination of both synthetic datasets, All synthetic, outperforms using any of the synthetic datasets alone. Despite the clear domain shift between real and synthetic data, the performances obtained training with only synthetic data do not differ drastically from the ones obtained when using different real datasets as source and target domains, a situation that has already been observed in object detection [43]. In the context of this paper, aiming to improve semantic segmentation in urban scenarios, these differences are aggravated by factors such as the high diversity in car models, street appearance and lighting conditions between source and target datasets.

Table 8 Results of training Deeplab V3 with a ResNet101 backbone [10] with one source and testing on each of the real validation sets

Impact of the synthetic dataset size As we are proposing a new dataset which includes over 200K new synthetic images, we want to ensure that the amount of generated images is relevant. To that aim, we measure the impact of employing only a subset of samples from the synthetic sets, ranging from 5% up to 100% of samples for each of the studied datasets. Figure 7 represents the impact of the synthetic dataset size on the downstream performance of the model trained on the target real data. It can be seen how, as the number of images is increased, the downstream performance also increases for all the studied synthetic datasets. However, the gain in performance is not linearly correlated with the size. For instance, for both synthetic datasets, employing 5 times more data yields a 13.2% increase in performance (from 17 to 19.5 MIoU), while employing 10 times more samples provides a 21.7% increase in performance (from 17 to 21 MIoU).

Fig. 7
figure 7

Impact of synthetic dataset size for baseline training using a DeeplabV3, tested on the Cityscapes target validation set

5.3 Combined training: concurrent synthetic-real data usage

We apply the combined training strategy defined in Section 4.1 to train a Deeplab V3 with a ResNet101 backbone [10]. We employ proportions of the synthetic datasets to assess their influence on the final performance (see Table 7). As for testing, we use only the real validation sets. Tables 9 and 10 compile the performances obtained by training with different proportions of the Cityscapes and Mapilliary datasets, respectively, in conjunction with the complete synthetic datasets; similar results are obtained when employing a Fully Convolutional Network (FCN) [40]. This initial experiment indicates that the MSS dataset can compete favorably with Synthia, especially in scenarios where less real data is provided (see Table 9), whereas the performances in Table 10 indicate a different behavior for Mapilliary. Furthermore, we can see how the drop in performance is not linear in the amount of real data employed: using 50% of the target data reduces the performance by less than 15% for both studied real datasets.

Table 9 Combined training with Cityscapes using a Deeplab V3 with a ResNet101 backbone [10], tested on the Cityscapes validation set
Table 10 Combined training with Mapilliary using a Deeplab V3 with a ResNet101 backbone [10], tested on the Mapilliary validation set

5.4 Fine-tuning: using pre-training from synthetic data

We apply the fine-tuning strategy defined in Section 4.2 to train two architectures: Deeplab V3 with a ResNet101 backbone [10] and a Fully Convolutional Network (FCN) [40]. We employ proportions of the synthetic datasets to assess their influence on the final performance, as defined in Table 7. As for testing, we use only the real validation sets. Consistently with the previous experiment, Fig. 8a and b illustrate that the proposed MSS-based dataset outperforms the Synthia dataset on both the Cityscapes and Mapilliary datasets when it is used to train a model that is later refined with real data. Furthermore, this experiment shows that the fine-tuned models provide better performance than the combined training strategy (see Tables 9 and 10) and the baseline (see Table 8). All synthetic refers to the combination of Synthia and MSS.

Fig. 8
figure 8

Fine-tuning with portions of the target train set using a Deeplab V3 [10] and a Fully Convolutional Network (FCN) [40], tested on the target validation set. The baseline corresponds to the model trained with the full target set

For both real datasets, using an initial training on synthetic data to transfer knowledge to the real domain proves beneficial when compared to the baseline case (i.e., training only with real data). Furthermore, we observe a non-linear relationship between the percentage of real data introduced and the performance gain. With only 5% of real data we can obtain up to 70% of the performance of training with the full dataset.

Impact of the synthetic dataset size

In order to measure the impact of the synthetic dataset size, Fig. 9 aggregates the impact on real-data validation performance of the number of dataset samples used together with real data, training two models, one for each of the explored synthetic datasets. We can see how employing larger datasets consistently provides better performance, hence motivating the usage of our proposed dataset.

Fig. 9
figure 9

Impact of synthetic dataset size for fine-tuning using a DeeplabV3, tested on the Cityscapes target validation set

5.5 Curriculum learning

We explore the application of a new strategy based on Curriculum Learning (see Section 4.3) to train two architectures: Deeplab V3 with a ResNet101 backbone [10] and an FCN [40]. We employ all synthetic datasets in a sequential manner, periodically including new sources according to the complexity criteria defined in Section 3.2. The defined order of the datasets is (in increasing complexity, see Table 3): MSS50, MSS100, MSS250, MSS750, Synthia and GTAV.

Two hyper-parameters are used as defined in Section 4.3: γ regulates the learning rate and β regulates the amount of images which are kept from the previous training set. Table 11 presents the hyper-parameter study using an FCN architecture trained and validated with Cityscapes as the target dataset; from this analysis we set γ = 0.9 and β = 0.75. Due to the common patterns and tendencies in performance found in previous experiments (see Sections 5.3 and 5.4), we set the same parameters for both architectures.

Table 11 Hyper-parameter study for curriculum learning using a FCN [40] model with Cityscapes as the target set

Tables 12 and 13 show the evolution of the results using an FCN and a Deeplab V3, respectively; we can see a MIoU increase as each new dataset is included, which affects all classes. For comparison with the state of the art, only Cityscapes is included; however, the results extrapolate to Mapilliary. We observe lower improvements for less representative semantic classes (e.g., sign, pedestrian and pole) until increasingly complex synthetic sources are added to the training set. We attribute this issue to the broad appearance gap between synthetic and real images. However, as more synthetic sources are added, the model is forced to look for shape similarities rather than color and texture, producing large jumps in performance once a new synthetic dataset is included. For more representative semantic classes (e.g., road, sidewalk, building and vegetation), new synthetic sets reinforce performance in two ways. Regarding the first, note that these semantic instances are heavily location biased (see Fig. 10) and that this bias is common to all datasets; therefore, including new synthetic datasets seemingly reinforces the model to rely on these location patterns rather than on appearance. Regarding the second, note that as new semantic labels are added, fewer pixels are wrongly labeled as those broader classes.

Table 12 Evolution in the training performance when each dataset is included in the curriculum learning scheme with a FCN
Table 13 Evolution in the training performance when each dataset is included in the curriculum learning scheme with a Deeplabv3
Fig. 10
figure 10

Heat-map of semantic classes location probability of the Cityscapes dataset
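Heat-maps like those of Fig. 10 can be approximated by averaging the binary ground-truth mask of each class over the dataset; a minimal sketch (our own, assuming equally sized integer label maps):

```python
import numpy as np

def location_heatmaps(gt_masks, num_classes):
    """Per-pixel probability of each class across the dataset: (C, H, W)."""
    heat, n = None, 0
    for mask in gt_masks:  # each mask: (H, W) integer label map
        onehot = np.stack([(mask == c) for c in range(num_classes)]).astype(np.float64)
        heat = onehot if heat is None else heat + onehot
        n += 1
    return heat / n
```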

Finally, the experimental results validate the curriculum hypothesis, as the DNNs trained using our curriculum outperform standard-paced random training on synthetic data alone by 36.62% and 20% MIoU when employing only the MSS dataset and both the Synthia and MSS datasets, respectively; see Tables 8 and 13. This also applies to the scenarios where only real data is employed and where real data is used in conjunction with synthetic data for training, as the baselines are surpassed by 6.1% and 7.9%, respectively, for the FCN (see Table 14) and by 3.4% and 11.9%, respectively, for DeeplabV3 (see Tables 8, 9 and 13).

Table 14 Comparison of state of the art weakly-supervised domain adaptation approaches with selected architectures on the Cityscapes validation set

5.6 Comparison with state-of-the-art methods

Table 14 compares the proposed strategies with related works employing two widely popular architectures in semantic segmentation (Deeplab and FCN). The authors of NAE [57] performed an experiment similar to our combined training (CT) and fine-tuning (FT) without the MSS dataset. We can see how our CT leads to worse performance than NAE due to having a greater amount of synthetic images, hence decreasing the ratio of real images in the complete dataset. However, the inclusion of the MSS dataset is advantageous in the FT strategy, providing richer initial weights with a 0.4 MIoU gain in performance after fine-tuning. Our curriculum strategy (CL) achieves competitive performance against state-of-the-art baselines for both the Deeplab V3 and FCN architectures, with a 2.6 MIoU gain over the baseline and a 0.6 gain over the state of the art [5], without relying on the training of a discriminator network, making our method a more stable and reliable approach when compared to the other alternatives.

Table 15 compares the best result of the analyzed strategies (i.e., curriculum learning, CL) against related work in semantic segmentation. Only convolutional-based architectures are considered in this comparison for fairness. Results are provided for the Cityscapes validation set. Compared to the other models, our proposal's main improvements are achieved on static classes. While [12, 13, 58, 79] present performances close to ours, our model has less than half the number of parameters. Finally, [68] is attention-based; hence, we have employed their provided code, publicly available in their selected framework [73], to train a Deeplab + ResNet101.

Table 15 Comparison with different state of the art supervised methods for semantic segmentation on the Cityscapes validation set

We believe the main advantage of the proposed CL strategy is a better learning of the urban scene topology. Our proposal achieves state-of-the-art performance on semantic classes which are persistently located in similar areas of the image, such as road, sidewalk, building, wall and fence. We attribute this improvement to the repeated training on different sources of urban scene images, hence reinforcing, through every iteration of the curriculum strategy, the spatial configuration of the scene. As depicted in Fig. 10, some classes present a high location bias.

5.7 Discussion

Results One of our major findings is how much scheduling the training can impact the final performance (see Table 14). Using the same DNN, training time budget and data, we are able to improve the performance by up to 8%. Secondly, for the fine-tuning strategy, pre-training on synthetic data leads to small fluctuations between different synthetic sets when the full real dataset is used. However, when there is little target data available, the gap in performance between synthetic datasets grows up to 23%. Finally, following the proposed protocol for synthetic data generation (see Section 3), we managed to generate a dataset which has proven consistently useful for training compared to the most similar publicly available synthetic dataset, Synthia (see Tables 8, 9, 10 and Fig. 8a, b).

Qualitative results of our model

Figure 11 presents a qualitative comparison between the semantic segmentations generated by the models trained with only real data, synthetic data and a combination of both. Employing the proposed MSS datasets seems to aid the model with better discrimination of smaller structures such as signs, fences and walls. Furthermore, the proposed approach appears to provide a finer-detailed segmentation when compared with the proposed FT strategies, where one can observe a fuzzier discrimination of classes such as rider (1st row), fence and wall (2nd row), bike (3rd row) and sidewalk (4th row). Figure 12 compares different state-of-the-art alternatives ([68, 82]) with our DeeplabV3 model trained employing CL. While the state-of-the-art models provide finer details on farther and smaller structures, such as gaps between traffic signs, our model predicts structures like buses, sidewalks and fences more reliably.

Fig. 11
figure 11

Qualitative results of semantic segmentation on the Cityscapes validation set. For each target image in the first column we present the GT (a). In the second to fourth columns we present the results of the proposed fine-tuning without pre-training on the MSS dataset (b), the proposed fine-tuning with pre-training on the MSS dataset (c), and the proposed curriculum (d)

Fig. 12
figure 12

Qualitative results of semantic segmentation on the Cityscapes validation set against state-of-the-art semantic segmentation proposals. For each target image in the first and second columns we present the RGB image and its respective ground-truth map (a, b). In the third and fourth columns we present the results of two state-of-the-art segmentation DNNs, ContrastiveSeg and ProtoSeg (c, d), respectively. The last column is a DeeplabV3 model trained employing our CL proposal

6 Conclusion

In this paper our contribution is threefold. First, we propose a new synthetic data generation protocol. Using the MSS simulator, we generate a new synthetic dataset for semantic segmentation which is composed of four different subsets ordered in terms of complexity, defining complexity as the amount of smaller semantic instances present in the virtual scenario. Second, we analyze the impact of introducing synthetic data using different architectures for semantic segmentation in urban scenes. We explore two different strategies to abstract synthetic data knowledge to the real domain: combined training and fine-tuning. Third, we propose a new curriculum learning strategy based on a complexity analysis of the data generated with the proposed protocol. When handling domain adaptation, we find that structuring the learning of the model leads to a significant boost in performance, with combined training as the least optimal approach, sometimes leading to worse models than using solely real data. Pre-training with synthetic data and fine-tuning with limited real images provides better results than training with all sources jointly. Moreover, a structured learning process where images are presented in an increasing-complexity manner leads to a better understanding of the scene. This progressive pacing leads to a better learning of broader structures such as roads and buildings first, allowing later epochs to focus on understanding small-sized semantic instances such as pedestrians, traffic lights and poles. This approach differs from current state-of-the-art approaches by being model agnostic; hence, it can be applied to any architecture. The results of the experiments also suggest that realism is not the only key factor of synthetic data: the content of each image and the pace at which synthetic images are shown to the model during training are also key factors barely analyzed in the literature.

In this work we have studied and validated the benefits of structuring and arranging data in a sample-based curriculum learning paradigm. As potential improvements of this work, we envision the incorporation of an explicit domain adaptation technique to further narrow the real and synthetic domains. Moreover, we use a simple, yet effective, definition of sample complexity (i.e., number of moving object instances) for a single virtual scenario. This can be extended by incorporating additional complexity factors such as multiple viewpoints, the number of semantic labels or even the usage of several virtual scenarios. Finally, it is important to highlight the unpaired number of classes between real and synthetic datasets. This mismatch leads to a performance degradation when evaluating on real data.