Road Scenes Segmentation Across Different Domains by Disentangling Latent Representations

Deep learning models obtain impressive accuracy in road scene understanding; however, they need a large quantity of labeled samples for their training. Additionally, such models do not generalize well to environments where the statistical properties of data do not perfectly match those of training scenes, and this can be a significant problem for intelligent vehicles. Hence, domain adaptation approaches have been introduced to transfer knowledge acquired on a label-abundant source domain to a related label-scarce target domain. In this work, we design and carefully analyse multiple latent space-shaping regularization strategies that work together to reduce the domain shift. In more detail, we devise a feature clustering strategy to increase domain alignment, a feature perpendicularity constraint to space apart features belonging to different semantic classes, including those not present in the current batch, and a feature norm alignment strategy to separate active and inactive channels. In addition, we propose a novel evaluation metric to capture the relative performance of an adapted model with respect to supervised training. We validate our framework in driving scenarios, considering both synthetic-to-real and real-to-real adaptation, outperforming previous feature-level state-of-the-art methods on multiple road scene benchmarks.


I. INTRODUCTION
One of the key components of a self-driving vehicle is the capability of understanding the surrounding environment from sensory input data. Semantic segmentation enables profound scene understanding: all pixels of the input images are assigned to a semantic category corresponding to key elements to be detected, such as the road, other vehicles, or traffic lights and signs. Nowadays, this task is commonly tackled with Deep Convolutional Neural Networks (DCNNs), which have achieved outstanding results in image understanding tasks, provided that a sufficiently large number of labeled examples is available from the target input domain distribution. On the other hand, the annotation of thousands of images of road scenes is highly expensive, time-consuming, error-prone and, possibly, worthless, since the test data can show a domain shift with respect to the training labeled samples. Hence, a strong requirement recently emerged for the research and development of autonomous driving systems: being able to train DCNNs with a combination of labeled source samples (e.g., synthetic images from ad-hoc simulators or driving video games) and unlabeled target samples (e.g., real-world acquisitions from cameras mounted on cars), with the aim of achieving high performance on data following the target distribution. The need for large quantities of labeled target data is superseded by data coming from a source domain where samples are abundantly available and annotations are faster and cheaper to generate.
Unfortunately, DCNNs are prone to failure when they are shown an input domain distribution other than the training one (domain shift phenomenon). In order to deal with this problem, various Unsupervised Domain Adaptation (UDA) techniques have been developed to adapt networks at different stages (the most common are the input, feature and output levels) [1].
Deep learning models for semantic segmentation are mostly based on encoder-decoder architectures, i.e., they build concise latent representations of the inputs, which are highly correlated with the classifier output. As such, they are used in the subsequent classification process [2], [3] that reconstructs the full-resolution segmentation map. Nevertheless, only a small number of UDA techniques for semantic segmentation work in the feature space, because of its large dimensionality. In this paper, we propose one such approach, comprised of a new set of strategies working at the latent space level, building on top of our previous conference work [4]. By employing a shaping objective in this space, we aim to promote class-aware feature extraction and feature invariance between source and target domains. We remark that, while the general target behind each strategy is similar to that of [4], their actual implementation was overhauled, resulting in a significant performance increase.
Firstly, a clustering constraint groups feature vectors of each class tightly around their prototypical representation. Secondly, a perpendicularity objective over the class prototypes promotes disjoint filter activation sets across different semantic categories. Finally, a regularization-based norm alignment objective enforces consistent vector norms in the source and target domains, while jointly forcing progressively increasing norm values. This, in combination with the perpendicularity constraint, is able to reduce the entropy associated with the feature vector channel activations.
Importantly, the proposed techniques require the generation of accurate class prototypes and the imposition of a strong correlation between feature representations and predicted segmentation maps. Hence, we also propose a novel strategy to map semantic information from the labeling maps to the low-resolution feature space (annotations downsampling).
This paper moves from our previous work [4], which already achieved state-of-the-art results on feature-level UDA in semantic segmentation. Compared to the conference version, this journal extension introduces several novel contributions.
First of all, the computation of prototypes and the feature vector extraction have been refined. The former now considers the prototype trajectory evolution for a better estimation (Sec. III-B), while the latter exploits target information to reduce the domain shift (Sec. III-C); additionally, a class-weighting scheme is used in the source supervision (Sec. III-A).
Then, each of the three proposed space-shaping constraints has been improved, and additional ablation studies are shown both for our approach, LSR+ (Sec. VIII), and for the proposed evaluation metric, mASR (Sec. VI). In particular, the clustering objective was modified to be more resilient to outliers (Sec. IV-A); the perpendicularity constraint now accounts for classes not present in the current batch (Sec. IV-B); the norm alignment now ignores low-activated channels (Sec. IV-C).
Finally, extensive experiments have been conducted on many road scenarios, expanding the set of experiments reported in [4]. The results are evaluated on 4 backbones and 6 setups. These include not only 2 synthetic-to-real ones, commonly used in related works, but also 4 real-to-real settings addressing the critical issue of the generalization of autonomous driving systems across different cities and types of roads in different regions of the world. Additional results using the unlabeled Cityscapes coarse set [5] are reported, showing significant performance gains when more unlabeled data are used (see Table I).
II. RELATED WORK
Unsupervised Domain Adaptation consists in transferring knowledge extracted from a label-rich source domain to a completely unlabeled target domain. The ultimate objective is to address the performance decline caused by domain shift, which negatively affects the generalization capabilities of deep neural networks. The problem was initially studied for the classification task, but recently many works have dealt with the unsupervised adaptation problem in relation to semantic segmentation. Although several methods have been proposed to tackle the adaptation task, they all share an underlying search for some form of domain distribution alignment over a representation space. Some methods pursue distribution matching inside the input image space via style transfer or image generation techniques, while others aim at bridging the statistical gap between source and target representations produced by the task model, whether by manipulating some output representations or by operating inside a latent feature space [1].
Input-space adaptation has been commonly addressed by resorting to image-to-image translation [19]-[24]. By transferring visual attributes across source and target samples, domain invariance is achieved in terms of visual appearance. Source supervision can thus be safely exploited in the shared image space, retaining consistent accuracy on source and target data.
As concerns feature- and output-space adaptation, adversarial learning has been largely employed to bridge the statistical domain gap [25]-[31]. With the help of a domain discriminator, the task network is forced to provide statistically indistinguishable source and target representations, typically drawn from a latent feature space [25]-[27] or in the form of probability maps at the output of the segmentation pipeline [27]-[31]. More recently, some works focusing on feature-level regularization have been proposed [4], [32]. In [32] a class-conditional domain alignment is achieved by means of a discriminative clustering module, paired with orthogonality constraints to enhance class separability. The approach of [4] relies on conditional clustering adaptation, enhanced by a perpendicularity objective over class prototypical representations and a novel norm alignment loss to improve class separability in the latent space. As an alternative form of feature-level adaptation, dropout regularization has been explored [33]-[35]; decision boundaries are pushed away from target high-density regions in the latent space without direct supervision.
Output-space adaptation has been further pursued by resorting to self-training [36], [37], where the learning process is guided (in a self-supervised manner) by pseudo-labels extracted from target network predictions. Self-supervision has been proposed in a curriculum learning fashion as well [38], [39]. First, simple tasks that are less sensitive to domain shift are solved, by inferring some useful properties related to the target domain. Then, the extracted information is exploited to address more complex learning tasks (e.g., semantic segmentation). Alternatively, some works introduce entropy minimization techniques [40], [41], which force more confident network predictions over target data, thus encouraging the behavior shown in the supervised source domain.
Latent Space Regularization has been shown to ease the semantic segmentation task in different settings, such as UDA [42], [43], continual learning [44] and few-shot learning [45], [46]. The idea is to embed additional constraints on feature representations during the training process, enforcing a regular semantic structure on the latent spaces of the deep neural classifier. In UDA, where target semantic supervision is missing, regularization can be applied in a class-conditional manner by relying on the exclusive supervision of source samples, while indirectly propagating its effect to target representations as well. Such improved regularity has, in fact, been shown to promote generalization properties, leading to statistical alignment between the source and target distributions when regularization is jointly applied over both domains [4], [32].
A multitude of feature clustering techniques based on the K-Means algorithm have been proposed [42], [43], [47], [48] to address the adaptation task. Those works are mainly focused on image classification and resort to a projection to a more easily tractable lower-dimensional latent space in which to perform pseudo-labeling of the original target representations extracted by the task model [43], [47], [48]. In [4], [32] the idea is further refined and applied to semantic segmentation by proposing an explicit clustering objective paired with orthogonality constraints to force feature vectors to cluster around the respective class prototypes. Feature-level orthogonality has also been explored in [49] to limit the redundancy of the information encoded in feature representations. The approaches closest to our strategy are [50], [51], where UDA is promoted via an orthogonality objective over class prototypes. Nonetheless, [49]-[51] all limit their focus to the image classification task.

III. PROBLEM SETTING
In this section we overview our setup, detailing the mathematical notation used throughout the paper. We start by identifying the input space as X ⊂ R^{H×W×3} and the corresponding label space as Y ⊂ C^{H×W}, where H and W represent the image resolution and C the class set. Furthermore, we assume to have a labeled training set from a source domain, while an additional set of input samples is drawn from an unlabeled target domain (X^t_n ∈ X^t). We adapt the knowledge of semantic segmentation learned on the source domain to the unsupervised target domain. The superscript s identifies the source domain, while t identifies the target one.
As done by most recent approaches for semantic segmentation, we assume a task model S = D ∘ E based on an encoder-decoder architecture, that is, made by the consecutive application of an encoder network E (referred to as backbone, which acts as feature extractor) and a decoder network D, which actually performs the classification and produces the segmentation map. We denote the features extracted from an input image X as E(X) = F ∈ R^{H'×W'×K}_{0+}, where K refers to the number of channels and H'×W' to the low-dimensional (feature-level) spatial resolution. Thanks to the topology of encoder-decoder DCNNs for semantic segmentation, classes are encoded into ideal latent representations, invariant with respect to the domain shift. The strategies presented in Sec. IV enforce this goal by comparing the extracted features belonging to each class with the respective prototypical representations. In the following paragraphs, we present the techniques used to compute the prototypes and associate feature vectors to semantic classes.
A. Weighted Histogram-Aware Downsampling. Given that most of the spatial information of an image is maintained while it is processed by an encoder-decoder network, we can assume a tight relationship between any feature vector (i.e., the vector of features associated with a single spatial location within the feature tensor) and the semantic labeling of the corresponding image region.
Hence, the extraction process begins with the identification of a way to propagate the labeling information to the latent representations (decimation), without losing the semantic content of the window (image region) corresponding to each feature vector. A naïve approach, which allows wrong mappings, would strongly affect the whole following procedure. Our solution is a non-linear pooling function which, instead of computing a simple subsampling (e.g., nearest neighbor), extracts a weighted frequency histogram over the labels of all the pixels in the window corresponding to a low-resolution feature location. The weights are inversely proportional to the class frequency in the source training dataset. Then, these histograms are used to select the most appropriate class for each image region, producing the source feature-level label maps {I^s_n}_{n=1}^{N_s}. The computation of the target counterparts ({I^t_n}_{n=1}^{N_t}) is discussed in Sec. III-C; we remark that each I^{s,t}_n ∈ C^{H'×W'}. In particular, we choose the label with the highest frequency peak in the window, only if such peak is relevant enough, i.e., if all other peaks are smaller than T_h times it (a similar approach is found in the orientation assignment step of the SIFT feature extractor [52]). Empirically, we set T_h = 0.5. Finally, we remark a useful side-effect of this technique: whenever a window cannot be uniquely assigned to a class (that is, it contains multiple labels), the procedure automatically assigns it to the void class.

B. Prototype Computation. We collect the feature vectors associated with each class c in the sets F^{s,t}_c = {F^{s,t}[h, w] : I^{s,t}[h, w] = c}, where the couple [h, w] denotes the spatial location (0 ≤ h < H' and 0 ≤ w < W').
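To make the procedure of Sec. III-A concrete, the weighted histogram-aware downsampling can be sketched as follows. This is a minimal illustration rather than the paper's code: the void label value, the window size handling and the class-weight vector are assumptions.

```python
import numpy as np

VOID = 255          # hypothetical void label
T_H = 0.5           # relative-peak threshold from the paper

def histogram_downsample(labels, win, class_weights):
    """Map a (H, W) label map to a (H//win, W//win) feature-level map.

    Each low-resolution cell receives the class whose weighted frequency
    histogram peaks in the corresponding window, but only when every other
    peak is smaller than T_H times the top one; otherwise the cell is void.
    """
    H, W = labels.shape
    Hl, Wl = H // win, W // win
    out = np.full((Hl, Wl), VOID, dtype=np.int64)
    n_classes = len(class_weights)
    for i in range(Hl):
        for j in range(Wl):
            window = labels[i*win:(i+1)*win, j*win:(j+1)*win].ravel()
            window = window[window != VOID]
            if window.size == 0:
                continue
            # weighted histogram: weights inversely proportional to the
            # source class frequency
            hist = np.bincount(window, minlength=n_classes).astype(float)
            hist *= class_weights
            order = np.argsort(hist)[::-1]
            top, second = hist[order[0]], hist[order[1]]
            if second < T_H * top:   # the peak is dominant enough
                out[i, j] = order[0]
    return out
```

A window dominated by one class receives that class; a window split between two equally frequent classes stays void.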
The definition is further expanded into the set of all feature vectors in batch B by taking the union with the set F^{s,t}_v of samples belonging to the void class: F^{s,t}_B = (∪_{c∈C} F^{s,t}_c) ∪ F^{s,t}_v. From these sets we can extract the batch-wise prototypes of each class (note that we use feature vectors exclusively from the source): p_c = (1/|F^s_c|) Σ_{f∈F^s_c} f. Moreover, with the goal of obtaining more stable and reliable prototypes, and to reduce estimation noise, we consider the exponentially smoothed vectors:

p̂_c = η p̂'_c + (1 − η) p_c,    (3)

where p̂'_c represents the estimate at the previous optimization step and p_c the one of the current step. The parameters are initialized with p̂_c = 0 and η = 0.8 (empirically). This way, by setting η < 1, we can propagate the previous estimates to the current batch, allowing us to consider classes absent from Y^s_n in the loss computation.

C. Feature Pseudo-Labeling. While the histogram strategy can be seamlessly extended to be used with pseudo-labels (i.e., network estimates for the unlabeled target samples, as was our strategy in the previous work [4]), this approach can introduce instability in the training procedure. To avoid such issue, we devise a novel way of extracting the target feature-level label maps {I^t_n}_{n=1}^{N_t}. Our strategy exploits the Euclidean distance in the latent space, computing a clustering of the feature vectors around their prototypes (see Fig. 1). In more detail, we compute an initial classification exploiting the prototypes computed over the source labeled data, which, due to the domain shift, will not be adequately representative of the target distribution: each target feature vector is assigned to the class maximizing σ_c(−||F^t[h, w] − p̂_c||_2), where σ_c(·) is the softmax function computed over the classes. Then, we refine the classification, keeping only those vectors that have a high classification confidence according to the probability distribution attained through the softmax function.
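A compact sketch of the prototype computation, its exponential smoothing, and the distance-based pseudo-labeling follows. The confidence threshold value and the use of negative Euclidean distances as softmax logits are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def batch_prototypes(feats, labels, n_classes):
    """Per-class mean of the (N, K) source feature vectors."""
    protos = np.zeros((n_classes, feats.shape[1]))
    for c in range(n_classes):
        mask = labels == c
        if mask.any():
            protos[c] = feats[mask].mean(axis=0)
    return protos

def ema_update(prev, curr, eta=0.8):
    """Exponential smoothing of the prototype trajectory."""
    return eta * prev + (1 - eta) * curr

def pseudo_label(feats, protos, conf_th=0.9):
    """Assign each target vector to its nearest prototype (negative
    Euclidean distances turned into a softmax over classes), keeping
    only confident assignments; the rest are marked -1 (discarded)."""
    d = np.linalg.norm(feats[:, None, :] - protos[None, :, :], axis=2)
    probs = softmax(-d, axis=1)
    lab = probs.argmax(axis=1)
    lab[probs.max(axis=1) < conf_th] = -1
    return lab
```

In a full pipeline the smoothed prototypes would be recomputed at every optimization step and the low-confidence vectors excluded from the target losses.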

IV. METHODOLOGY
The proposed approach is detailed in this section, highlighting the key differences with respect to our previous work. Our investigation moves from the fact that the discriminative effect acquired by the model with the source supervised cross-entropy objective may not be propagated to the target domain due to the distribution shift. To tackle this problem, in [4] we proposed to use additional space-shaping objectives to increase the network generalization capability, therefore improving robustness to distribution shifts from the original source training data. In particular, we added three feature-space shaping constraints to the standard source supervision (L^s_CE), whose combined effect can be expressed as:

L_LSR = L^s_CE + λ_C L^{s,t}_C + λ_P L^{s,t}_P + λ_N L^{s,t}_N.    (6)

Here, L_C represents the clustering objective acting on the feature vectors (Sec. IV-A), L_P the perpendicularity constraint applied to class prototypes (Sec. IV-B) and L_N the norm alignment goal (Sec. IV-C). To simplify the notation, Eq. (6) contains each loss component with s, t superscripts to indicate the sum of the loss on source and target samples. To further improve the performance and to show how the proposed techniques can be used on top of existing strategies, we also add to the optimization target the entropy minimization loss introduced by Chen et al. [40], obtaining:

L_LSR+ = L_LSR + λ_EM L_EM.    (7)

By doing so, we also show that our space-shaping objectives provide a different and complementary effect on the feature vectors when compared to the entropy minimization constraint. An overview of the proposed strategy is presented in Fig. 2.

A. Clustering of Latent Representations
Due to the distribution discrepancy between the source and target domains, feature vectors originating from them will be misaligned. This inevitably causes some incorrect classifications of target representations, in turn degrading the segmentation accuracy in the target domain. We introduce our first loss, a clustering objective over the latent space, to mitigate this problem, seeking a class-conditional alignment of the feature distributions. We do so by exploiting the prototypical representations discussed in Sec. III and forcing the feature vectors from source and target representations to tightly cluster around them: representations are adapted into a common class-wise distribution and the discriminativeness of the latent space is increased.
Differently from the previous work, we define the clustering objective as the L1 distance between feature vectors and their associated class prototypes. This results in a more stable training evolution and a lower error rate in clustering, thanks to the outlier-rejecting properties of the L1 norm. In particular, due to the quadratic nature of the L2 loss, outliers with distances greater than 1 exert a strong pull towards the clusters. On the other hand, the L1 loss is stronger than L2 for close samples, which are more representative of each class, and is significantly gentler than L2 for distant outliers. The loss can be expressed mathematically as:

L^{s,t}_C = (1/|C|) Σ_{c∈C} (1/|F^{s,t}_c|) Σ_{f∈F^{s,t}_c} ||f − p̂_c||_1.

This loss has multiple targets: the first is the increased clustering of the latent representations thanks to label supervision, which reduces the tendency to erroneous predictions. The second one is to perform self-supervised clustering on target samples using our two-pass pseudo-labeling strategy (see Sec. III-C). Finally, it leads to better prototype estimates, due to the fact that forcing tighter clusters will lead to more stable batch-wise centroids, which in turn will get closer to the moving-averaged prototypes.
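A minimal sketch of the L1 clustering term; the exclusion of discarded pseudo-labels (marked -1) and the averaging scheme are assumptions made for illustration.

```python
import numpy as np

def clustering_loss(feats, labels, protos):
    """Mean L1 distance between each (N, K) feature vector and the
    smoothed prototype of its class; vectors labeled -1 (low-confidence
    pseudo-labels) are excluded from the average."""
    valid = labels >= 0
    if not valid.any():
        return 0.0
    return np.abs(feats[valid] - protos[labels[valid]]).sum(axis=1).mean()
```

Replacing the L1 distance with a squared L2 distance here would recover the quadratic penalty discussed above, in which distant outliers dominate the gradient.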

B. Perpendicularity of Latent Representations
A prototype perpendicularity loss is further proposed to aid the latent space regularization brought by the clustering objective. Our goal is to induce compact and domain-aligned feature clusters, in order to boost the accuracy of the network segmentation maps. As a direct consequence, the margin between classification boundaries and feature clusters is expanded, thus decreasing the probability that target high-density regions are traversed by such boundaries. We directly encourage a class-wise orthogonality property, not only increasing the distance among class clusters, but also reducing class cross-talk by discouraging shared channel activations in distinct categories.
In the loss, we encode the perpendicularity score exploiting the definition of the Euclidean inner product: j · k = ||j|| ||k|| cos θ, where θ is the angle between the two vectors j and k. To maximize θ we just need to minimize the normalized product of the vectors (recall that j, k ∈ R^K_{0+}, so cos θ ∈ [0, 1]). Therefore, the cross-perpendicularity between prototypes is encoded as:

L^{s,t}_P = Σ_{j,k∈C, j≠k} (p̂_j · p̂_k) / (||p̂_j|| ||p̂_k||).    (8)

Eq. (8) computes the sum of the cosines over the set of all couples of non-void classes. The influence of the orthogonality objective indirectly reaches all feature vectors, as prototypical representations and single feature instances share a strong geometric bond promoted by L^{s,t}_C. What we ultimately achieve is thus to enforce a perpendicularity constraint among instances of different clusters, with a homogeneous action over all latent representations from the same semantic class. In other words, the angular gap among distinct semantic categories in the feature space is enlarged, by inducing disjoint patterns of activated feature channels between distinct classes.
In contrast to our previous paper [4], we compute the loss on the exponentially smoothed version of the prototypes, i.e., on p̂_c from Eq. (3). This guarantees that the space will be more evenly occupied by the classes, since all class directions are considered in the computation of the loss, instead of only those present in the current batch.
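Since the prototypes lie in the non-negative orthant, each pairwise cosine lies in [0, 1] and the sum of cosines is minimized exactly when the prototypes are mutually orthogonal. A minimal sketch of this pairwise-cosine objective, written here over an arbitrary prototype matrix:

```python
import numpy as np

def perpendicularity_loss(protos, eps=1e-8):
    """Sum of cosine similarities over all couples of distinct class
    prototypes. With non-negative features the cosine lies in [0, 1],
    so minimizing the sum pushes the prototypes toward orthogonality."""
    unit = protos / np.maximum(np.linalg.norm(protos, axis=1, keepdims=True), eps)
    cos = unit @ unit.T
    iu = np.triu_indices(len(protos), k=1)   # upper triangle: each couple once
    return cos[iu].sum()
```

Orthogonal prototypes give a loss of zero, while parallel prototypes contribute one per couple.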

C. Latent Norm Alignment Constraint
This loss term is computed exploiting the source and target feature vector norms. In more detail, we enforce norm consistency between the latent representations extracted from the two domains. This has three objectives: firstly, we aim at an improved classification confidence on target predictions, as done by adaptation techniques using entropy minimization in the output space [41]. Secondly, we assist the perpendicularity constraint by reducing the number of domain-specific feature channels used by the network for classification. Thirdly, we reduce the number of channels enabled only on one of the domains, which would lead to norm misalignment. Moreover, to counteract a possible decrease in norm value during the alignment process, we introduce a regularization term that promotes norm increase. Differently from [4], here the norm objective is encoded as a relative difference with a regularization term inversely proportional to the norm value. This allows us to obtain a value-independent loss where norm values higher than the target are less discouraged. Moreover, we introduce a norm filtering strategy to reduce the negative effects a careless increase in norm could imply. In particular, we suppress low channel activations, stopping the gradient flow through them and preventing the norm alignment procedure from increasing their value in contrast to what source supervision indicates. Formally, we define the loss term as:

L^{s,t}_N = (1/|F^{s,t}_*|) Σ_{f*∈F^{s,t}_*} [ |(||f*|| − f̄_s)| / f̄_s + Δ_f f̄_s / ||f*|| ],    (9)

where f̄_s is the average source vector norm (extracted in the previous optimization step), Δ_f dictates the regularization strength (experimentally tuned to 0.1) and F^{s,t}_* is a thresholded version of F^{s,t} where we set to 0 the low-activated channels of each feature vector, stopping the gradient propagation through them. This objective is applied in a completely unsupervised manner: the vector norms are forced to align to the same value regardless of their class. In this way we remove the bias generated by the heterogeneous pixel-class distribution in semantic labels, which, for example, leads features of the most frequent classes to have larger norms than average. The constraint of Eq. (9) forces the inter-class alignment step, i.e., it promotes a gradual alignment of the norms towards a target common to all categories, while discouraging the value of such target from decreasing too rapidly. An additional benefit of rescaling the loss by the norm target is that the loss gradients are limited in magnitude and, therefore, more stable.
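The norm-alignment term can be sketched as follows. This is only an illustration of the description above: the exact loss form, the activation threshold (act_th) and the use of a hard mask in place of a per-channel gradient stop are assumptions.

```python
import numpy as np

def norm_alignment_loss(feats, f_bar_s, delta_f=0.1, act_th=0.1):
    """Relative norm alignment with a regularizer inversely proportional
    to the vector norm. Channels below act_th times each vector's peak
    activation are zeroed; in an autodiff framework this masking would
    also stop the gradient flow through the suppressed channels."""
    f_star = np.where(feats >= act_th * feats.max(axis=1, keepdims=True), feats, 0.0)
    norms = np.linalg.norm(f_star, axis=1)
    rel_diff = np.abs(norms - f_bar_s) / f_bar_s       # value-independent alignment
    reg = delta_f * f_bar_s / np.maximum(norms, 1e-8)  # promotes norm increase
    return np.mean(rel_diff + reg)
```

A vector whose thresholded norm already matches the running source average pays only the small regularization term, which shrinks as the norm grows.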

V. IMPLEMENTATION DETAILS
Training Data. We evaluated our approach (LSR+, Latent Space Regularization) on road scene segmentation in various synthetic-to-real and real-to-real unsupervised adaptation tasks. For the source domains we used the synthetic datasets GTAV [17] and SYNTHIA [18]. The first contains 24,966 labeled images at a resolution of 1914×1052 px, produced with the rendering engine of the GTAV videogame, while the second contains 9,500 labeled images at a resolution of 1280×760 px, rendered with a custom software. For the target domain, we selected the real-world dataset Cityscapes [5]. It contains 5,000 labeled images at a resolution of 2048×1024 px and an additional set of 20,000 coarsely labeled samples, acquired in European cities. When considering only unlabeled samples the two versions are equivalent and can be merged (obtaining a dataset we refer to as CS-full), improving the adaptation process (as we show in Table I). In the real-to-real setup we used the Cross-City benchmark, where the Cityscapes dataset takes the role of source domain, while the Cross-City dataset [53] takes the role of target. Such dataset is comprised of 12,800 (4 × 3,200) high-resolution (2048×1024 px) images taken in four major cities (Rome, Rio, Tokyo, Taipei).
We trained the model in a closed-set [1] setting, i.e., with the same source and target class sets. In more detail, we used the 19, 16 and 13 classes in common for GTAV, SYNTHIA and Cross-City, respectively. GTAV, Cityscapes and Cross-City images have been rescaled for training to 1280×720 px, 1280×640 px and 1280×640 px, respectively, while the resolution of SYNTHIA images has not been changed.
Network Training. We optimize the model with SGD (using a momentum of 0.9 and a weight decay regularization of 5×10^−4). The learning rate starts from 2.5×10^−4 and follows a polynomial decay of power 0.9 over 250k steps, as in [40]. We used a subset of the original training set for validation, to tune the hyper-parameters of our loss components. To tackle overfitting we used some data augmentation strategies: random left-right flipping; white point re-balancing ∝ U([−75, 75]); color jittering ∝ U([−25, 25]) (the last two applied independently on the R, G and B channels); and random Gaussian blur [36], [40]. We perform training on an NVIDIA Titan RTX, using a batch size of 2 (1 source and 1 target sample) for 24,750 steps (i.e., 10 epochs of the Cityscapes train set). We also exploited the validation set for early stopping.
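The polynomial learning-rate schedule described above reduces to a one-line function of the training step:

```python
def poly_lr(step, base_lr=2.5e-4, max_steps=250_000, power=0.9):
    """Polynomial learning-rate decay: base_lr * (1 - step/max_steps)^power."""
    return base_lr * (1.0 - step / max_steps) ** power
```

The rate starts at base_lr, decreases monotonically, and reaches zero exactly at max_steps.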
The code developed for this work is publicly available at the following link: https://github.com/LTTM/LSR.

VI. MEAN ADAPTED-TO-SUPERVISED RATIO METRIC
In this section we introduce a novel measure, called mASR (mean Adapted-to-Supervised Ratio), to better evaluate the domain adaptation task compared to the usual mIoU metric.
The idea behind the new metric stems from realizing that the mIoU misses a key component needed to evaluate an adaptation method: it does not account for the starting accuracy on the different classes in supervised training. In particular, the objective of domain adaptation is to transfer the knowledge learned on a source dataset to a target one, trying to get as close as possible to the results attainable through supervised learning on the target domain. We design mASR to capture the relative accuracy of an adapted architecture with respect to its target supervised counterpart, which we identify as a reasonable upper bound. Therefore, mASR focuses less on the absolute-term performance and more on the relative accuracy obtained by an adapted architecture when compared to traditional supervised training.
We compare the per-class IoU score of the adapted network for each c ∈ C (IoU^c_adapt) with the results of supervised training on target data (IoU^c_sup) and we compute mASR as:

mASR = (100 / |C|) Σ_{c∈C} IoU^c_adapt / IoU^c_sup.

In mASR, the contribution corresponding to each semantic category is inversely proportional to the accuracy of the segmentation model on it in the supervised scenario, thus giving more relevance to the most challenging classes and producing a more class-agnostic adaptation score. Furthermore, notice how the most challenging classes in driving scenarios are typically associated with small objects like traffic lights, pedestrians and bicycles, which are very critical for autonomous navigation. In this metric, higher means better, and when the adapted network matches the performance of supervised training the score is 100%.
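Following this description, the metric reduces to an average of per-class ratios; a minimal sketch (the example IoU values are illustrative):

```python
def mASR(iou_adapt, iou_sup):
    """mean Adapted-to-Supervised Ratio (percent): per-class IoU of the
    adapted model divided by its supervised upper bound, averaged over
    the class set. Matching the supervised model yields 100%."""
    ratios = [a / s for a, s in zip(iou_adapt, iou_sup)]
    return 100.0 * sum(ratios) / len(ratios)
```

A class with a low supervised IoU thus weighs as much as an easy one: recovering half of a hard class's supervised score contributes exactly as much as recovering half of an easy class's.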
As an example, the mASR scores reported in the last two columns of Table I make it possible to identify at a glance the algorithms that most faithfully match the target performance.
To validate the new metric, we used as reference the supervised training on the Cityscapes dataset and compared it with training on corrupted versions of the same dataset, using the introduced mASR metric to evaluate the relative performances and so, indirectly, the domain shift introduced by the perturbations. In Fig. 3 we identified 5 types of perturbations which are likely to be encountered by an agent moving outdoors (i.e., Gaussian noise, motion blur, snow, fog, brightness) and we set 5 levels of noise intensity as defined by [56]. As expected, the higher the noise intensity, the lower the adaptation score computed by mASR. Furthermore, we can also get a hint of the most detrimental types of noise for adapting source knowledge: namely, Gaussian noise, snow and motion blur. This can help us identify which set of samples we should consider more in order to obtain a reliable model capable of handling these situations. On the other hand, brightness and fog influence the final scores less.

VII. EXPERIMENTAL EVALUATION
The qualitative and quantitative results achieved by the proposed approach (LSR+) in various driving contexts are presented in this section, where it is compared with several other feature-level approaches (i.e., [27], [32], [57]), with some entropy minimization strategies (i.e., [40], [41]) that have a similar effect on the feature distribution, and finally with the conference version of our work [4]. The key feature of these approaches is their training efficiency: the addition of such constraints does not increase the computational complexity of the training, differently from hugely expensive generative networks or modified architectures.
Our end-to-end method allows a straightforward integration with other strategies, e.g., adversarial approaches at the input or output level, or entropy minimization. In order to verify such compatibility, we introduce an additional entropy-minimization loss [40] in our setup. We start by considering two widely used synthetic-to-real benchmarks and a standard ResNet-101 as backbone architecture, obtaining the results shown in Table I. Then, a real-to-real benchmark [53] has also been used (see Table II). To further verify the robustness of our setup, in Table S1 we report some results using different backbones (i.e., ResNet50, VGG16 and VGG13).

A. Adaptation from Synthetic Data to Cityscapes
When adapting source knowledge from the GTA5 dataset to the Cityscapes one, our approach (LSR+) achieves a mIoU of 46.9%, with a gain of 10% compared to the baseline and of 0.9% compared to the conference version (LSR) [4], thanks to the refined space-shaping objectives. Moreover, it outperforms all other strategies: the only techniques able to get close to its performance are the recent works of [32] and [40], while the other competitors see a significant score drop. The performance improvement is quite stable across per-class IoUs, and is particularly noticeable in challenging classes, like terrain and t. light, where our strategy shows very high percentage gains, and on train, where we significantly outperform the competitors by doubling the score of the second-best strategy.
Some qualitative results are reported in the top half of Fig. 4. From visual inspection, we can verify the increased precision of the edges of the t. sign, t. light, pole and person classes in both images. Furthermore, our approach is the only one to correctly classify the bus on the right of the first image, which is confused with a truck by the other strategies. Importantly, we can also see the effects of our two-pass labeling (see Sec. III-C) on the left of the top image (where part of the fence is correctly classified by our strategy, while being missed by all competitors) and in the second image (where LSR+ significantly reduces the confusion between the sky and the white building).
In the SYNTHIA to Cityscapes setup, LSR+ surpasses its conference version (LSR) by about 1% of mIoU in the 16-class setup and by 0.6% in the 13-class one, achieving final scores of 42.6% and 48.7%, respectively. It also outperforms all the other competitors, with a slight margin of 1% on average with respect to [32] and a larger one (more than 3%) compared to all the other approaches.
Qualitative results are reported in the bottom half of Fig. 4, where the overall increase in segmentation accuracy for many classes, such as car, road and sidewalk, is evident. In the first image (third row of Fig. 4) we can see how LSR+ is the only strategy to correctly classify both rider and bike, whereas other strategies even miss the t. sign in the foreground. Similarly, in the second image we note improvements in the prediction of such classes and, crucially, of the road in the foreground (confused with car and bicycle by the competitors).

B. Adaptation from Cityscapes to Cross-City
Besides using synthetic data, another key requirement is the capability of adapting networks trained on road scenes coming from certain geographical areas to other regions. However, the great variability of road scenes across the world limits a wide application of locally-trained models on a global scale. To investigate the capability of our approach to cope with this problem, we evaluate its performance on the Cross-City real-to-real benchmark in Table II. This benchmark comprises 4 cities with completely different types of urban setting: Rome, Rio, Tokyo and Taipei. When evaluated on those setups, our strategy reaches mIoU scores of 56.2%, 52.3%, 50.0% and 50.0%, surpassing the source-only model by 5.2%, 3.4%, 2.2% and 3.7%, respectively. Importantly, our approach achieves consistent results across the setups (LSR+ is the top scorer in 3 out of 4 setups, and second in the remaining one), surpassing the average best competitor score by 0.5% mIoU (52.1% versus 51.6%). We remark that the best competitor changes depending on the setup, being [32], [32], [27] and [27] for Rome, Rio, Tokyo and Taipei, respectively, underlining the unstable performances of many approaches usually associated with this benchmark. Looking at the per-class IoU scores, we can see how our strategy significantly outperforms the competitors in t. light and rider in the Cityscapes→Rome setup (increase of 6% of IoU), in person and rider in the Cityscapes→Rio setup (increase of 4% of IoU) and in motorbike in the Cityscapes→Taipei setup (increase of 9.3% of IoU). Qualitative results on this benchmark are presented in the Supplementary Material.

C. Results with Different Backbones
Table S1 shows the performance of our strategy on GTAV→Cityscapes using multiple encoder-decoder backbones, in order to evaluate the generalization properties of the approach to different network architectures. Here we can see how LSR+ outperforms the source-only models (i.e., without adaptation) by 13.3%, 12.7% and 7.8% using ResNet50, VGG-16 and VGG-13, respectively. Even more importantly, we can see how the performance improvement is consistent across all backbones, in opposition to what happens with the competing strategies. Finally, we remark on the stability of the mASR score of our strategy, hovering around a mean of 57.0% with a very tight standard deviation of 1.4% (the mean values of the other strategies are 48.5% and 42.2%, and the standard deviations are 2.5% and 31.3%, respectively). Per-class IoUs are reported in the Supplementary Material.

VIII. ABLATION STUDIES
In this section, we evaluate the impact of each component of the approach on the final accuracy. Quantitative results are reported in Table IV, where we evaluate our strategy by removing each constraint independently and measuring the impact on the final accuracy. In particular, we show how the absence of each of our losses reduces the final performance by a minimum of 0.8% mIoU and an average of 1% mIoU. Each module brings a significant improvement in terms of accuracy and all the components are needed for the best results. Furthermore, the comparison with [4] highlights how the improvements are distributed over all the constraints and how the novel implementation of the space-shaping constraints has less overlap with the entropy minimization, resulting in a much higher performance when they are employed in conjunction.

A. Analysis of the Latent Space Regularization
For visualization purposes and for a fair comparative analysis across the classes, the plots of this section are computed on a balanced subset of feature vectors (250 vectors per class) extracted from the Cityscapes validation set.
Two-pass prototypes and clustering. To investigate the semantic feature representation learned by our approach, we computed a shared t-SNE [58] embedding of the prototypes sampled during the training procedure and of the target features produced by the final model. We remind the reader that, in order to more effectively shift target features closer to the source ones, we resort to a two-stage label assignment procedure which recovers target awareness (by averaging target-extracted features) from prototypes computed on the source domain (by centroid computation), as reported in Sec. III-C. In the left plot of Fig. 5 we report the learned prototype trajectory embeddings, and on the right the respective feature vectors. Here we can appreciate how prototypes get farther apart as training goes on and how features extracted from the target domain lie in a neighborhood of the prototype, which we recall is computed exclusively via source supervision. This underlines the effectiveness of our clustering strategy, which is able to shift the target feature distribution closer to the source one. Finally, to further analyze our clustering objective, we produce additional t-SNE embeddings starting from the normalized features (to remove the norm information, focusing on the angular one), which are reported in Fig. 6. Our strategy significantly improves the cluster separation in the embedded space and increases the spacing between clusters belonging to different classes, promoting feature disentanglement. This cross-talk reduction is also reflected in the decreased probability of confusing visually similar classes (e.g., the truck class with the bus and train ones).
Finally, PCA embeddings are reported in the Supplementary Material to evaluate the effect of latent-spacing techniques when projected to a lower dimension via a linear function.
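The two-pass assignment recalled above can be sketched as follows; this is an illustrative reconstruction with toy nearest-prototype classification, not the authors' exact procedure:

```python
import numpy as np

def two_pass_assign(src_feats, src_labels, tgt_feats, num_classes):
    """Two-pass target feature labeling (illustrative sketch).

    Pass 1: classify target features with source prototypes, i.e., the
    per-class centroids of source features. Pass 2: recompute prototypes
    as the centroids of the target features grouped in pass 1 (recovering
    target awareness), then reassign target features to them.
    """
    # per-class centroids computed with source supervision only
    protos = np.stack([src_feats[src_labels == c].mean(axis=0)
                       for c in range(num_classes)])

    def nearest(feats, centers):
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    first = nearest(tgt_feats, protos)                       # pass 1
    tgt_protos = np.stack([tgt_feats[first == c].mean(axis=0)
                           if (first == c).any() else protos[c]
                           for c in range(num_classes)])
    return nearest(tgt_feats, tgt_protos)                    # pass 2
```

The second pass shifts the classification centers toward the target distribution, mirroring the prototype displacement visualized in Fig. 1.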
Weighted histogram-aware downsampling. In this work, we extended the scheme proposed in [4] by adding class weights inversely proportional to the class frequency in the training dataset (see Sec. III). Our goal is to provide labeling only to spatial locations in feature maps where a clear class association can be performed, by relying on a frequency-aware scheme. By doing so, we seek the disentanglement of activations belonging to different classes, even when their feature vectors are neighbors in a given label map. This effect can be noted in Fig. 7, where our downsampling algorithms enhanced with frequency awareness are able to mark some feature locations close to class edges as unlabeled in the downsampled label map (middle and right), keeping only faithful features. As expected, class weighting (right plot of Fig. 7) promotes rarer classes at the feature level compared to the version without it [4] (middle plot of Fig. 7): for instance, compare the traffic sign (in yellow). Further evidence of this can be found in the class distribution of the segmentation maps (computed after their downsampling to the latent space spatial resolution), which we report in Fig. 8 for our weighted histogram-aware scheme, the previous un-weighted histogram-aware scheme of the conference version [4] and the standard nearest neighbor. In particular, the schemes based on histogram awareness seldom preserve small object classes, promoting the unlabeled class when discrimination between classes is uncertain. Our weighted histogram-aware scheme improves uniformity across rarer or smaller semantic categories, which were over-penalized by the previous approach [4], where all classes were treated equally regardless of their occurrence.
Perpendicularity is analyzed in Fig. 9, where we display the average angular distance between each prototype and all the remaining ones. Our goal is to achieve prototype perpendicularity, such that we minimize the overlap (i.e., cross-talk) among distinct semantic categories over feature activations. The red dashed line highlights the upper bound of the angular distance, which is set to 90 degrees since we assume feature vectors to have non-negative entries.
From the figure, it clearly emerges that LSR-based approaches increase the inter-prototype angle and that LSR+ makes prototypes even more orthogonal, with an improvement of more than 2 degrees on average.
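A perpendicularity objective of this kind can be sketched as below; the cosine-based formulation is an assumption made for illustration, showing how driving pairwise prototype similarity to zero pushes the angles toward the 90-degree bound:

```python
import numpy as np

def perpendicularity_loss(prototypes):
    """Prototype perpendicularity objective (illustrative sketch).

    Assuming non-negative (post-ReLU) features, the cosine similarity of
    two L2-normalized prototypes lies in [0, 1]; minimizing the mean
    off-diagonal similarity of the Gram matrix pushes prototypes toward
    mutual orthogonality, reducing cross-talk between classes.
    """
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    G = P @ P.T                                  # pairwise cosine similarities
    off = G[~np.eye(len(P), dtype=bool)]         # drop the diagonal (self-pairs)
    return off.mean()                            # 0 when all prototypes orthogonal
```

The loss reaches 0 for a perfectly orthogonal prototype set and 1 when prototypes collapse onto a single direction, matching the angular-distance analysis of Fig. 9.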
Norm alignment is analyzed in Fig. 10, where we show the mean channel entropy for each class. We observe that the entropy of the feature vectors produced by LSR+ is significantly reduced, meaning that features are characterized by more pronounced peaks and fewer poorly-activated channels.
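The channel-entropy diagnostic of Fig. 10 can be sketched as follows, assuming non-negative feature vectors normalized into per-channel distributions (an illustrative reconstruction, not the authors' exact computation):

```python
import numpy as np

def mean_channel_entropy(features):
    """Mean normalized channel entropy of feature vectors (diagnostic sketch).

    Treats each non-negative feature vector (rows of `features`) as a
    distribution over its channels. Low entropy means a few strongly
    active channels and many near-zero ones, which is the separation of
    active and inactive channels promoted by the norm-alignment constraint.
    """
    feats = np.asarray(features, dtype=float)
    p = feats / (feats.sum(axis=1, keepdims=True) + 1e-12)
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)
    return ent.mean() / np.log(feats.shape[1])  # normalized to [0, 1]
```

A vector with a single active channel scores near 0, while a flat vector scores near 1; the drop observed for LSR+ in Fig. 10 corresponds to moving toward the former regime.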

IX. CONCLUSIONS
In this work, we tackled the generalization of road scene segmentation models by introducing a set of latent-space regularization strategies for unsupervised domain adaptation. We improved domain invariance using different latent space-shaping constraints (i.e., class clustering, class perpendicularity and norm alignment) to space apart features belonging to different classes, while clustering together features of the same class in a consistent way on both the source and target domains. To support their computation, we introduced a novel target pseudo-labeling scheme and a weighted label decimation strategy. Results have been evaluated using both the standard mIoU and a novel metric (mASR), which captures the relative performance between an adapted model and its target supervised counterpart. We outperformed state-of-the-art methods in feature-level adaptation on two widely used synthetic-to-real road scene benchmarks and in real-to-real setups, paving the way for a new set of feature-level adaptation strategies capable of improving the discrimination ability of road scene understanding approaches.
Future work will focus on designing novel feature-level techniques and on evaluating their capability of generalizing to various tasks in driving scenarios. The adaptation from multiple source domains to multiple target ones will also be considered, together with the application to multimodal data (e.g., LiDARs or depth cameras) mounted on cars.

Fig. 1 :
Fig. 1: Visual representation of our two-pass feature vector classification strategy. The initial source-based classification (in blue) can lead to erroneously classified target samples (purple shaded areas). This problem is tackled by computing target prototypes as the centroids of the partitioned vectors (notice the shift compared to the original source prototypes); these prototypes are then used as new classification centers (green boundary), producing a correct segmentation.

Fig. 2 :
Fig. 2: Visual summary of our strategy. Features are associated with semantic classes and prototypes are computed from them (III). Class clustering (IV-A), prototype perpendicularity (IV-B) and norm alignment and enhancement (IV-C) are the three proposed space-shaping constraints. Additionally, we apply on top an entropy minimization objective [40].

Fig. 3 :
Fig. 3: mASR score as a function of the injected noise intensity.

Fig. 5 :
Fig. 5: t-SNE embedding of the target feature vectors. On the left: trajectories of prototypes sampled over 200 training steps. On the right: features produced by the final model, embedded according to the shared t-SNE projection.

Francesco Barbato received the M.Sc. degree in Telecommunication Engineering from the University of Padova in 2020. He is currently a Ph.D. student at the same University. His research focuses on domain adaptation and continual learning applied to computer vision tasks, particularly semantic segmentation for autonomous vehicles.
Umberto Michieli received the M.Sc. degree in Telecommunication Engineering from the University of Padova in 2018. He is currently a final-year Ph.D. student at the same University. In 2018, he spent 6 months as a Visiting Researcher at the Technische Universität Dresden. In 2020 he interned as a Research Engineer for 8 months at Samsung Research UK. His research focuses on transfer learning techniques for semantic segmentation, in particular on domain adaptation and incremental learning.
Marco Toldo received the M.Sc. degree in ICT for Internet and Multimedia in 2019 at the University of Padova. At present, he is pursuing his Ph.D. at the Department of Information Engineering of the same university. He is also doing an internship as a Research Engineer at Samsung Research UK. His research involves domain adaptation and continual learning applied to computer vision.
Pietro Zanuttigh received a Master degree in Computer Engineering at the University of Padova in 2003, where he also obtained the Ph.D. degree in 2007. Currently he is an associate professor at the Department of Information Engineering. He works in the computer vision field, with a special focus on domain adaptation and incremental learning in semantic segmentation, 3D acquisition with ToF sensors, depth data processing, sensor fusion and hand gesture recognition.

Fig. S1 :
Fig. S1: Prototype trajectories and target feature vectors projected via PCA. The projection is 3-dimensional; here we report the three xy, xz and yz planes.

TABLE I :
Comparison of adaptation strategies in terms of IoU, mIoU and mASR (Sec. VII). Best in bold, runner-up underlined. mIoU¹ and mASR¹ are restricted to 13 classes, ignoring the classes marked with the same superscript.

TABLE II :
Quantitative results on the Cross-City real-to-real benchmark. (r) indicates that the strategy was re-trained, starting from the official code. Best in bold, runner-up underlined.

TABLE III :
Additional quantitative results with multiple backbones, GTAV→Cityscapes setup. (r) indicates that the strategy was re-trained, starting from the official code.

TABLE IV :
Ablation studies: mIoU and mASR score comparison when removing any of the losses. Implementations of the losses from [4] are compared with the new ones in this work.