1 Introduction

Machine learning plays a crucial role in applications such as face recognition, self-driving cars, and healthcare (Roth et al. 2018; Rao and Frtunikj 2018; Guo and Zhang 2019), where a model must make predictive decisions on the basis of the knowledge it acquired during training. But what happens if an autonomous driving system trained on California roads is tested on New York roads? What if a sentiment analysis model trained on posts from the USA is used to interpret posts from the United Kingdom? Can a tumor detection model trained on one group of patients reliably find tumors in a more diverse group with different health conditions and backgrounds?

The answer to these questions is that, in general, such models will not perform well. This stems from the assumption made by conventional machine learning (ML) techniques that the source and target data are independent and identically distributed. However, this assumption is not always fulfilled in practice. Data often come from different distributions, introducing an issue known as domain shift (Ben-David et al. 2010; Blanchard et al. 2021; Moreno-Torres et al. 2012; Recht et al. 2019; Taori et al. 2020). Hence, ML models experience a notable decrease in performance when dealing with out-of-distribution (OOD) target domains (Fig. 1).

Fig. 1 Domain generalization for semantic segmentation. Examples from the GTAV and ACDC datasets under different conditions such as Night and Fog, where the model is trained on the GTAV dataset and directly evaluated on the ACDC dataset

The issue of domain shift poses a substantial threat to the scalability of machine learning models across diverse applications in computer vision. One such example is semantic segmentation (SS), a crucial computer vision task in which each pixel of an image is assigned to a particular class (Guo et al. 2018). It has immense utility in applications such as autonomous driving, medical imaging, and image editing. Building a robust semantic segmentation model that works well in unfamiliar situations (new, unseen domains) is therefore essential. A direct approach to the domain shift challenge is to gather data from every possible domain, which is both costly and practically infeasible. An alternative is to collect data from the target domain and adapt the model trained on the source domain, an approach referred to as domain adaptation. This is not always feasible in real-world scenarios (Wang et al. 2022a): in many cases the target data are difficult to gather, or even unknown before deploying the model, e.g., in biomedical applications where it is impractical to collect new patient data in advance. It is therefore crucial to enhance the model’s generalization capability. To address this issue without requiring data from the target domain, domain generalization (DG) was introduced (Muandet et al. 2013). The aim is to enhance the generalization capability of machine learning models by leveraging one or more related yet distinct source domains. More recently, domain generalization has been applied to advance semantic segmentation. A few survey papers cover domain generalization (Zhou et al. 2022a; Wang et al. 2022a), but they provide a general survey of DG rather than focusing on a specific application. The survey of Zhou et al. (2022a) mentions the application of domain generalization to semantic segmentation, but it does not offer a broad and comprehensive treatment of DG for this task. Another survey (Li et al. 2023a) discusses transformers for the segmentation task. There are also several representative families of semantic segmentation methods, such as query-based and closed-set segmentation methods (Cheng et al. 2020; Yu et al. 2018; Li et al. 2020b; Kirillov et al. 2020; Zhang et al. 2021a; Wang et al. 2021).

This paper presents the first comprehensive survey on domain generalization for semantic segmentation. We introduce its recent advances, emphasizing formulations, theories, algorithms, research areas, datasets, applications, and potential future research directions. We anticipate that this survey will provide a comprehensive reference for researchers interested in this topic and spark further research in this and related areas. Several survey papers cover domain generalization (DG) and semantic segmentation separately; however, to the best of our knowledge, this is the first paper addressing domain generalization in the context of semantic segmentation. Our contributions are summarized as follows.

  • To the best of our knowledge, our survey is the first paper that comprehensively reviews domain generalization for semantic segmentation, which recently has caught growing attention in many computer vision applications.

  • We discuss the widely used datasets and evaluation metrics, and provide a quantitative comparison of the backbone segmentation models used in different DG approaches.

  • We provide future challenges and research directions that can be aggregated to solve underlying challenges in generalized semantic segmentation.

The rest of the paper proceeds as follows: Sect. 2 provides the necessary background, and Sect. 3 reviews deep learning methods for semantic segmentation. Section 4 touches on related sub-areas. We explore methodologies addressing domain generalization in Sect. 5, discuss medical segmentation applications in Sect. 6, and cover relevant datasets, benchmarks, and evaluation methods in Sect. 7. Section 8 presents a broad discussion of future research directions, and the paper concludes with Sect. 9.

2 Background

2.1 Problem formulation

Domain generalization, or OOD generalization, refers to the capability of a model to generalize to unseen target domains as well as to the source domains. The target domains are denoted as \({\mathcal {T}}\) = \(\{{\mathcal {T}}_1,\ldots ,{\mathcal {T}}_N\}\). Usually, there are multiple source domains \({\mathcal {S}}\) = \(\{{\mathcal {S}}_1,\ldots ,{\mathcal {S}}_M\}\) from which to train and learn invariant semantic features. A semantic segmentation model \(\phi\) outputs pixel-wise predictions p for a given image x, and consists of a feature extractor \(\phi _{ext}\) and a classifier \(\phi _{cls}\). While training the segmentation network, we have access to the source data \({\mathcal {D}}_s = \{(x^s, y^s)\}\) drawn from the source domains \({\mathcal {S}}\). Each sample \(x^s \in \mathbb {R}^{H\times W \times 3}\) has corresponding pixel-wise labels \(y^s \in \{0,1\}^{H \times W \times K}\), one-hot encoded over the \(K\) classes. The segmentation loss for the baseline network \(\phi\) can be calculated as

$$\begin{aligned} {\mathcal {L}}_{ss} = - \frac{1}{HW} \sum _{h,w,k=1}^{H,W,K} y^s_{h,w,k}\, \log \big (\phi (x^s)_{h,w,k}\big ) \end{aligned}$$
(1)

The main goal is to minimize the source-domain loss while ensuring high generalization to unseen target domains \({\mathcal {T}}\), where each target domain is an unlabelled dataset \({\mathcal {D}}_t = \{x^t\}\). Traditionally, the segmentation model is evaluated both on the source domains \({\mathcal {S}}\) and on the unseen target domains \({\mathcal {T}}\).
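To make the formulation concrete, the following is a minimal sketch of the training loop implied by Eq. (1): the model is optimized with a pixel-wise cross-entropy loss pooled over all available source domains, while the unseen target domains are never accessed. The model and loader names are illustrative rather than taken from any particular method.

```python
# Minimal sketch of DG training on multiple source domains (PyTorch).
# "SegModel" and "source_loaders" are illustrative names, not from any cited work.
import torch
import torch.nn as nn

def train_on_sources(model: nn.Module, source_loaders, optimizer, device="cuda"):
    """Minimize the pixel-wise cross-entropy of Eq. (1), pooled over all
    source domains; unseen target domains are never touched."""
    criterion = nn.CrossEntropyLoss()   # expects logits (B, K, H, W), labels (B, H, W)
    model.train()
    for loader in source_loaders:       # one loader per source domain S_1..S_M
        for x_s, y_s in loader:         # x_s: (B, 3, H, W); y_s: class indices
            x_s, y_s = x_s.to(device), y_s.to(device)
            logits = model(x_s)         # phi(x_s): (B, K, H, W)
            loss = criterion(logits, y_s)  # equivalent to the one-hot form of Eq. (1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```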

3 Deep learning methods in semantic segmentation: CNN and transformers

Recently, deep learning-based methods have played an important role in semantic segmentation, which is a key task in visual scene understanding. In recent years, CNN- and vision transformer-based methods have predominantly been used to address the challenges of semantic segmentation, and in this section we review some of these methods. DeconvNet (Noh et al. 2015) represents a significant contribution to this field, offering an approach that complements conventional Fully Convolutional Network (FCN)-based methodologies, which are known for their proficiency in extracting a generalized form of objects. In contrast to the FCN, DeconvNet systematically organizes proposals by size, efficiently capturing multi-scale objects and discerning finer object details. The innovation of SegNet (Badrinarayanan et al. 2015, 2017) lies in its approach to upsampling feature maps with low spatial dimensions within the decoder: SegNet retains the max-pooling indices from the encoder’s feature maps and reuses them for upsampling, bolstering its overall performance. Kendall et al. (2015) introduce a pixel-based probabilistic framework termed Bayesian SegNet by adapting the SegNet architecture. This adaptation implements a probabilistic encoder-decoder architecture using dropout (Srivastava et al. 2014), a technique also utilized for approximate inference in Bayesian CNNs (Gal and Ghahramani 2015). Besides CNNs, transformers are also extensively used in semantic segmentation. Strudel et al. (2021) used a vision transformer for semantic segmentation; they utilized the output embeddings corresponding to image patches and obtained class labels from these embeddings with a mask transformer. Other methods also use transformers (Xie et al. 2021; Zheng et al. 2021; Zhang et al. 2022b) for semantic segmentation.
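To illustrate the index-preserving upsampling idea behind SegNet, the following toy PyTorch block keeps the max-pooling indices from the encoder and reuses them in the decoder; it is a simplified sketch, not the original SegNet architecture.

```python
# Toy encoder-decoder block showing max-pooling index reuse (SegNet-style idea).
import torch
import torch.nn as nn

class TinySegNetBlock(nn.Module):
    def __init__(self, in_ch=3, mid_ch=16, num_classes=19):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # keep indices
        self.unpool = nn.MaxUnpool2d(2, stride=2)                   # reuse them
        self.dec = nn.Conv2d(mid_ch, num_classes, 3, padding=1)

    def forward(self, x):
        feat = self.enc(x)
        pooled, idx = self.pool(feat)   # indices record where each max came from
        up = self.unpool(pooled, idx, output_size=feat.shape)  # sparse upsampling
        return self.dec(up)             # per-pixel class logits

logits = TinySegNetBlock()(torch.randn(1, 3, 64, 64))  # -> (1, 19, 64, 64)
```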

4 Sub-related topics

This section addresses the connections and differences between DG for semantic segmentation and its related topics.

Domain adaptation (DA) is an approach for improving a model’s performance on a target domain with insufficient annotated data by using the information that the model has learned from a related domain with enough labeled data. The goal of domain adaptation is to lessen disparities in the feature space across domains, both marginal and conditional. This involves identifying common underlying attributes shared between the source and target domains and adjusting them to improve alignment. In other words, domain adaptation tries to lessen the detrimental impact of domain shift, which can cause a decline in semantic segmentation performance when the model is applied to data from a distinct distribution. It is the topic most closely related to DG and has been widely researched in the literature (Wang et al. 2023; Yang et al. 2022c; Xie et al. 2023; Toldo et al. 2022; Shyam et al. 2022b). Wang et al. (2023) proposed a target-to-source DA technique that encourages the model to learn comparable cross-domain properties using a dynamic re-weighting strategy. Yang et al. (2022c) proposed a framework that is easy to train and learns domain-invariant prototypes for domain adaptive semantic segmentation. To encourage the learning of class-discriminative and class-balanced pixel representations across domains and ultimately improve the performance of self-training methods, Xie et al. (2023) introduced Semantic-Guided Pixel Contrast (SePiCo), a one-stage adaptation model that emphasizes the semantic concepts contained in each individual pixel. Toldo et al. (2022) showed that, when learning incremental tasks, style transfer strategies can be used to transfer knowledge between domains, and a strong distillation framework can be used to successfully retain task information under incremental domain shift. To improve an underlying segmentation network such that it performs consistently in unseen real-world target domains, Shyam et al. (2022b) proposed utilizing a large number of synthetic source domains. Yang et al. (2023b) recommended a Sparse Visual Domain Prompts (SVDP) method to overcome domain shift issues in semantic segmentation; it realizes effective cross-domain learning and seeks to extract more regional domain-specific information. Self-ensembling models provide a different perspective on how to learn domain-invariant properties and introduce domain adaptability for semantic segmentation (Xu et al. 2019). He et al. (2021) offered an interactive learning method for domain adaptation that makes full use of the semantic knowledge across source domains without investigating any data from the target domain. The basic goal of domain generalization, by contrast, is to develop a model that performs well on target domains that were not encountered during training, i.e., to generalize across multiple domains. The similarities between DG and DA for semantic segmentation are the existence of domain shift and the transfer of knowledge between source and target domains. In contrast, DG deals with an unseen target domain, whereas DA addresses a known target domain. Different settings have also been proposed, such as source-free DA, where the source data are not available during adaptation.
Liu et al. (2021b) proposed a distillation-based source-free domain adaptation method that preserves source knowledge via knowledge transfer in order to retain contextual feature relationships for semantic segmentation. Yang et al. (2022a) suggested a source-free domain adaptation method based on self-training and distribution transfer, aligning the implicit feature representation of the source model. Kundu et al. (2021) proposed a source-free DA method based on pseudo-labeling generated by a multi-head framework; on top of it, they proposed a conditional prior-enforcing auto-encoder to retain high-quality pseudo labels in the target domain. You et al. proposed a source-free DA method based on positive and negative learning, where the main mechanism is to select class-balanced pseudo-labeled pixels and negative learning performs heuristic complementary label selection. Other related DA settings, such as unsupervised DA (Zou et al. 2018; Zhang et al. 2019; Lee et al. 2021; Sankaranarayanan et al. 2017), semi-supervised DA (Chen et al. 2021b; Wang et al. 2020b; Hoyer et al. 2023), and few-shot DA (Kalluri and Chandraker 2022; Lei et al. 2022), have also been applied to similar semantic segmentation problems.

Self-supervised learning (SSL) aims to tackle problems by pretraining a general model on an enormous quantity of unlabeled data and subsequently fine-tuning it on a downstream task with a limited amount of labeled data (Ziegler and Asano 2022). Its effectiveness lies in its capacity to make use of enormous quantities of unlabeled data and build accurate representations that capture salient patterns and structures in the data. It can be used to pre-train models and learn general-purpose representations, which are transferable and useful for domain generalization. The choice of the self-supervised task and the domain gap between the pretraining and evaluation datasets are crucial factors in determining the model’s success with SSL. SSL seeks to enhance performance on a particular task, such as semantic segmentation, by utilizing unlabeled data through auxiliary tasks. Domain generalization, on the other hand, focuses on making the model robust to different and unseen data distributions, enabling it to perform well in diverse real-world scenarios.

Semi-Supervised Learning (SeSL) is a branch of machine learning that emphasizes carrying out particular learning tasks using both labeled and unlabeled data (Van Engelen and Hoos 2020). Segmentation performance can be further enhanced by incorporating prediction filtering into established SWSSS algorithms (Bae et al. 2022). A semi-supervised framework built on Generative Adversarial Networks (GANs) was suggested by Souly et al. (2017) to ensure improved image quality for GANs and subsequently better pixel classification; these approaches were tested on various challenging visual datasets, i.e., PASCAL, SiftFlow, Stanford, and CamVid. A boundary-optimized co-training (BECO) method has been implemented to train the segmentation model in the presence of noisy pseudo-labels by casting WSSS as a robust learning problem (Rong et al. 2023). Kweon et al. (2023) proposed a completely new WSSS framework via adversarial learning of a classifier and an image reconstructor. To address noisy labels and multi-class generalization issues, Chen et al. (2023a) suggested an end-to-end multi-granularity noise reduction and bidirectional alignment (MDBA) model; with simple-to-complex image synthesis and complex-to-simple adversarial learning, this approach closes the data distribution gap in both the input and output space. An integrated transformer architecture was proposed by Lian et al. for learning two modalities of class-specific tokens, i.e., class-specific visual and textual tokens. In semantic segmentation, domain generalization involves training a model across various source domains so that it generalizes well to an unknown target domain. The primary distinction between SeSL and DG is that semi-supervised learning often assumes that the unlabeled data come from exactly the same distribution as the labeled data.

Multi-Task Learning (MTL) is a machine learning paradigm that attempts to capitalize on valuable knowledge from a variety of associated tasks to enhance the generalization performance of all the tasks (Zhang and Yang 2021). While DG aims to generalize a model to unknown data distributions, MTL aims to improve a model’s performance on the same set of tasks that the model was trained on. Using knowledge obtained from numerous diverse independent data sources, Graham et al. (2023) proposed a multi-task learning method for segmenting and categorizing nuclei, glands, lumina, and various tissue regions. Bischke et al. (2019) dealt with the issue of preserving semantic segmentation boundaries in high-resolution satellite imagery by using a recent multi-task loss; the bias introduced by the loss causes the network to give greater attention to pixels close to boundaries by using several output representations of the segmentation mask. Semantic segmentation performance can be improved by multi-task self-supervised learning with no additional annotation or inference-related computing costs (Novosel et al. 2019). The model suggested by Lu et al. (2020) learns segmentation and per-pixel depth regression from a single image input by using multi-task learning. Researchers have also introduced an approach to simultaneously estimate disparity maps and segment images by jointly training an encoder-decoder-based interactive convolutional neural network (CNN) for single-image depth estimation and a multi-class CNN for image segmentation. To steer the super-resolution model toward generating images that are most appropriate for segmentation rather than ones of high fidelity, Aakerberg et al. (2021) introduced an approach that jointly trains a super-resolution and a semantic segmentation model in an end-to-end manner using the same task loss for both models; in parallel, the segmentation model is updated to make more effective use of the enhanced images and raise segmentation accuracy. An innovative multi-task learning technique for the categorization of tumors in ABUS images, implementing an encoder-decoder network and a lightweight multi-scale network, has also been developed (Zhou et al. 2021). A sharing unit called the cross-stitch unit, which can be trained end-to-end, combines the activations from several networks (Misra et al. 2016). The goal of multi-task learning for semantic segmentation is to jointly build a model that carries out a variety of segmentation-related tasks by utilizing shared representations. By training a model on data from many source domains, domain generalization for semantic segmentation tries to make the model resilient to domain shift and enable it to function well on an unknown target domain. Both strategies use shared knowledge to enhance model performance, but they focus on different problems: task diversity in multi-task learning and domain shift in domain generalization.

Transfer Learning (TL) focuses on transferring knowledge from one (or more) problem/domain/task to another, related one (Pan and Yang 2009). Fine-tuning is a widely recognized example in contemporary deep learning: pre-train deep neural networks on enormous datasets, such as ImageNet (Deng et al. 2009) for vision models or BooksCorpus (Zhu et al. 2015) for language models, and then fine-tune them on downstream tasks (Girshick et al. 2014). To bridge the gap between a large source domain and a constrained target domain, Sun et al. (2019) suggested a technique that makes use of transfer learning for semantic segmentation; it adapts to the target domain using both real and synthetic images as learning sources. Ham et al. (2023) applied transfer learning of convolutional neural networks to perform robust breast segmentation in supine breast MRI without taking the supine or prone position into account. Yang et al. (2021) suggested an effective semantic segmentation technique that makes use of the feature extractor of a real-time object detection model. Nigam et al. (2018) presented a new dataset and suggested a successful method for comparing train and test distributions with totally distinct scene organization, views, and object statistics. A common transfer learning strategy is pretraining and fine-tuning, in which the tasks in the source and target domains differ and the target domain can be accessed during training. In DG, by contrast, the training and test tasks are typically the same despite having different distributions, and the target domain is not available. Thus, unlike DG, which assumes no access to the target data and instead focuses on model generalization, TL requires the target data for model fine-tuning on new downstream tasks.

Few-Shot Meta-Learning (FSML) is a machine learning technique that uses a minimal number of labeled samples per class to ensure that a pre-trained model generalizes to new types of data that it has not seen in training. It differs from traditional supervised learning, which requires a large amount of labeled training data and assumes that test samples follow a similar statistical distribution and come from the same categories as the training set. In FSML, even if the model was pre-trained on a statistically different data distribution, it can be extended to additional data domains as long as the data in the support and query sets are coherent. Pambala et al. (2021) proposed Semantic Meta-Learning (SML), a meta-learning system that builds prototypes for a small group of annotated training images using class-level semantic descriptions. Tian et al. (2020) introduced the MetaSegNet framework for multi-object segmentation, in which an embedding module composed of global and local feature branches extracts the appropriate meta-knowledge for few-shot segmentation. A novel Cycle-Resemblance Attention (CRA) module has been added to a self-supervised few-shot medical image segmentation network in order to make full use of the pixel-wise relationship between the query and support medical images (Ding et al. 2023). To tackle the challenging CD-FSS problem, Lei et al. (2022) introduced a Pyramid-Anchor-Transformation-based few-shot segmentation network (PATNet) that converts domain-specific attributes into domain-agnostic ones so that downstream segmentation modules can quickly adapt to unknown domains. For learning semantic alignment with query features, Chen et al. (2021a) presented a class-specific blueprint and a class-agnostic blueprint and produced complete sample pairs. The method proposed by Li et al. (2021c) randomly generates pseudo-classes in the background of the query images, supplying additional training data that is not available when forecasting particular target classes. The objective of domain generalization is to improve the robustness and generalization ability of models across various domains by addressing domain shift. The similarity between DG and FSML is that both strategies aim to increase the generalization ability of models; however, FSML focuses on adapting to new tasks, while DG enhances the model’s ability to perform well on unknown data distributions.

5 Methodology

Domain generalization methods fall into three categories (Wang et al. 2022a): (a) data manipulation, which manipulates the input for better learning of the data, e.g., data augmentation, generation, normalization, and randomization; (b) representation learning, which is arguably the most popular category, e.g., domain-invariant feature representation and feature disentanglement, where features are disentangled for domain-specific learning; and (c) learning strategy, which focuses on general learning capabilities to improve generalization, e.g., meta-learning and self-supervised learning. These categories are further divided into sub-categories. In this section, we provide a detailed explanation of existing domain generalization (DG) methods for semantic segmentation (SS). Figure 2 depicts the structure of the categories of domain generalization.

Fig. 2 Taxonomy of domain generalization methods

5.1 Data augmentation

Augmentation techniques have found extensive use in supervised learning for training machine learning models, reducing overfitting and enhancing the generalization performance of a model (Honarvar Nazari and Kovashka 2020; Shorten and Khoshgoftaar 2019; Khosla and Saini 2020; Yang et al. 2022b). The fundamental concept involves augmenting the original pairs (x, y) with new pairs (A(x), y), where A(x) denotes a transformation applied to x. Naturally, such techniques can be adopted for DGSS. Xu et al. (2021) introduced a data augmentation strategy called “amplitude mix,” which relies on Fourier-based techniques: it interpolates between the amplitude spectra of two images while preserving the phase information. Su et al. (2023) proposed SLAug, a saliency-balancing location-scale augmentation comprising global location-scale augmentation (GLA), which increases source-like images through global distribution shifting, and local location-scale augmentation (LLA), which conducts class-specific augmentation. Inspired by topology-altering augmentation techniques (Chen et al. 2019; Dwibedi et al. 2017; Kumar Singh and Jae Lee 2017; Yun et al. 2019), Sellner et al. (2023) demonstrated Organ Transplantation, an application-specific data augmentation that addresses geometric domain shifts. Based on adversarial style augmentation, Zhong et al. (2022) introduced an augmentation approach named AdvStyle, which generates challenging stylized images during training and effectively counters overfitting on the source domain. Kim et al. (2023a) proposed domain-generalized LiDAR semantic segmentation (DGLSS) by augmenting domains with diverse sparsity. Shyam et al. (2022a) introduced a style-mixing augmentation that causes features belonging to the same category to have different styles. To address blind feature alignment, Shen et al. (2023) proposed a cross-domain mixture data augmentation technique. Zhao et al. (2022a) proposed a clustering instance mix (CINMix) augmentation technique to diversify the layouts of the source data. Lyu et al. (2022) introduced Automated Augmentation for Domain Generalization (AADG), which aims to create novel domains through a proxy task to enhance diversity in the context of retinal image segmentation.
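As a concrete illustration of the Fourier-based “amplitude mix” idea, the sketch below interpolates the amplitude spectra of two images while keeping the phase of the first. The mixing ratio and the use of the full spectrum are simplifying assumptions; the cited methods may, for instance, restrict mixing to low-frequency components.

```python
# Hedged sketch of Fourier amplitude mixing for augmentation.
import numpy as np

def amplitude_mix(x_a: np.ndarray, x_b: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """x_a, x_b: float images of shape (H, W, C); returns x_a with mixed amplitude."""
    fft_a = np.fft.fft2(x_a, axes=(0, 1))
    fft_b = np.fft.fft2(x_b, axes=(0, 1))
    amp_a, pha_a = np.abs(fft_a), np.angle(fft_a)
    amp_b = np.abs(fft_b)
    amp_mix = (1 - lam) * amp_a + lam * amp_b     # interpolate amplitude spectra
    mixed = amp_mix * np.exp(1j * pha_a)          # keep the phase of x_a
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))
```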

5.2 Domain randomization

Domain randomization (DR) is a technique for improving the generalization ability of ML models to new domains. It involves the stochastic generation of synthetic data covering a wide range of potential domains, which encourages learning features that are invariant to factors such as lighting, object pose, and background clutter. Wu et al. (2022) proposed SiamDoGe, a segmentation method that hinges upon a feature randomization technique with the objective of learning domain-invariant features. Gong et al. (2022) formulated a strategy known as Class-Mixed Sampling Intermediate Domain Randomization (CIDR), which operates between the source and a pseudo-target domain. Peng et al. (2021) introduced Local Texture Randomization (LTR) and Global Texture Randomization (GTR) to randomize the texture of source images and diversify their texture styles. Xiao et al. (2023) designed PointDR, which alternately randomizes the geometry styles of point clouds and aggregates their embeddings in order to broaden the training point cloud distribution for 3D segmentation.
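A lightweight way to approximate this idea on real images is to randomize photometric properties of each source sample so that the network cannot rely on domain-specific appearance. The transforms and ranges below are illustrative assumptions, not those of any cited method.

```python
# Hedged sketch of photometric domain randomization on source images.
import random
import torchvision.transforms as T

def random_style(img):
    """img: PIL image from a source domain; returns a randomly re-styled version."""
    jitter = T.ColorJitter(
        brightness=random.uniform(0.2, 0.6),
        contrast=random.uniform(0.2, 0.6),
        saturation=random.uniform(0.2, 0.6),
        hue=0.1,
    )
    blur = T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0))
    # apply blur only part of the time to widen the range of appearances
    return blur(jitter(img)) if random.random() < 0.5 else jitter(img)
```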

5.3 Domain generation

Data generation is a technique for improving the generalization of machine learning models to novel domains, achieved by generating synthetic data covering a diverse range of domains. Chen et al. (2023b) proposed a Generative Semantic Segmentation (GSS) model based on the Vector-Quantized Variational AutoEncoder (VQVAE). Li et al. (2021a) introduced a generative framework built upon StyleGAN2 (Karras et al. 2020), tailored for addressing semantic segmentation tasks with generative models via the joint image-label distribution. Zhao et al. (2022b) proposed the Style-Hallucinated Dual Consistency learning (SHADE) framework, introduced to address domain shift challenges in the context of semantic segmentation.

5.4 Domain adversarial learning

Domain adversarial learning can be used in semantic segmentation to learn domain-invariant features. Ganin and Lempitsky (2015) first introduced the Domain-Adversarial Neural Network (DANN) with the objective of adapting between the source and target domains. In this architecture, a single network accommodates both the feature generator and the domain discriminator: the generator tries to fool the domain classifier, and the domain classifier forces the generator to extract domain-invariant features. Tjio et al. (2022) proposed an adversarial semantic hallucination (ASH) approach that aggregates a class-conditioned hallucination module and a semantic segmentation module; analogous to a generator and discriminator, the segmentation module and hallucination module challenge each other to boost the generalization capability of the model. Xu et al. (2022a) proposed an adversarial framework for organ segmentation from a single domain that ensures semantic consistency through contrastive learning with a mutual information regularizer; in the same work, a novel component, the Adversarial Domain Synthesizer (ADS), was incorporated to enable effective training on a single domain in the presence of domain shift. To improve cooperation between domains, Zhang et al. (2023) introduced MTDA, a self-training method combining pseudo-labeling and feature stylization. A GAN-based method was presented by Sankaranarayanan et al. (2018) to align the source and target data samples in the latent feature space.
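The core mechanism of DANN-style training is the gradient reversal layer: the domain classifier learns to identify the domain of each feature, while the reversed gradient pushes the feature extractor towards domain-invariant representations. Below is a minimal PyTorch sketch; the module names in the usage comment are illustrative.

```python
# Minimal gradient reversal layer used in DANN-style adversarial training.
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)            # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # flip the gradient sign

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage inside a training step (feat_extractor, domain_clf are user-defined modules):
# feats = feat_extractor(images)                   # shared features
# domain_logits = domain_clf(grad_reverse(feats))  # adversarial branch
# loss = seg_loss + domain_loss(domain_logits, domain_labels)
```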

5.5 Self supervised learning

Self-supervised learning (SSL) can also be used to improve generalization. The key idea is that a model learns generic features, regardless of the target task, by solving pretext tasks. Without the need for any domain labels, it can be used for semantic segmentation in both single- and multi-source settings (Zhou et al. 2022a). Vertens et al. (2020) proposed a multimodal semantic segmentation model utilizing a teacher-student training approach that transfers knowledge from the daytime domain to the nighttime domain. Yang et al. (2023a) proposed a Domain Projection and Contrastive Learning (DPCL) approach including self-supervised domain projection (SSDP) and multi-level contrastive learning (MLCL); SSDP aims to lessen the domain gap by projecting to the source domain. Zhou et al. (2022b) presented a multi-task paradigm with a domain-specific image restoration (DSIR) module employing self-supervision.

5.6 Meta learning

Meta-learning is often described as “learning to learn”: by learning from a variety of tasks, a model can quickly adapt to new tasks with limited data. The goal of meta-learning is to use prior knowledge from the learned tasks to handle new tasks efficiently. Since it can be employed to increase generalization, it can also be used for semantic segmentation, learning from a variety of complex scenarios. Kim et al. (2022) presented a memory-guided domain generalization method based on a meta-learning framework. Zhang et al. (2022a) introduced a novel domain generalization method for semantic segmentation that takes advantage of model-agnostic learning. Dou et al. (2019) adopted a model-agnostic learning paradigm with gradient-based meta-learning and introduced a pair of complementary losses designed to effectively regularize the semantic structure of the feature space. Gong et al. (2021) proposed a meta-learning-based strategy for addressing Open Compound Domain Adaptation (OCDA) in the context of semantic segmentation. Shiau et al. (2021) addressed domain-generalized semantic segmentation by proposing a novel meta-learning scheme with feature disentanglement ability. Zhang et al. (2022a) also developed a domain generalization framework that jointly exploits a model-agnostic training scheme and a target-specific normalization test strategy for semantic segmentation. Qiao et al. (2020) introduced adversarial domain augmentation to counter the OOD generalization problem by leveraging the meta-learning framework.
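A common way to instantiate meta-learning for DG is episodic training: in each iteration the source domains are split into meta-train and meta-test subsets, and the update is taken so that learning on the meta-train domains also reduces the loss on the held-out domain. The sketch below (assuming at least two source domains and PyTorch ≥ 2.0 for torch.func) is a generic illustration, not the exact procedure of any cited method.

```python
# Hedged sketch of one episodic (MAML-style) meta-learning step for DG.
import random
import torch
from torch.func import functional_call

def meta_step(model, domain_batches, seg_loss, inner_lr=1e-3):
    """domain_batches: dict {domain_name: (images, labels)} from >= 2 source domains."""
    names = list(domain_batches)
    random.shuffle(names)
    meta_train, meta_test = names[0], names[-1]   # hold one domain out

    params = dict(model.named_parameters())

    # inner step: virtual gradient update on the meta-train domain
    x, y = domain_batches[meta_train]
    inner_loss = seg_loss(functional_call(model, params, (x,)), y)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    fast = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}

    # outer step: evaluate the held-out domain with the virtually updated weights;
    # backpropagating the combined loss optimizes for generalization across domains.
    x_t, y_t = domain_batches[meta_test]
    outer_loss = seg_loss(functional_call(model, fast, (x_t,)), y_t)
    return inner_loss + outer_loss
```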

5.7 Feature disentanglement

Feature disentanglement refers to the process of separating the factors of variation by breaking down the learned representations of the data. In the context of DG, it can be used to separate domain-specific and domain-invariant features in the data, focusing on the features that vary across domains while learning domain-invariant ones. Jin et al. (2021) designed a Style Normalization and Restitution (SNR) module, where disentanglement aims at better restitution. Bi et al. (2023) proposed a mutual information (MI)-based framework to disentangle anatomical and domain feature representations. Similarly to this work (Bi et al. 2023), Li et al. (2021b) utilized MI-based disentangled representations for left atrial (LA) segmentation.

5.8 Feature normalization

Feature normalization is a process that standardizes data into a uniform and stable distribution without extra data (Liu et al. 2023). Liu et al. (2023) proposed the spectral-spatial normalization (SS-Norm) module to enhance the generalization ability of the model. Bahmani et al. (2021) enhanced the inference procedure with normalization layers.
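One simple normalization-based building block is instance normalization at the feature level, which removes the per-image channel statistics often associated with image style. The sketch below is a generic illustration of this idea, not the SS-Norm module itself.

```python
# Hedged sketch of style removal via instance normalization of feature maps.
import torch

def style_normalize(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """feat: (B, C, H, W) feature map; returns the instance-normalized map."""
    mu = feat.mean(dim=(2, 3), keepdim=True)                 # per-image channel mean
    sigma = feat.var(dim=(2, 3), keepdim=True).add(eps).sqrt()
    return (feat - mu) / sigma                               # style statistics removed
```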

5.9 Domain invariant

The main objective of domain-invariant representation-based approaches is to learn domain-invariant features from the source domain(s) that also apply to the target. By leveraging general semantic shape priors, Liu et al. (2022) presented Test-time Adaptation from Shape Dictionary (TASD), a novel approach to the single-domain generalization problem for medical image segmentation. Xu et al. (2022) proposed the Domain-Invariant Representation Learning (DIRL) algorithm to quantify and utilize feature priors for urban-scene segmentation. Liao et al. (2023) introduced a domain generalization approach for semantic segmentation exploiting edge and semantic layout reconstruction to clarify content information. He et al. (2023) designed Patch Statistical Perturbation (PSP) to enhance patch diversity, facilitating the model in learning domain-invariant features.

5.10 Pseudo label

Pseudo-labeling can be used to leverage unlabeled data from the target domain in domain generalization. The aim of pseudo-label-based DG for semantic segmentation is to enhance the quality of pseudo labels, which enables the model to generalize well to unknown domains. Zhang et al. (2023) established the Multi-Target Domain Adaptation (MTDA) framework, which leverages implicit stylization and pseudo-labeling based on self-training to improve alignment between target domains. Kim et al. (2023b) presented the WEDGE scheme, which uses web-crawled images with their predicted pseudo labels for semantic segmentation. Yao et al. (2022) suggested a confidence-aware cross pseudo supervision algorithm together with Fourier-transform-based data augmentation to improve the quality of pseudo labels for unlabeled images from unknown distributions; the Fourier transformation helps obtain low-level static information and augments the image data using cross-domain information, while confidence-aware regularization measures pseudo variances that can be used as a quality factor. Kundu et al. (2021) developed a conditional prior-enforcing auto-encoder to aid client-side self-training. Hoyer et al. (2022) proposed a UDA-based method, DAFormer, which comprises three training strategies; among other things, the quality of the pseudo-labels is improved by reducing the confirmation bias of self-training towards common classes through rare class sampling on the source domain.
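A minimal ingredient shared by many of these self-training pipelines is confidence-thresholded pseudo-labelling of unlabeled images, where low-confidence pixels are ignored during training. The threshold value and the ignore_index convention in the sketch below are assumptions for illustration.

```python
# Hedged sketch of confidence-thresholded pseudo-labelling for unlabeled images.
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(model, x_unlabeled, threshold=0.9, ignore_index=255):
    """x_unlabeled: (B, 3, H, W); returns (B, H, W) pseudo-labels where
    low-confidence pixels are marked with ignore_index."""
    probs = F.softmax(model(x_unlabeled), dim=1)   # (B, K, H, W)
    conf, labels = probs.max(dim=1)                # per-pixel confidence and class
    labels[conf < threshold] = ignore_index        # discard uncertain pixels
    return labels
```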

5.11 Style transfer

Style transfer is a technique used to change the style of an image while maintaining its content. In the context of domain generalization, it makes it possible to build an overlap between source and target domains (Su et al. 2022). Su et al. (2022) introduced a framework that performs effective stylization while preserving fine-grained semantic cues for semantic segmentation. Wang et al. (2022b) proposed Feature-based Style Randomization (FSR), which produces random styles to enhance model robustness. Lee et al. (2022) proposed feature stylization together with content extension learning, style extension learning, and semantic consistency regularization, extending both the content and the style of the source domain to the wild. Zhao et al. (2022b) proposed SHADE, based on two components, Style Consistency (SC) and Retrospection Consistency (RC), to address domain shift. Gong et al. (2019) presented the domain flow generation (DLOW) model, which is able to translate images from the source domain into a random intermediate domain between the source and target domains. Fantauzzo et al. (2022) introduced FedDrive, a federated learning approach to semantic segmentation combined with style transfer techniques to improve generalization.
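Feature-level stylization is often implemented in the spirit of adaptive instance normalization (AdaIN): the content features keep their spatial structure but adopt the channel-wise statistics of a style feature map, and sampling or perturbing these statistics is one simple way to diversify source styles. The sketch below is a generic illustration; the exact sampling schemes of the cited methods differ.

```python
# Hedged sketch of AdaIN-style feature stylization.
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """content, style: (B, C, H, W); returns content re-styled with style statistics."""
    c_mu = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.var(dim=(2, 3), keepdim=True).add(eps).sqrt()
    s_mu = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.var(dim=(2, 3), keepdim=True).add(eps).sqrt()
    # normalize content statistics, then re-scale with the style statistics
    return s_std * (content - c_mu) / c_std + s_mu
```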

6 Medical segmentation

In this section, popular applications of domain generalization (DG) in medical segmentation are discussed. Semantic segmentation is widely used in medical imaging for the precise diagnosis of diseases, and domain shift poses a challenge here as well, so there are ample applications of DG-based semantic segmentation in the medical domain. Luo et al. (2023) proposed a single-domain DG framework based on dual-level mixing for fundus image segmentation. Lyu et al. (2022) proposed an augmentation-based domain generalization method for retinal image segmentation; this method generates novel domains for training, and a novel proxy task maximizes the diversity between the generated domains. Wang et al. (2020a) proposed a domain-oriented feature embedding to improve domain generalization for fundus image segmentation. Wang et al. (2019) also presented a method based on unsupervised domain adaptation via boundary-entropy-driven adversarial learning for optic disc (OD) and optic cup (OC) segmentation from fundus images. Liu et al. (2022) use T2-weighted MRIs from three public datasets, NCI-ISBI13 (Bloch et al. 2015), I2CVB (Lemaître et al. 2015), and PROMISE (Litjens et al. 2014), for prostate MRI segmentation, and REFUGE (Orlando et al. 2020), Drishti-GS (Sivaswamy et al. 2015), and RIM-ONE-r3 (Fumero et al. 2011) for fundus image segmentation. There are also a few works on single-domain generalization (SDG): Su et al. (2023) use a cross-modality abdominal dataset (Landman et al. 2015) and a cross-sequence cardiac dataset (Zhang et al. 2021b) for two single-source domain generalization (SDG) tasks. Yao et al. (2022) utilize the M&M (Campello et al. 2021) and SCGM (Prados et al. 2017) datasets for multi-disease cardiac image segmentation. Xu et al. (2022a) conducted experiments on cross-modality image segmentation with abdominal CT scans (Landman et al. 2015) and MRI scans (Kavur et al. 2021). Bi et al. (2023) evaluate MI-SegNet, a medical segmentation framework, on the ValS, TS1, TS2, and TS3 (Říha et al. 2013) datasets. Hu et al. (2021) demonstrate the effectiveness of the proposed DAC and CAC modules on prostate segmentation using MRI, COVID-19 lesion segmentation using CT, and OC/OD segmentation using color fundus images (Wang et al. 2020a; Liu et al. 2020; Tsai et al. 2021). Wang et al. (2020a) evaluate the novel Domain-oriented Feature Embedding (DoFE) framework on optic cup (OC)/disc (OD) segmentation and vessel segmentation with retinal fundus image datasets. For semantic segmentation of hyperspectral images, Sellner et al. (2023) use 600 intraoperative hyperspectral images (HSI) under geometric domain shift. For left atrial (LA) segmentation, Li et al. (2021b) use late gadolinium-enhanced magnetic resonance imaging (LGE MRI) from the MICCAI 2018 Atrial Segmentation Challenge (Pop et al. 2019) and the ISBI 2012 Left Atrium Fibrosis and Scar Segmentation Challenge (Meng et al. 2020). Liu et al. (2021a) evaluate the proposed MixSearch framework on the Composite, ISIC, CVC, Union, and CHAOS-CT datasets. Gu et al. (2021) demonstrate experimental results of the proposed DCA-Net on multi-site prostate MRI segmentation using a T2-weighted MRI dataset (Lemaître et al. 2015). Lyu et al. (2022) validate the proposed AADG framework on fundus vessel, OD/OC, retinal lesion, and OCTA vessel segmentation. Zhou et al. (2022b) demonstrate the effectiveness of their framework on fundus (Wang et al. 2020a) and prostate (Liu et al. 2020) segmentation tasks.

7 DGSS datasets and evaluation

7.1 Datasets

We describe the most common and widely used benchmarks for the DGSS task. DGSS benchmarks are divided into synthetic and real-world datasets; there are also some rarely used datasets, such as ADE20k (Zhou et al. 2019) and MSeg (Lambert et al. 2020), available for DGSS, shown in Table 2.

GTA-V. GTA-V (Richter et al. 2016) is a synthetic semantic segmentation dataset that consists of nearly 25,000 densely labeled samples with 19 individual classes. The resolution of each sample is \(1914 \times 1052\) pixels. It is extensively used in domain-generalized segmentation tasks.

Cityscapes. Cityscapes (Cordts et al. 2016) is a real-world driving dataset that consists of nearly 5000 labeled samples with 30 individual classes. The resolution of labeled samples is \(2048 \times 1024\) pixels. In most of the DG literature, this dataset is used as a target set.

Mapillary. Mapillary (Neuhold et al. 2017) is a real-world semantic segmentation dataset that consists of 25000 labeled samples of 66 classes. The resolution of each sample is \(1920 \times 1024\) pixels.

SYNTHIA. SYNTHIA (Ros et al. 2016) is a synthetic dataset, as its name suggests, developed for semantic segmentation and urban scene understanding. It contains three different weather and illumination conditions across three different road settings (Highway, New York-like, and Old European Town). Most works utilize 13 classes from this dataset, which has 9400 labeled samples. The resolution of each sample is \(960 \times 720\) pixels.

KITTI. KITTI (Geiger et al. 2012) is a real-world semantic segmentation dataset that consists of nearly 400 labeled samples of 28 classes. The resolution of each sample is \(1240 \times 376\) pixels.

IDD. IDD (Varma et al. 2019) is a real-world driving dataset that consists of nearly 10,000 labeled samples of 34 classes. The resolution of each sample is \(1678 \times 968\) pixels.

BDD100k. BDD100k (Yu et al. 2020) is a real-world driving dataset that consists of 10000 labeled samples of 19 classes. The resolution of each sample is \(1280 \times 720\) pixels.

ACDC. ACDC (Sakaridis et al. 2021) is a real-world driving dataset that consists of 4000 labeled samples of 19 classes. The resolution of each sample is \(1920 \times 1080\) pixels (Table 1).

Table 1 Popular and extensively used datasets in domain generalization for semantic segmentation task

8 Future research directions

8.1 Variation in segmentation models

In most recent works, variants of the DeepLab model are used as the segmentation model, with ResNet-50/ResNet-101 or VGG-16 as the backbone network. However, very little work utilizes the power of vision transformers in DG research. Fully exploiting vision transformers could lead to promising results under multiple challenging conditions, yet it is not well understood how vision transformers perform under domain gaps; hence ViTs should be extensively explored as backbone networks.

8.2 Continual domain generalization

In many real-world applications, a system encounters online data belonging to non-stationary distributions. To build a more robust segmentation model, generalization should therefore be continual against non-stationary distributed data, allowing the model to learn and adapt efficiently without catastrophic forgetting (Douillard et al. 2021). To our knowledge, little work has been done in this area.

8.3 Test-time generalization in segmentation

Most generalization is performed in the training phase; the inference phase can also be exploited to make models more robust for real-world applications. This would allow us to leverage the power of domain adaptation and generalization in a single framework. Test-time generalization allows more flexibility and efficiency under limited resources (Wang et al. 2022a).
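As one illustration of what generalization at inference time can look like, the sketch below performs test-time entropy minimization in the spirit of methods from the test-time adaptation literature, updating only the normalization-layer parameters on the incoming test batch. This is a generic example and not a method proposed in the surveyed papers.

```python
# Hedged sketch of test-time adaptation via entropy minimization.
import torch
import torch.nn as nn
import torch.nn.functional as F

def test_time_step(model: nn.Module, x_test: torch.Tensor, lr: float = 1e-4):
    # collect only the normalization-layer parameters for the test-time update
    norm_params = [p for m in model.modules()
                   if isinstance(m, (nn.BatchNorm2d, nn.LayerNorm))
                   for p in m.parameters()]
    optimizer = torch.optim.SGD(norm_params, lr=lr)
    probs = F.softmax(model(x_test), dim=1)                     # (B, K, H, W)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()       # gradients flow everywhere, but only norm params are updated
    optimizer.step()
    return entropy.item()
```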

8.4 Large-scale benchmark

Most existing benchmarks are relatively small considering industrial applications. To achieve better generalization, we need large-scale benchmarks that cover non-stationary shifts in real-world target domains. Currently, most segmentation benchmarks are built from camera data, but autonomous vehicles and other related applications also actively leverage additional sensing modalities such as LiDAR, which calls for larger multi-modal benchmarks.

8.5 Interpretability

Domain-invariant methods provide some degree of interpretability in DG for segmentation tasks. However, other conventional DG methods are not comprehensively interpretable. In many cases, we need to understand how the learned representations and predictions relate to the input space. This area could be explored further, especially for autonomous driving applications.

8.6 Vision-language models

Recently, vision-language models (VLMs) (Zhang et al. 2024a) have shown remarkable zero-shot transfer ability on multiple downstream tasks, thanks to explicit vision-language pre-training (Gao et al. 2022; Bao et al. 2022). VLMs are becoming useful in OOD generalization tasks, and several studies have explored plausible solutions for VLMs in domain generalization (Chen et al. 2024; Wang et al. 2024; Li et al. 2023b). However, most solutions focus on general-purpose domain generalization rather than being specialized for out-of-distribution segmentation tasks. Given the recent high potential of vision-language models across many applications, this area deserves further exploration.

8.7 Open vocabulary learning

With the recent emergence of vision-language models, open vocabulary learning (Wu et al. 2024) has been proposed: models can discover categories beyond those in the training set and are no longer restricted to closed-set classification. Recently, open vocabulary learning has been adopted for domain adaptation tasks (Huang et al. 2023) and is widely considered for semantic segmentation (Xu et al. 2023b; Liang et al. 2023; Xu et al. 2023a). This area therefore also has high potential for DG-based semantic segmentation, and future research can explore it.

8.8 Multimodal large-language models

Recently, multimodal large language models (MLLMs) have become a focal point of AI research. They show surprising capability on many downstream tasks, such as writing, code generation, and mathematical reasoning, and they are increasingly being utilized in semantic segmentation (Yang et al. 2024; Li et al. 2022a). However, MLLMs remain less explored for solving OOD problems (Zhang et al. 2024b), so this area can also be explored, particularly for DG in semantic segmentation.

9 Conclusion

In this paper, we comprehensively review recent advances in domain generalization for semantic segmentation. In semantic segmentation, domain adaptation is widely explored, whereas domain generalization is not yet well adopted, even though generalization addresses more realistic and challenging scenarios. Our survey focuses on this very promising area. Most recent works have focused on domain adaptation in segmentation tasks, but the main challenge remains large-scale deployment in industrial settings. We have explored recent generalization methods used in segmentation and provide a comprehensive overview of the field, together with the related background and methods that are extensively used in semantic segmentation alongside domain generalization. We also provide a critical analysis of future research directions in DG for segmentation; based on this analysis, we recommend exploring variation in baseline segmentation models, continual generalization in real-world settings, test-time generalization, and interpretability. We believe that this survey will bring a new dimension to the community and stimulate interest in applying domain generalization to semantic segmentation tasks.