1 Introduction

An intelligent combination of redundant, complementary functional modules is one strategy to enhance the accuracy of an overall system as well as its robustness to unseen data. Existing aggregation methods either combine neural network outputs, as is the case with ensembles, or combine models at the structural level, generally referred to as fusion, which includes multi-stream and multi-head neural networks.

Ensembles of identically structured models are a typical approach to gain accuracy by combining several separately trained modules. Combination strategies include (un-)weighted averaging, majority voting, stacking, or forming a committee with a combination rule. Ensembles are conceptually simple: their deterministic combination rules are easy to implement and interpret. The approach also scales easily to a multitude of modules. Deep ensembles, i.e., ensembles of deep neural networks trained from different random initializations, have become one of the de facto standards for achieving better test accuracy and model robustness [LPB17].

However, since the combination rules in ensembling are static, no sophisticated, input-dependent combinations are possible. In the case of deep learning models, each ensemble member usually needs to be trained separately. In addition, the overall result can only be computed once the outputs of all members are available.

Fusion lies at the opposite end of the spectrum of model combination approaches. Information from several models is combined either starting from early layers (early fusion), over multiple network layers (slow fusion), or by merging only the last layers (late fusion) [KTS+14]. Regardless of the fusion type, these methods usually involve a complex implementation and offer little or no insight into the decision-making process itself.

Fig. 1 In an ensemble (a), the combination rule is deterministic, so that the distribution of expert weights is fixed for all inputs. A mixture of experts (MoE) (b) contains an additional gate module, which predicts the distribution of the expert weights for each input

The mixture-of-experts architecture, first proposed by Jacobs et al. [JJNH91], takes a middle path and combines the simplicity and interpretability of the result with the possibility of forming sophisticated network architectures. Similar to an ensemble, a mixture of experts combines several separate modules named experts. However, an additional, trainable gating component allows the expert outputs to be weighted on a per-input basis (see Fig. 1).

Here, we introduce the general concept more thoroughly and extend the evaluation from our previous work [PHW+20] to demonstrate the ability of MoE architectures to perform the task at hand (semantic segmentation of traffic scenes) and to provide post-hoc explainability via a comparison of internal decisions.

The contributions on top of our previous work [PHW+20] are as follows: The experiments are extended to include and evaluate a second expert architecture (DeepLabv3+). We also provide new results for the complete label set of the A2D2 dataset. Moreover, a deeper analysis of the impact of the chosen feature extraction layer for the FRRN architecture is performed, whereas in previous work only the last two pooling layers were considered. Then, the expert weights predicted by different gate architectures and the benefit of an additional convolutional layer are analyzed and discussed. A more advanced MoE architecture, in which the encoder layers of the experts are shared, is considered and evaluated. Finally, the proposed MoE architecture is additionally compared to an ensemble of experts.

2 Related Works

Mixtures of experts have primarily been applied to divide-and-conquer tasks, where the input space is split into disjoint subsets, such that each expert specializes in a separate subtask and a gating component learns a distribution over these experts.

One of the application areas of MoEs studied in current publications is fine-grained image classification, where the task is to discriminate between classes within a sub-category of objects, such as the bird classification task [GBM+16]. Ahmed et al. [ABT16] propose a tree-like structure of experts, where the first expert (a generalist) learns to discriminate between coarse groups of classes, whereas further experts (specialists) learn to categorize objects within a specific group. This way, at inference time, the generalist model acts as a gate by selecting the correct specialist for each input.

The mixture-of-experts approach has also been extensively applied to multi-sensor fusion. To achieve this, each expert is trained on a certain sensor modality and a gating component combines the outputs of the experts for different input types. Mees et al. [MEB16] apply a mixture of three experts for different modalities (RGB, depth, and optical flow input) to the task of object recognition in mixed indoor and outdoor environments. An analysis of the weights predicted by the gate shows that the MoE adapts to the changing environment by choosing the most appropriate combination of expert weights; for example, the depth expert receives a higher weight for darker image frames. A similar setup is used in [VDB16], where it is applied to the task of semantic segmentation. As in the previous setting, the MoE demonstrates the ability to compensate for a failed sensor via an appropriate weight combination.

Furthermore, in [FC19] and [FC20], the MoE approach is used to handle multi-modal end-to-end learning for autonomous driving.

A stacked MoE model, consisting of two MoEs, each containing its own experts and corresponding gates, is proposed in [ERS14]. Although evaluated only on MNIST, this work was an important step toward studying large models that conditionally execute certain parts of the overall architecture for each image. The idea was further developed in an influential work on sparse MoEs for language modeling and machine translation [SMM+17], where a generic MoE layer is proposed. Similar to a general MoE, the embedded MoE layer contains several expert networks as well as a gating component, but it forms only a part of the overall architecture trained in an end-to-end manner. During inference, only the experts with non-zero weights are activated in each MoE layer, thus enabling conditional computation on a per-input basis.
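To make the conditional-computation idea concrete, the following sketch shows a top-k routed MoE layer in PyTorch, in which only the selected experts are evaluated for each input. It is a minimal illustration under our own naming; the noisy gating and load-balancing losses of [SMM+17] are omitted, and the experts are assumed to map vectors of size d_model back to d_model.

    import torch
    import torch.nn as nn


    class SparseMoELayer(nn.Module):
        """Minimal sketch of a sparsely gated MoE layer (no noise, no balancing loss)."""

        def __init__(self, experts, d_model, k=2):
            super().__init__()
            self.experts = nn.ModuleList(experts)   # each expert: (batch, d_model) -> (batch, d_model)
            self.router = nn.Linear(d_model, len(experts))
            self.k = k

        def forward(self, x):                       # x: (batch, d_model)
            scores = self.router(x)
            topk, idx = scores.topk(self.k, dim=-1)
            weights = torch.softmax(topk, dim=-1)   # renormalize over the k selected experts
            out = torch.zeros_like(x)
            for j in range(self.k):                 # evaluate only the experts that were selected
                for e in idx[:, j].unique():
                    mask = idx[:, j] == e
                    w = weights[mask, j].unsqueeze(1)
                    out[mask] += w * self.experts[int(e)](x[mask])
            return out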

The mixture-of-experts approach is also tightly connected to the selective execution or conditional computation paradigm. Normally, a straightforward way to boost the performance of a deep learning model, given a sufficiently large dataset, is to increase its capacity, i.e., its number of parameters. This, however, leads to a roughly quadratic increase in the computational cost of training [SMM+17]. One approach to overcome this is to execute only a certain part of the network in an input-dependent manner, as done, for example, by dynamic deep networks [LD18]. A large model with hundreds of MoE layers with dynamic routing is explored in [WYD+19]. Recent work [LLX+21] continues this line of research and applies conditional computation and automatic sharding to a large, sparsely gated MoE combined with transformers for neural machine translation.

In contrast to existing work, we treat the MoE not only as a method to achieve higher accuracy but also as a post-hoc interpretability approach. The described architecture is closer to earlier MoE methods [MEB16] and less complex than the stacked and embedded MoE.

3 Methods

In this section, we formally introduce the mixture-of-experts approach and show how discrepancy masks arise from a comparison of the intermediate predictions within an MoE and how they can serve as a means to localize regions of high uncertainty in an input.

3.1 MoE Architecture

The mixture-of-experts architecture presented here originates from our previous work [PHW+20] and is partially inspired by Valada et al. [VDB16]. In this setting, the MoE consists of two experts, each trained on a separate subset of the data, as well as a small gating network. Deviating from the standard MoE approach, we reuse existing feature maps, extracted from the encoder part of the experts, as input to the gate. These feature maps are much smaller than the raw input and can thus speed up both training and inference (see Fig. 2).

Fig. 2 Architecture of an MoE with separate feature extractors. The gate module receives a concatenation of feature maps \(\mathbf {f}_\ell ^{\text {expert}_1}\) and \(\mathbf {f}_\ell ^{\text {expert}_2}\), extracted from the layer \(\ell \) in an encoder part of each of the experts. The gate then predicts the expert weights \(w^{\text {expert}_1}\) and \(w^{\text {expert}_2}\)

Formally, an MoE \(\mathbf {F}^{\text {MoE}}\) consists of n experts \(\{\mathbf {F}^{\text {expert}_\nu }\}_{\nu =1...n}\) and a gate \(\mathbf {F}^{\text {gate}}\). Assume each expert \(\mathbf {F}^{\text {expert}_\nu }\) follows an encoder-decoder architecture. We select a certain feature extraction layer \(\ell \) from the encoder part of each expert. The gate \(\mathbf {F}^{\text {gate}}\) receives a concatenation of feature maps \(\mathbf {f}_\ell ^{\text {expert}_\nu }\) from layer \(\ell \) of each expert and computes expert weights for an input \(\mathbf {x} \in \mathcal {I}^{H\times W\times C}\) with height H, width W, number of channels \(C=3\), and \(\mathcal {I}= [0, 1]\) via

$$\begin{aligned} (w^{\text {expert}_1}, ..., w^{\text {expert}_n}) = \mathbf {F}^{\text {gate}}\left( \mathbf {f}_\ell ^{\text {expert}_\nu }\Big |_{\nu =1,...,n}\right) . \end{aligned}$$
(1)

To calculate the overall MoE prediction, the weighted sum of logits \(\mathbf {y}^{\text {expert}_\nu }\) of all experts is computed as

$$\begin{aligned} \mathbf {y}= \mathbf {F}^{\text {MoE}}(\mathbf {x}) = \sum _{\nu =1}^{n}w^{\text {expert}_\nu }\cdot \mathbf {y}^{\text {expert}_\nu }. \end{aligned}$$
(2)

The resulting approach can thus be interpreted as an extension of ensembles, where weighting of the outputs of individual models is not predefined, but chosen according to the outputs of a trainable gate.
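As an illustration, the following PyTorch sketch implements Eqs. (1) and (2) for a generic number of experts. It assumes that each expert returns both its encoder feature map and its logits; the class and attribute names are placeholders, not the original implementation.

    import torch
    import torch.nn as nn


    class GatedMixture(nn.Module):
        def __init__(self, experts, gate):
            super().__init__()
            self.experts = nn.ModuleList(experts)   # each expert returns (feature map, logits)
            self.gate = gate                        # maps concatenated feature maps to n weights

        def forward(self, x):
            feats, logits = zip(*(expert(x) for expert in self.experts))
            # Eq. (1): the gate predicts one weight per expert from the concatenated feature maps.
            w = self.gate(torch.cat(feats, dim=1))  # shape (batch, n_experts)
            # Eq. (2): weighted sum of the expert logits.
            return sum(w[:, v, None, None, None] * logits[v]
                       for v in range(len(self.experts)))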

To further increase the capacity of the model, this architecture can be enhanced in a late fusion manner. This is achieved by appending an additional convolutional layer, which processes the weighted sum of the expert outputs before the final MoE prediction is computed (cf. Fig. 2).

Since convolutional neural networks tend to learn similar generic features in the lower layers, these layers can be shared across experts to reduce the overall number of parameters (see Fig. 3). In this case, each of the expert decoders and the gate receives the same pre-processed input \(\mathbf {f}_\ell ^{\text {encoder}}\).

Fig. 3 Architecture of an MoE with a shared feature extractor. In contrast to the architecture in Fig. 2, the gate module receives a single feature map \(\mathbf {f}_\ell ^{\text {encoder}}\) from the shared feature extractor as an input

3.2 Disagreements Within an MoE

The multi-component nature of an MoE allows analyzing its final predictions by comparing them with several intermediate predictions, which are either taken into account or rejected by the overall model. The final decision of an MoE is made on a per-pixel basis and consists of weighting the individual decisions of the participating experts.

By comparing the predictions of the individual experts \(\mathbf {y}^{\text {expert}_\nu }\) with the prediction of the MoE \(\mathbf {y}^{\text {MoE}}\), we can decide for each pixel whether its classification poses a difficulty to the overall model.

We consider the three cases listed below. The discrepancy mask \(\mathbf {y} = (y_{h,w}) \in \mathcal {B}^{H\times W}\) has the same height H and width W as the MoE prediction \(\mathbf {y}^{\text {MoE}}\), with h and w denoting the row and column index and \(\mathcal {B}= \left\{ 0, 1\right\} \). For each case, the corresponding discrepancy mask for an input image is calculated as follows (the symbol \(\bigwedge \) denotes the logical conjunction operator):

  • Perfect case: Here, the prediction by all experts and the MoE is the same. The elements of the discrepancy mask \(\mathbf {y}^{\text {perfect}}\) for the perfect case are computed as follows:

    $$\begin{aligned} y_{h,w}^{\text {perfect}}=1 \iff \bigwedge _{\nu =1}^{n}{\bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_\nu })\big )_{h,w} = \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg )}. \end{aligned}$$
    (3)
  • Normal case: Here, the prediction by one of the experts and the MoE is the same. This case can further be split into normal case 1, where the same prediction is output by expert 1 and the MoE, normal case 2 for expert 2 and the MoE, etc. The elements of the discrepancy mask \(\mathbf {y}^{\text {normal}_1}\) for the normal case 1 are computed as follows:

    $$\begin{aligned} \begin{aligned} y_{h,w}^{\text {normal}_1}=1 \iff \bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_1})\big )_{h,w} = \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg ) \wedge \\ \bigwedge _{\nu =2}^{n}{\bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_\nu })\big )_{h,w} \ne \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg )}. \end{aligned} \end{aligned}$$
    (4)
  • Critical case: Here, the prediction of the MoE differs from the predictions of all experts. The elements of the discrepancy mask \(\mathbf {y}^{\text {critical}}\) for the critical case are computed as follows:

    $$\begin{aligned} y_{h,w}^{\text {critical}}=1 \iff \bigwedge _{\nu =1}^{n}{\bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_\nu })\big )_{h,w} \ne \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg )}. \end{aligned}$$
    (5)

The critical case, in which the MoE outputs a prediction that was not foreseen by any of the experts, helps to identify image regions that are challenging for the current model and could, for example, require more relevant examples in the training data.
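The three masks can be computed directly from the argmax predictions. The following PyTorch sketch illustrates Eqs. (3)-(5) for a single image; the function and tensor names are placeholders.

    import torch


    def discrepancy_masks(expert_logits, moe_logits):
        """expert_logits: list of n tensors of shape (C, H, W); moe_logits: tensor of shape (C, H, W)."""
        moe_cls = moe_logits.argmax(dim=0)                     # (H, W) predicted class map
        agree = torch.stack([e.argmax(dim=0) == moe_cls        # per-expert agreement with the MoE
                             for e in expert_logits])          # (n, H, W), boolean
        perfect = agree.all(dim=0)                             # Eq. (3): all experts agree
        normal = [agree[v] & ~(agree[:v].any(dim=0) | agree[v + 1:].any(dim=0))
                  for v in range(len(expert_logits))]          # Eq. (4): only expert v agrees
        critical = ~agree.any(dim=0)                           # Eq. (5): no expert agrees
        return perfect, normal, critical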

4 Experiment Setup

To demonstrate the mixture-of-experts architecture, we chose the application of semantic image segmentation of traffic scenes. The experiments build upon and extend our previous work [PHW+20].

4.1 Datasets and Metrics

Dataset: All experiments were performed using the A2D2 dataset [GKM+20], consisting of RGB images with detailed semantic segmentation masks. In addition to the semantic segmentation labels, we manually labeled the front camera images by road type (highway vs. urban vs. ambiguous). The disjoint subsets of highway and urban images were then used to train two expert networks, one for each data subset. Even though the urban images outnumber those depicting highway scenes, the same number of training (6132) and validation (876) samples is used for each expert. Additionally, 1421 test samples of each of the three road classes (including ambiguous) are available. All images are resized to \(640\times 480\) pixels for training and inference. In contrast to previous work [PHW+20], we use the complete label set of the A2D2 dataset here, comprising 38 classes.

Metrics: The distribution of labels in the A2D2 dataset is highly imbalanced. In particular, objects of nine out of the 38 classes occur in less than 5% of the images. To account for this, the frequency-weighted IoU (fwIoU) was used for evaluation in addition to the mean intersection-over-union (mIoU). For the mIoU, the class-wise IoU values are averaged. For the fwIoU, each class-wise IoU is weighted by its class frequency, pre-computed as the ratio of pixels in the dataset labeled as belonging to that class.
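For reference, both metrics can be computed from a pixel-level confusion matrix as sketched below; how classes that never occur in the test data were handled in the reported numbers is not specified, so this sketch simply skips them.

    import numpy as np


    def iou_metrics(conf):
        """conf: (num_classes, num_classes) confusion matrix with rows as ground truth."""
        tp = np.diag(conf).astype(float)
        denom = conf.sum(axis=1) + conf.sum(axis=0) - tp       # TP + FP + FN per class
        iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
        miou = np.nanmean(iou)                                 # unweighted average over classes
        freq = conf.sum(axis=1) / conf.sum()                   # per-class pixel frequency
        fwiou = np.nansum(freq * iou)                          # frequency-weighted IoU
        return miou, fwiou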

4.2 Topologies

Expert architectures: To keep the focus on the overall mixture-of-experts architecture, two different expert base architectures are evaluated: the Full-Resolution Residual Network (FRRN) [PHML17] and DeepLabv3+ [CZP+18].

An FRRN is composed of two interconnected streams: a pooling stream and a residual stream. While the residual stream carries information at the original image resolution, the pooling stream follows the encoder-decoder approach with a series of pooling and unpooling operations. The whole FRRN is constructed as a series of full-resolution residual units (FRRUs). Each FRRU takes information from both streams as input and also outputs information to both streams, thus having two inputs and two outputs. An FRRU contains two convolutional layers, each followed by a ReLU activation function.

For the following experiments, we use a shallower version of FRRN, namely FRRN-A. The encoder part of the FRRN-A pooling stream consists of ten FRRU blocks, which are distributed into four groups. Each group contains FRRU blocks with the same number of channels in their convolution layers; the group of FRRU blocks is correspondingly denoted by the number of channels (e.g., \(\text {FRRU}_{\text {96}}\) is a group of FRRUs containing convolutions with 96 channels). Each group is then followed by a pooling layer (max-pooling in our case).
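To illustrate the two-stream design, the following simplified PyTorch sketch shows an FRRU-like block with two inputs and two outputs. It omits batch normalization and further details of the original FRRU [PHML17]; the channel numbers and the scale argument are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SimplifiedFRRU(nn.Module):
        def __init__(self, pool_ch, res_ch, out_ch, scale):
            super().__init__()
            self.scale = scale                           # downsampling factor of the pooling stream
            self.conv1 = nn.Conv2d(pool_ch + res_ch, out_ch, 3, padding=1)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
            self.project = nn.Conv2d(out_ch, res_ch, 1)  # back-projection to the residual stream

        def forward(self, y, z):
            # y: pooling stream (reduced resolution), z: residual stream (full resolution)
            z_pooled = F.max_pool2d(z, self.scale)
            h = F.relu(self.conv1(torch.cat([y, z_pooled], dim=1)))
            h = F.relu(self.conv2(h))                    # two convolutional layers, each with ReLU
            z_new = z + F.interpolate(self.project(h), scale_factor=self.scale)
            return h, z_new                              # outputs for both streams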

The second expert architecture, DeepLabv3+, uses the previous network version (DeepLabv3) as an encoder and enhances it with a decoder module. This encoder-decoder structure of DeepLabv3+ suggests using the feature maps of the final encoder layer, namely the atrous spatial pyramid pooling (ASPP) layer, as input to the gate. For the experts based on DeepLabv3+, we evaluate architectures with a ResNet-101 backbone [HZRS16], pre-trained on ImageNet [RDS+15].

Gating: The gate is the core MoE module. It predicts expert probabilities, which can then be used directly to weight the expert outputs. Figure 4 shows the gate architecture. We consider two ways to incorporate a gate into the MoE architecture: a simple gate and a class-wise gate. A simple gate predicts a distribution of expert weights for the whole input image; its output is thus a list of n scalars, one per expert. A class-wise gate, originally proposed in [VVDB17], is a list of simple gates, one for each label, each following the architecture shown in Fig. 4. The output of a class-wise gate is thus a tensor of dimension \(n\times m\), where n is the number of experts and m is the number of classes. Via multiplication with this two-dimensional gate output, the expert outputs are weighted in a class-wise manner. Overall, a simple gate learns to predict how much each expert can be trusted for a specific input, whereas a class-wise gate learns to decide which expert is most suitable for each label given a specific input.
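The following sketch mirrors only the interface of the two gate variants, i.e., their input and output shapes; the actual layer configuration of the gate in Fig. 4 differs, and the global average pooling followed by a linear layer used here is an assumption.

    import torch
    import torch.nn as nn


    class SimpleGate(nn.Module):
        """Predicts one weight per expert for the whole image."""

        def __init__(self, in_channels, n_experts):
            super().__init__()
            self.fc = nn.Linear(in_channels, n_experts)

        def forward(self, feats):                         # feats: (B, in_channels, H', W')
            pooled = feats.mean(dim=(2, 3))               # global average pooling
            return torch.softmax(self.fc(pooled), dim=1)  # (B, n_experts)


    class ClassWiseGate(nn.Module):
        """One simple gate per label; output shape (B, n_experts, n_classes)."""

        def __init__(self, in_channels, n_experts, n_classes):
            super().__init__()
            self.gates = nn.ModuleList(SimpleGate(in_channels, n_experts)
                                       for _ in range(n_classes))

        def forward(self, feats):
            return torch.stack([g(feats) for g in self.gates], dim=2)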

Fig. 4 Gate architecture. Gate input \(\mathbf {f}_\ell ^{\text {experts}}\) is either the concatenation of feature maps \(\mathbf {f}_\ell ^{\text {expert}_1}\) and \(\mathbf {f}_\ell ^{\text {expert}_2}\) (in case of the MoE architecture from Fig. 2) or a single feature map \(\mathbf {f}_\ell ^{\text {encoder}}\) from the shared encoder (in case of the MoE architecture from Fig. 3). Gate outputs \(w^{\text {expert}_1}\) and \(w^{\text {expert}_2}\) are the corresponding expert weights

Moreover, we evaluate the addition of a further convolutional layer \(\mathbf {f}_\ell ^{\text {conv}}\), followed by a ReLU activation function, which is inserted after the gate (see Figs. 2 and 3). The overall MoE output is then computed as follows:

$$\begin{aligned} \mathbf {y} = \mathbf {F}^{\text {MoE}}(\mathbf {x}) = \boldsymbol{\sigma }\big (\mathbf {f}_\ell ^{\text {conv}}( \mathbf {y}^{\text {expert}_1} \cdot w^{\text {expert}_1} + \mathbf {y}^{\text {expert}_2} \cdot w^{\text {expert}_2})\big ). \end{aligned}$$
(6)
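Continuing the earlier MoE sketch for the two-expert case with a simple gate, Eq. (6) can be written as follows; the kernel size of the extra convolution and the reading of \(\boldsymbol {\sigma }\) as a softmax over classes are assumptions.

    import torch
    import torch.nn as nn

    num_classes = 38                                     # full A2D2 label set used in this chapter
    post_conv = nn.Sequential(                           # additional convolutional layer with ReLU
        nn.Conv2d(num_classes, num_classes, kernel_size=3, padding=1),
        nn.ReLU(),
    )


    def moe_output(y1, w1, y2, w2):
        """y1, y2: expert logits of shape (B, C, H, W); w1, w2: gate weights of shape (B,)."""
        weighted = w1[:, None, None, None] * y1 + w2[:, None, None, None] * y2
        return torch.softmax(post_conv(weighted), dim=1)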

4.3 Training

Pipelines for training and evaluation are implemented in PyTorch [PGM+19]. NVIDIA GeForce GTX 1080 Ti graphics cards are used for training and inference. The MoE with separate feature extractors (as shown in Fig. 2) is trained in two stages. First, each expert is trained as a separate neural network on the corresponding data subset. Then, the expert weights are frozen and the whole MoE architecture is trained end-to-end. In the case of the shared encoder architecture (as shown in Fig. 3), the expert networks are trained jointly in the first stage, with their encoder layers sharing parameters. Afterwards, the weights of the shared encoder and of the expert decoders are frozen and the overall MoE is trained end-to-end. Each expert is trained for 50 epochs with a batch size of two. The MoE is trained for 20 epochs with a batch size of six; a smaller batch size is used for larger feature maps. SGD is used as the optimizer for all models, with a polynomial learning rate decay and an initial learning rate of 0.01.
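Schematically, the second training stage can be outlined as below, building on the earlier GatedMixture sketch; the data loader is a placeholder, and details not given in the text (e.g., the decay power of the polynomial schedule or the momentum) are left at PyTorch defaults.

    import torch
    import torch.nn as nn


    def train_gate_stage(moe, train_loader, epochs=20, lr=0.01):
        """Second stage: experts frozen, gate (and optional extra convolution) trained end-to-end."""
        for expert in moe.experts:
            for p in expert.parameters():
                p.requires_grad = False                  # freeze the pre-trained experts
        params = [p for p in moe.parameters() if p.requires_grad]
        optimizer = torch.optim.SGD(params, lr=lr)
        scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=epochs)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for images, labels in train_loader:          # batches of images and label maps
                optimizer.zero_grad()
                criterion(moe(images), labels).backward()
                optimizer.step()
            scheduler.step()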

Table 1 FRRN-based architecture: mIoU/fwIoU for different feature extraction layers. The results are for the simple gate architecture with the extra convolutional layer
Table 2 FRRN-based architecture: mIoU/fwIoU for different gate architectures. The results are for the second \(\text {FRRU}_{384}\) as a feature extraction layer

5 Experimental Results and Discussion

In the following sections, we discuss our findings from the experiments regarding architectural choices, analyze the performance of the MoE model, and discuss its potential as an interpretability approach.

5.1 Architectural Choices

The architectural choices reported in this section refer to Tables 1, 2, and 3 and have been evaluated on the test dataset. Note that good generalization to unseen test data can only be ensured if the respective ablation study is repeated on a separate validation dataset.

Expert feature extraction layer: For the FRRN-based architecture with separate encoders, shown in Fig. 2, we evaluate feature maps extracted from the last max-pooling layer in each FRRU block of the encoder. We additionally evaluate using the input images directly as gate input, as in the original MoE approach. Table 1 shows the results for the MoE architecture with a simple gate and an extra convolutional layer. Using raw input data as gate input leads to better results only on the highway data and to the worst results on the urban data. Using the pooling layer of the last FRRU block as input, however, achieves the highest mIoU values on the mixed dataset. A possible reason is that the high-level features extracted from the later layers serve as a better representation of the more complex urban images. The feature resolution shrinks via a series of pooling operations in the pooling stream; therefore, the later FRRU blocks additionally lead to shorter training times. Since the best results on the mixed dataset were also achieved by the model with the second \(\text {FRRU}_{384}\) feature extraction layer, this layer was selected for all further experiments.

Table 3 DeepLabv3+-based architecture: mIoU/fwIoU for different gate architectures

Gate architecture: To determine the best design choices for the gate architecture and whether to add an additional convolutional layer after the gate, we train and evaluate MoE models with separate encoders (as shown in Fig. 2) for different combinations of these options.

The analysis of the gate's predictions demonstrates that the simple gate tends to select the correct expert for each data subset. As an example, the average expert weights predicted by a simple gate for the DeepLabv3+-based architecture are 0.8 for the highway expert and 0.2 for the urban expert on the highway subset, 0.04 and 0.96 on the urban subset, and 0.52 and 0.48 on the ambiguous data.

The class-wise gate, however, predicts almost uniform weights for the majority of classes. Only a slightly higher weight is assigned to the highway expert on the highway data and, correspondingly, to the urban expert on the urban data. Classes for which the urban expert is assigned a significantly higher weight on all inputs are Signal corpus, Sidewalk, Buildings, Curbstone, Nature object, and RD Normal Street. Moreover, we observed that the class-wise gate tends to predict the same weights for all inputs.

Overall, the simple gate adapts much more flexibly to the input data, which explains the better results of the corresponding architectures (see Tables 2 and 3). Regardless of the gate architecture, an additional convolutional layer also leads to higher mIoU values, because the overall model has a higher capacity. Both FRRN- and DeepLabv3+-based MoEs perform best with the simple gate and an additional convolutional layer after the gate.

Table 4 FRRN-based architecture: mIoU/fwIoU for experts and the MoE. For the MoE, the results are shown for a simple gate with convolutional layer and second \(\text {FRRU}_{384}\) as its feature extraction layer

5.2 MoE Performance

To compare the performance of the proposed MoE architecture with a baseline and with further aggregation methods, the following experiments focus on the MoE with separate feature encoders (as shown in Fig. 2), which uses the simple gate architecture, an additional convolutional layer, and, in the case of the FRRN-based model, the second \(\text {FRRU}_{384}\) as its feature extraction layer. Our conclusions on the performance of this model also hold for almost all other architectural choices presented in the previous subsection (cf. Tables 1, 2, and 3). The results discussed in this section refer to Tables 4 and 5 and have been obtained on the test dataset.

MoE versus baseline versus experts: To ensure a proper basis for comparison, we train a baseline on a combined dataset of highway and urban data. Its architecture is identical to that of an individual expert. Since the baseline was exposed to twice as much data as each expert, it outperforms the expert models that are trained on their respective data subset only. As expected, each of the experts demonstrates the best results on its corresponding data subset, whereas the urban expert shows a slightly better generalization due to a higher diversity of traffic scenes in its training data. The performance of the DeepLabv3+-based MoE with the separate feature encoders (as shown in Fig. 2) surpasses that of the baseline on the mixed dataset, whereas the FRRN-based MoE approaches the baseline.

MoE versus ensemble: We also evaluate an ensemble of experts as a competing aggregation approach. For this, the class-wise probabilities predicted by the pre-trained experts are combined either by taking the mean or the maximum over the experts. Both combination strategies consistently underperform the MoE with separate feature encoders on all data subsets, except for the FRRN-based architecture on highway data. Of the two strategies, the combination via the maximum leads to slightly better results.
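The two ensemble baselines can be sketched as follows; the function and tensor names are placeholders, not the original evaluation code.

    import torch


    def ensemble_predict(expert_logits, mode="mean"):
        """expert_logits: list of (B, C, H, W) logit tensors from the frozen experts."""
        probs = torch.stack([torch.softmax(y, dim=1) for y in expert_logits])  # (n, B, C, H, W)
        combined = probs.mean(dim=0) if mode == "mean" else probs.max(dim=0).values
        return combined.argmax(dim=1)                    # per-pixel class prediction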

MoE with a shared encoder: For the shared encoder architecture shown in Fig. 3, the differences in expert performance are less pronounced, whereas the MoE achieves a slightly lower mIoU than the MoE with separate encoders. It appears that the insufficient specialization of the experts negatively affects the performance of the MoE model. Although the shared encoder provides faster inference, a further drawback of this approach is that the joint training of the shared experts is much more time-consuming. This might limit the usability of the architecture, especially when a fast replacement of certain experts is desired.

Table 5 DeepLabv3+-based architecture: mIoU/fwIoU for experts and the MoE. For the MoE, the results refer to the simple gate with convolutional layer

5.3 Disagreement Analysis

We compare the pixel-wise predictions of each expert and of the overall MoE architecture. The percentage of pixels belonging to each disagreement case on the test dataset is reported in Tables 6 and 7. For both architectures, the experts and the overall MoE output the same prediction (perfect case) for the majority of pixels. Pixels belonging to the normal and critical cases make up to 5% of an image for highway and ambiguous data. For urban data, the MoE tends to rely heavily on the urban expert (up to 27% of pixels); according to the reported mIoU and fwIoU values, the urban expert is also more accurate on this data.

To facilitate the visual analysis of the disagreement regions, we show a disagreement mask for each input (see Fig. 5). Perfect case pixels are left transparent, normal case pixels are colored green or blue for the highway and urban expert, respectively, and critical case pixels are highlighted in red. The visual analysis is consistent with the semantics of the objects in the scene. Regions mapped to the normal case are those that can confidently be segmented by the corresponding expert, because they mostly occur in its training set and not in the training set of the other expert.

Table 6 FRRN-based architecture: percentage of pixels belonging to each disagreement case. The results refer to the architecture with an additional convolutional layer after the gate and the second \(\text {FRRU}_{384}\) as a feature extraction layer
Table 7 DeepLabv3+-based architecture: percentage of pixels belonging to each disagreement case. The results refer to the architecture with an additional convolutional layer after the gate

The critical cases cover only small image areas (up to 1% of pixels). Because we consider ambiguous traffic scenes as out-of-distribution in our setting, they provide the most interesting material for the visual analysis of the critical cases (see Fig. 6). The critical case areas are typically overexposed, blurred, ambiguous, or hardly visible regions of an image. Interestingly, sidewalk pixels are often assigned to the critical case in the ambiguous dataset.

Fig. 5 Disagreement masks and predictions. Perfect case pixels are shown as image pixels, normal case pixels are highlighted green (agreement with the highway expert) or blue (agreement with the urban expert), and critical case pixels are highlighted red. The results are for the DeepLabv3+ architecture with a simple gate and additional convolutional layer

Fig. 6 Examples of disagreement masks with critical case from the ambiguous test set. Perfect case pixels are shown as image pixels, normal case pixels are highlighted green (agreement with the highway expert) or blue (agreement with the urban expert), and critical case pixels are highlighted red. The results are for the DeepLabv3+ architecture with a simple gate and additional convolutional layer

6 Conclusions

A mixture of experts (MoE) is a network aggregation approach that combines the simplicity and intrinsic interpretability of ensembling methods with the possibility of constructing more flexible models via an additional gate module. In this chapter, we have studied how a mixture of experts can be used to aggregate deep neural networks for semantic segmentation in order to increase performance and gain additional interpretability.

Our experiments with two different expert architectures demonstrate that the MoE is able to reach baseline performance and additionally reveals image regions for which the model exhibits high uncertainty. In comparison to our previous experiments in [PHW+20], the models were trained on the full A2D2 label set, which led not only to decreased performance on rare classes, as expected, but also to an increased occurrence of disagreements between the overall architecture and the experts. Furthermore, we have evaluated sharing the parameters of the feature extraction layers of both experts. This leads to better cross-subset expert performance. However, the experts with a shared encoder are apparently no longer specialized enough, which leads to worse MoE performance compared to an MoE with standalone experts. Our evaluation has also shown that a mixture of experts, enhanced with a gating network, beats a simple combination of experts via ensembling.

Further research directions might include conditional execution within an MoE model, the combination of various input modalities, and further enhancing the interpretability and robustness of the overall model.