1 Introduction

An intelligent combination of redundant, complementary functional modules is one strategy to enhance the accuracy of an overall system as well as its robustness to unseen data. Existing aggregation methods either combine neural network outputs, as is the case with ensembles, or combine models at the structural level, generally referred to as fusion, which includes multi-stream and multi-head neural networks.

Ensembles of identically structured models are a typical approach to gain accuracy by combining several separately trained modules. Combination strategies include (un-)weighted averaging, majority voting, stacking, or forming a committee with a combination rule. Ensembles are conceptually simple: their deterministic combination rules are easy to implement and interpret. The approach also scales easily to a multitude of modules. Deep ensembles, i.e., ensembles of deep neural networks trained from different random initializations, have become one of the de facto standards for achieving better test accuracy and model robustness [LPB17].

However, since the combination rules in ensembling are static, no sophisticated, input-dependent combinations are possible. In the case of deep learning models, each ensemble member usually needs to be trained separately. In addition, the overall result can only be computed once the outputs of all members are available.

Fusion lies at the opposite end of the spectrum of model combination approaches. Information from several models is combined either starting from early layers (early fusion), over multiple network layers (slow fusion), or by merging only the last layers (late fusion) [KTS+14]. Regardless of the fusion type, these methods usually involve a complex implementation and offer little or no insight into the decision-making process itself.

Fig. 1 In an ensemble (a), the combination rule is deterministic, so that the distribution of expert weights is fixed for all inputs. A mixture of experts (MoE) (b) contains an additional gate module, which predicts the distribution of the expert weights for each input

The mixture-of-experts architecture, first proposed by Jacobs et al. [JJNH91], takes a middle path and combines the simplicity and interpretability of the result with the possibility of forming sophisticated network architectures. Similar to an ensemble, a mixture of experts combines several separate modules named experts. However, an additional, trainable gating component allows the expert outputs to be weighted on a per-input basis (see Fig. 1).

Here, we introduce the general concept more thoroughly and extend the evaluation from our previous work [PHW+20] to demonstrate the ability of MoE architectures to perform the task at hand (semantic segmentation of traffic scenes) and to provide post-hoc explainability via a comparison of internal decisions.

The contributions on top of our previous work [PHW+20] are as follows: The experiments are extended to include and evaluate a second expert architecture (DeepLabv3+). We also provide new results for the complete label set of the A2D2 dataset. Moreover, a deeper analysis of the impact of the chosen feature extraction layer for the FRRN architecture is performed, whereas in previous work only the last two pooling layers were considered. Then, the expert weights predicted by different gate architectures and the benefit of an additional convolutional layer are analyzed and discussed. A more advanced MoE architecture, in which the encoder layers of the experts are shared, is considered and evaluated. Finally, the proposed MoE architecture is additionally compared to an ensemble of experts.

2 Related Works

Mixtures of experts have primarily been applied to divide-and-conquer tasks, where the input space is split into disjoint subsets, such that each expert specializes in a separate subtask and a gating component learns a distribution over these experts.

One of the application areas of MoEs studied in current publications is fine-grained image classification, where the task is to discriminate between classes within a sub-category of objects, such as the bird classification task [GBM+16]. Ahmed et al. [ABT16] propose a tree-like structure of experts, where the first expert (a generalist) learns to discriminate between coarse groups of classes, whereas further experts (specialists) learn to categorize objects within a specific group. This way, at inference time, the generalist model acts as a gate by selecting the correct specialist for each input.

The mixture-of-experts approach has also been extensively applied to multi-sensor fusion. To achieve this, each expert is trained on a certain sensor modality and a gating component combines the outputs of the experts for different input types. Mees et al. [MEB16] apply a mixture of three experts for different modalities (RGB, depth, and optical flow input) to the task of object recognition in mixed indoor and outdoor environments. An analysis of the weights predicted by the gate shows that the MoE adapts to the changing environment by choosing the most appropriate combination of expert weights; for example, the depth expert receives a higher weight for darker image frames. A similar setup is used in [VDB16], where it is applied to the task of semantic segmentation. As in the previous setting, the MoE demonstrates the ability to compensate for a failed sensor via an appropriate weight combination.

Furthermore, in [FC19] and [FC20], the MoE approach is used to handle multi-modal end-to-end learning for autonomous driving.

A stacked MoE model, consisting of two MoEs, each containing its own experts and corresponding gates, is proposed in [ERS14]. Although evaluated only on MNIST, this work was an important step toward studying large models that conditionally execute certain parts of the overall architecture for each image. The idea was further developed in an influential work on sparse MoEs for language modeling and machine translation [SMM+17], where a generic MoE layer is proposed. Similar to a general MoE, the embedded MoE layer contains several expert networks as well as a gating component, but it forms only a part of the overall architecture trained in an end-to-end manner. During inference, only the experts with non-zero weights are activated in each MoE layer, thus enabling conditional computation on a per-input basis.
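To make the conditional-computation idea concrete, the following sketch shows a top-k routed MoE layer in PyTorch, in which only the selected experts are evaluated for each input. It is a minimal illustration under our own naming; the noisy gating and load-balancing losses of [SMM+17] are omitted, and the experts are assumed to map vectors of size d_model back to d_model.

    import torch
    import torch.nn as nn


    class SparseMoELayer(nn.Module):
        """Minimal sketch of a sparsely gated MoE layer (no noise, no balancing loss)."""

        def __init__(self, experts, d_model, k=2):
            super().__init__()
            self.experts = nn.ModuleList(experts)   # each expert: (batch, d_model) -> (batch, d_model)
            self.router = nn.Linear(d_model, len(experts))
            self.k = k

        def forward(self, x):                       # x: (batch, d_model)
            scores = self.router(x)
            topk, idx = scores.topk(self.k, dim=-1)
            weights = torch.softmax(topk, dim=-1)   # renormalize over the k selected experts
            out = torch.zeros_like(x)
            for j in range(self.k):                 # evaluate only the experts that were selected
                for e in idx[:, j].unique():
                    mask = idx[:, j] == e
                    w = weights[mask, j].unsqueeze(1)
                    out[mask] += w * self.experts[int(e)](x[mask])
            return out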

The mixture-of-experts approach is also tightly connected to the selective execution or conditional computation paradigm. Normally, a straightforward way to boost the performance of a deep learning model, given a sufficiently large dataset, is to increase its capacity, i.e., its number of parameters. This, however, leads to a roughly quadratic increase in the computational cost of training [SMM+17]. One approach to overcome this is to execute only a certain part of the network in an input-dependent manner, as done, for example, by dynamic deep networks [LD18]. A large model with hundreds of MoE layers with dynamic routing is explored in [WYD+19]. Recent work [LLX+21] continues this line of research and applies conditional computation and automatic sharding to a large, sparsely gated MoE combined with transformers for neural machine translation.

In contrast to existing work, we treat the MoE not only as a method to achieve higher accuracy but also as a post-hoc interpretability approach. The described architecture is closer to earlier MoE methods [MEB16] and less complex than the stacked and embedded MoE.

3 Methods

In this section, we formally introduce the mixture-of-experts approach and show how discrepancy masks arise from a comparison of the intermediate predictions within an MoE and how they can serve as a means to localize regions of high uncertainty in an input.

3.1 MoE Architecture

The mixture-of-experts architecture presented here originates from our previous work [PHW+20] and is partially inspired by Valada et al. [VDB16]. In this setting, the MoE consists of two experts, each trained on a separate subset of the data, as well as a small gating network. Deviating from the standard MoE approach, we reuse existing feature maps, extracted from the encoder part of the experts, as input to the gate. These feature maps are much smaller than the raw input and can thus speed up both training and inference (see Fig. 2).

Fig. 2 Architecture of an MoE with separate feature extractors. The gate module receives a concatenation of feature maps \(\mathbf {f}_\ell ^{\text {expert}_1}\) and \(\mathbf {f}_\ell ^{\text {expert}_2}\), extracted from the layer \(\ell \) in an encoder part of each of the experts. The gate then predicts the expert weights \(w^{\text {expert}_1}\) and \(w^{\text {expert}_2}\)

Formally, an MoE \(\mathbf {F}^{\text {MoE}}\) consists of n experts \(\{\mathbf {F}^{\text {expert}_\nu }\}_{\nu =1...n}\) and a gate \(\mathbf {F}^{\text {gate}}\). Assume each expert \(\mathbf {F}^{\text {expert}_\nu }\) follows an encoder-decoder architecture. We select a certain feature extraction layer \(\ell \) from the encoder part of each expert. The gate \(\mathbf {F}^{\text {gate}}\) receives a concatenation of feature maps \(\mathbf {f}_\ell ^{\text {expert}_\nu }\) from layer \(\ell \) of each expert and computes expert weights for an input \(\mathbf {x} \in \mathcal {I}^{H\times W\times C}\) with height H, width W, number of channels \(C=3\), and \(\mathcal {I}= [0, 1]\) via

$$\begin{aligned} (w^{\text {expert}_1}, ..., w^{\text {expert}_n}) = \mathbf {F}^{\text {gate}}\left( \mathbf {f}_\ell ^{\text {expert}_\nu }\Big |_{\nu =1,...,n}\right) . \end{aligned}$$
(1)

To calculate the overall MoE prediction, the weighted sum of logits \(\mathbf {y}^{\text {expert}_\nu }\) of all experts is computed as

$$\begin{aligned} \mathbf {y}= \mathbf {F}^{\text {MoE}}(\mathbf {x}) = \sum _{\nu =1}^{n}w^{\text {expert}_\nu }\cdot \mathbf {y}^{\text {expert}_\nu }. \end{aligned}$$
(2)

The resulting approach can thus be interpreted as an extension of ensembles, where weighting of the outputs of individual models is not predefined, but chosen according to the outputs of a trainable gate.
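As an illustration, the following PyTorch sketch implements Eqs. (1) and (2) for a generic number of experts. It assumes that each expert returns both its encoder feature map and its logits; the class and attribute names are placeholders, not the original implementation.

    import torch
    import torch.nn as nn


    class GatedMixture(nn.Module):
        def __init__(self, experts, gate):
            super().__init__()
            self.experts = nn.ModuleList(experts)   # each expert returns (feature map, logits)
            self.gate = gate                        # maps concatenated feature maps to n weights

        def forward(self, x):
            feats, logits = zip(*(expert(x) for expert in self.experts))
            # Eq. (1): the gate predicts one weight per expert from the concatenated feature maps.
            w = self.gate(torch.cat(feats, dim=1))  # shape (batch, n_experts)
            # Eq. (2): weighted sum of the expert logits.
            return sum(w[:, v, None, None, None] * logits[v]
                       for v in range(len(self.experts)))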

To further increase the capacity of the model, this architecture can be enhanced in a late fusion manner. This is achieved by appending an additional convolutional layer, which processes the weighted sum of the expert outputs before the final MoE prediction is computed (cf. Fig. 2).

Since convolutional neural networks tend to learn similar generic features in the lower layers, these layers can be shared across experts to reduce the overall number of parameters (see Fig. 3). In this case, each of the expert decoders and the gate receives the same pre-processed input \(\mathbf {f}_\ell ^{\text {encoder}}\).

Fig. 3 Architecture of an MoE with a shared feature extractor. In contrast to the architecture in Fig. 2, the gate module receives a single feature map \(\mathbf {f}_\ell ^{\text {encoder}}\) from the shared feature extractor as an input

3.2 Disagreements Within an MoE

The multi-component nature of an MoE allows analyzing its final predictions by comparing them with several intermediate predictions, which are either taken into account or rejected by the overall model. The final decision of an MoE is made on a per-pixel basis and consists of weighting the individual decisions of the participating experts.

By comparing the predictions of the individual experts \(\mathbf {y}^{\text {expert}_\nu }\) with the prediction of the MoE \(\mathbf {y}^{\text {MoE}}\), we can decide for each pixel whether its classification poses a difficulty to the overall model.

We consider the three cases listed below. The discrepancy mask \(\mathbf {y} = (y_{h,w}) \in \mathcal {B}^{H\times W}\) has the same height H and width W as the MoE prediction \(\mathbf {y}^{\text {MoE}}\), with h and w denoting the row and column index and \(\mathcal {B}= \left\{ 0, 1\right\} \). For each case, the corresponding discrepancy mask for an input image is calculated as follows (the symbol \(\bigwedge \) denotes the logical conjunction operator):

  • Perfect case: Here, the prediction by all experts and the MoE is the same. The elements of the discrepancy mask \(\mathbf {y}^{\text {perfect}}\) for the perfect case are computed as follows:

    $$\begin{aligned} y_{h,w}^{\text {perfect}}=1 \iff \bigwedge _{\nu =1}^{n}{\bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_\nu })\big )_{h,w} = \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg )}. \end{aligned}$$
    (3)
  • Normal case: Here, the prediction by one of the experts and the MoE is the same. This case can further be split into normal case 1, where the same prediction is output by expert 1 and the MoE, normal case 2 for expert 2 and the MoE, etc. The elements of the discrepancy mask \(\mathbf {y}^{\text {normal}_1}\) for the normal case 1 are computed as follows:

    $$\begin{aligned} \begin{aligned} y_{h,w}^{\text {normal}_1}=1 \iff \bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_1})\big )_{h,w} = \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg ) \wedge \\ \bigwedge _{\nu =2}^{n}{\bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_\nu })\big )_{h,w} \ne \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg )}. \end{aligned} \end{aligned}$$
    (4)
  • Critical case: Here, the prediction of the MoE differs from the predictions of all experts. The elements of the discrepancy mask \(\mathbf {y}^{\text {critical}}\) for the critical case are computed as follows:

    $$\begin{aligned} y_{h,w}^{\text {critical}}=1 \iff \bigwedge _{\nu =1}^{n}{\bigg ( \big (\arg \max (\mathbf {y}^{\text {expert}_\nu })\big )_{h,w} \ne \big (\arg \max (\mathbf {y}^{\text {MoE}})\big )_{h,w}\bigg )}. \end{aligned}$$
    (5)

The critical case, in which the MoE outputs a prediction that was not foreseen by any of the experts, helps to identify image regions that are challenging for the current model and could, for example, require more relevant examples in the training data.
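The three masks can be computed directly from the argmax predictions. The following PyTorch sketch illustrates Eqs. (3)-(5) for a single image; the function and tensor names are placeholders.

    import torch


    def discrepancy_masks(expert_logits, moe_logits):
        """expert_logits: list of n tensors of shape (C, H, W); moe_logits: tensor of shape (C, H, W)."""
        moe_cls = moe_logits.argmax(dim=0)                     # (H, W) predicted class map
        agree = torch.stack([e.argmax(dim=0) == moe_cls        # per-expert agreement with the MoE
                             for e in expert_logits])          # (n, H, W), boolean
        perfect = agree.all(dim=0)                             # Eq. (3): all experts agree
        normal = [agree[v] & ~(agree[:v].any(dim=0) | agree[v + 1:].any(dim=0))
                  for v in range(len(expert_logits))]          # Eq. (4): only expert v agrees
        critical = ~agree.any(dim=0)                           # Eq. (5): no expert agrees
        return perfect, normal, critical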

4 Experiment Setup

To demonstrate the mixture-of-experts architecture, we chose the application of semantic image segmentation of traffic scenes. The experiments build upon and extend our previous work [PHW+20].

4.1 Datasets and Metrics

Dataset: All experiments were performed using the A2D2 dataset [GKM+20], consisting of RGB images with detailed semantic segmentation masks. In addition to the semantic segmentation labels, we manually labeled the front camera images by road type (highway vs. urban vs. ambiguous). The disjoint subsets of highway and urban images were then used to train two expert networks, one for each data subset. Even though the urban images outnumber those depicting highway scenes, the same number of training (6132) and validation (876) samples is used for each expert. Additionally, 1421 test samples of each of the three road classes (including ambiguous) are available. All images are resized to \(640\times 480\) pixels for training and inference. In contrast to previous work [PHW+20], we use the complete label set of the A2D2 dataset here, comprising 38 classes.

Metrics: The distribution of labels in the A2D2 dataset is highly imbalanced. In particular, objects of nine out of the 38 classes occur in less than 5% of the images. To account for this, the frequency-weighted IoU (fwIoU) was used for evaluation in addition to the mean intersection-over-union (mIoU). For the mIoU, the class-wise IoU values are averaged. For the fwIoU, each class-wise IoU is weighted by its class frequency, pre-computed as the ratio of pixels in the dataset labeled as belonging to that class.
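For reference, both metrics can be computed from a pixel-level confusion matrix as sketched below; how classes that never occur in the test data were handled in the reported numbers is not specified, so this sketch simply skips them.

    import numpy as np


    def iou_metrics(conf):
        """conf: (num_classes, num_classes) confusion matrix with rows as ground truth."""
        tp = np.diag(conf).astype(float)
        denom = conf.sum(axis=1) + conf.sum(axis=0) - tp       # TP + FP + FN per class
        iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
        miou = np.nanmean(iou)                                 # unweighted average over classes
        freq = conf.sum(axis=1) / conf.sum()                   # per-class pixel frequency
        fwiou = np.nansum(freq * iou)                          # frequency-weighted IoU
        return miou, fwiou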

4.2 Topologies

Expert architectures: To keep the focus on the overall mixture-of-experts architecture, two different expert base architectures are evaluated: the Full-Resolution Residual Network (FRRN) [PHML17] and DeepLabv3+ [CZP+18].

An FRRN is composed of two interconnected streams: a pooling stream and a residual stream. While the residual stream carries information at the original image resolution, the pooling stream follows the encoder-decoder approach with a series of pooling and unpooling operations. The whole FRRN is constructed as a series of full-resolution residual units (FRRUs). Each FRRU takes information from both streams as input and also outputs information to both streams, thus having two inputs and two outputs. An FRRU contains two convolutional layers, each followed by a ReLU activation function.

For the following experiments, we use a shallower version of FRRN, namely FRRN-A. The encoder part of the FRRN-A pooling stream consists of ten FRRU blocks, which are distributed into four groups. Each group contains FRRU blocks with the same number of channels in their convolution layers; the group of FRRU blocks is correspondingly denoted by the number of channels (e.g., \(\text {FRRU}_{\text {96}}\) is a group of FRRUs containing convolutions with 96 channels). Each group is then followed by a pooling layer (max-pooling in our case).
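To illustrate the two-stream design, the following simplified PyTorch sketch shows an FRRU-like block with two inputs and two outputs. It omits batch normalization and further details of the original FRRU [PHML17]; the channel numbers and the scale argument are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SimplifiedFRRU(nn.Module):
        def __init__(self, pool_ch, res_ch, out_ch, scale):
            super().__init__()
            self.scale = scale                           # downsampling factor of the pooling stream
            self.conv1 = nn.Conv2d(pool_ch + res_ch, out_ch, 3, padding=1)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
            self.project = nn.Conv2d(out_ch, res_ch, 1)  # back-projection to the residual stream

        def forward(self, y, z):
            # y: pooling stream (reduced resolution), z: residual stream (full resolution)
            z_pooled = F.max_pool2d(z, self.scale)
            h = F.relu(self.conv1(torch.cat([y, z_pooled], dim=1)))
            h = F.relu(self.conv2(h))                    # two convolutional layers, each with ReLU
            z_new = z + F.interpolate(self.project(h), scale_factor=self.scale)
            return h, z_new                              # outputs for both streams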

The second expert architecture, DeepLabv3+, uses the previous network version (DeepLabv3) as an encoder and enhances it with a decoder module. This encoder-decoder structure of DeepLabv3+ suggests using the feature maps of the final encoder layer, namely the atrous spatial pyramid pooling (ASPP) layer, as input to the gate. For the experts based on DeepLabv3+, we evaluate architectures with a ResNet-101 backbone [HZRS16], pre-trained on ImageNet [RDS+15].

Gating: The gate is the core MoE module. It predicts expert probabilities, which can then be used directly to weight the expert outputs. Figure 4 shows the gate architecture. We consider two ways to incorporate a gate into the MoE architecture: a simple gate and a class-wise gate. A simple gate predicts a distribution of expert weights for the whole input image; its output is thus a list of n scalars, one per expert. A class-wise gate, originally proposed in [VVDB17], is a list of simple gates, one for each label, each following the architecture shown in Fig. 4. The output of a class-wise gate is thus a tensor of dimension \(n\times m\), where n is the number of experts and m is the number of classes. Via multiplication with this two-dimensional gate output, the expert outputs are weighted in a class-wise manner. Overall, a simple gate learns to predict how much each expert can be trusted for a specific input, whereas a class-wise gate learns to decide which expert is most suitable for each label given a specific input.
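The following sketch mirrors only the interface of the two gate variants, i.e., their input and output shapes; the actual layer configuration of the gate in Fig. 4 differs, and the global average pooling followed by a linear layer used here is an assumption.

    import torch
    import torch.nn as nn


    class SimpleGate(nn.Module):
        """Predicts one weight per expert for the whole image."""

        def __init__(self, in_channels, n_experts):
            super().__init__()
            self.fc = nn.Linear(in_channels, n_experts)

        def forward(self, feats):                         # feats: (B, in_channels, H', W')
            pooled = feats.mean(dim=(2, 3))               # global average pooling
            return torch.softmax(self.fc(pooled), dim=1)  # (B, n_experts)


    class ClassWiseGate(nn.Module):
        """One simple gate per label; output shape (B, n_experts, n_classes)."""

        def __init__(self, in_channels, n_experts, n_classes):
            super().__init__()
            self.gates = nn.ModuleList(SimpleGate(in_channels, n_experts)
                                       for _ in range(n_classes))

        def forward(self, feats):
            return torch.stack([g(feats) for g in self.gates], dim=2)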

Fig. 4 Gate architecture. Gate input \(\mathbf {f}_\ell ^{\text {experts}}\) is either the concatenation of feature maps \(\mathbf {f}_\ell ^{\text {expert}_1}\) and \(\mathbf {f}_\ell ^{\text {expert}_2}\) (in case of the MoE architecture from Fig. 2) or a single feature map \(\mathbf {f}_\ell ^{\text {encoder}}\) from the shared encoder (in case of the MoE architecture from Fig. 3). Gate outputs \(w^{\text {expert}_1}\) and \(w^{\text {expert}_2}\) are the corresponding expert weights

Moreover, we evaluate the addition of a further convolutional layer \(\mathbf {f}_\ell ^{\text {conv}}\), followed by a ReLU activation function, which is inserted after the gate (see Figs. 2 and 3). The overall MoE output is then computed as follows:

$$\begin{aligned} \mathbf {y} = \mathbf {F}^{\text {MoE}}(\mathbf {x}) = \boldsymbol{\sigma }\big (\mathbf {f}_\ell ^{\text {conv}}( \mathbf {y}^{\text {expert}_1} \cdot w^{\text {expert}_1} + \mathbf {y}^{\text {expert}_2} \cdot w^{\text {expert}_2})\big ). \end{aligned}$$
(6)
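Continuing the earlier MoE sketch for the two-expert case with a simple gate, Eq. (6) can be written as follows; the kernel size of the extra convolution and the reading of \(\boldsymbol {\sigma }\) as a softmax over classes are assumptions.

    import torch
    import torch.nn as nn

    num_classes = 38                                     # full A2D2 label set used in this chapter
    post_conv = nn.Sequential(                           # additional convolutional layer with ReLU
        nn.Conv2d(num_classes, num_classes, kernel_size=3, padding=1),
        nn.ReLU(),
    )


    def moe_output(y1, w1, y2, w2):
        """y1, y2: expert logits of shape (B, C, H, W); w1, w2: gate weights of shape (B,)."""
        weighted = w1[:, None, None, None] * y1 + w2[:, None, None, None] * y2
        return torch.softmax(post_conv(weighted), dim=1)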

4.3 Training

Pipelines for training and evaluation are implemented in PyTorch [PGM+19]. NVIDIA GeForce GTX 1080 Ti graphics cards are used for training and inference. The MoE with separate feature extractors (as shown in Fig. 2) is trained in two stages. First, each expert is trained as a separate neural network on the corresponding data subset. Then, the expert weights are frozen and the whole MoE architecture is trained end-to-end. In the case of the shared encoder architecture (as shown in Fig. 3), the expert networks are trained jointly in the first stage, with their encoder layers sharing parameters. Afterwards, the weights of the shared encoder and of the expert decoders are frozen and the overall MoE is trained end-to-end. Each expert is trained for 50 epochs with a batch size of two. The MoE is trained for 20 epochs with a batch size of six; a smaller batch size is used for larger feature maps. SGD is used as the optimizer for all models, with a polynomial learning rate decay and an initial learning rate of 0.01.
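Schematically, the second training stage can be outlined as below, building on the earlier GatedMixture sketch; the data loader is a placeholder, and details not given in the text (e.g., the decay power of the polynomial schedule or the momentum) are left at PyTorch defaults.

    import torch
    import torch.nn as nn


    def train_gate_stage(moe, train_loader, epochs=20, lr=0.01):
        """Second stage: experts frozen, gate (and optional extra convolution) trained end-to-end."""
        for expert in moe.experts:
            for p in expert.parameters():
                p.requires_grad = False                  # freeze the pre-trained experts
        params = [p for p in moe.parameters() if p.requires_grad]
        optimizer = torch.optim.SGD(params, lr=lr)
        scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=epochs)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for images, labels in train_loader:          # batches of images and label maps
                optimizer.zero_grad()
                criterion(moe(images), labels).backward()
                optimizer.step()
            scheduler.step()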

Table 1 FRRN-based architecture: mIoU/fwIoU for different feature extraction layers. The results are for the simple gate architecture with the extra convolutional layer
Table 2 FRRN-based architecture: mIoU/fwIoU for different gate architectures. The results are for the second \(\text {FRRU}_{384}\) as a feature extraction layer

5 Experimental Results and Discussion

In the following sections, we discuss our findings from the experiments regarding architectural choices, analyze the performance of the MoE model, and discuss its potential as an interpretability approach.

5.1 Architectural Choices

The architectural choices reported in this section refer to Tables 1, 2, and 3 and have been evaluated on the test dataset. Note that good generalization to unseen test data can only be ensured if the respective ablation study is repeated on a separate validation dataset.

Expert feature extraction layer: For the FRRN-based architecture with separate encoders, shown in Fig. 2, we evaluate feature maps extracted from the last max-pooling layer in each FRRU block of the encoder. We additionally evaluate using the input images directly as gate input, as in the original MoE approach. Table 1 shows the results for the MoE architecture with a simple gate and an extra convolutional layer. Using raw input data as gate input leads to better results only on the highway data and to the worst results on the urban data. Using the pooling layer of the last FRRU block as input, however, achieves the highest mIoU values on the mixed dataset. A possible reason is that the high-level features extracted from the later layers serve as a better representation of the more complex urban images. The feature resolution shrinks via a series of pooling operations in the pooling stream; therefore, the later FRRU blocks additionally lead to shorter training times. Since the best results on the mixed dataset were also achieved by the model with the second \(\text {FRRU}_{384}\) feature extraction layer, this layer was selected for all further experiments.

Table 3 DeepLabv3+-based architecture: mIoU/fwIoU for different gate architectures

Gate architecture: To determine the best design choices for the gate architecture and whether to add an additional convolutional layer after the gate, we train and evaluate MoE models with separate encoders (as shown in Fig. 2) for different combinations of these options.

The analysis of the gate's predictions demonstrates that the simple gate tends to select the correct expert for each data subset. As an example, the average expert weights predicted by a simple gate for the DeepLabv3+-based architecture are 0.8 for the highway expert and 0.2 for the urban expert on the highway subset, 0.04 and 0.96 on the urban subset, and 0.52 and 0.48 on the ambiguous data.

The class-wise gate, however, predicts almost uniform weights for the majority of classes. Only a slightly higher weight is assigned to the highway expert on the highway data and, correspondingly, to the urban expert on the urban data. Classes for which the urban expert is assigned a significantly higher weight on all inputs are Signal corpus, Sidewalk, Buildings, Curbstone, Nature object, and RD Normal Street. Moreover, we observed that the class-wise gate tends to predict the same weights for all inputs.

Overall, the simple gate adapts much more flexibly to the input data, which explains the better results of the corresponding architectures (see Tables 2 and 3). Regardless of the gate architecture, an additional convolutional layer also leads to higher mIoU values, because the overall model has a higher capacity. Both FRRN- and DeepLabv3+-based MoEs perform best with the simple gate and an additional convolutional layer after the gate.

Table 4 FRRN-based architecture: mIoU/fwIoU for experts and the MoE. For the MoE, the results are shown for a simple gate with convolutional layer and second \(\text {FRRU}_{384}\) as its feature extraction layer

5.2 MoE Performance

To compare the performance of the proposed MoE architecture with a baseline and with further aggregation methods, the following experiments focus on the MoE with separate feature encoders (as shown in Fig. 2), which uses the simple gate architecture, an additional convolutional layer, and, in the case of the FRRN-based model, the second \(\text {FRRU}_{384}\) as its feature extraction layer. Our conclusions on the performance of this model also hold for almost all other architectural choices presented in the previous subsection (cf. Tables 1, 2, and 3). The results discussed in this section refer to Tables 4 and 5 and have been obtained on the test dataset.

MoE versus baseline versus experts: To ensure a proper basis for comparison, we train a baseline on a combined dataset of highway and urban data. Its architecture is identical to that of an individual expert. Since the baseline was exposed to twice as much data as each expert, it outperforms the expert models that are trained on their respective data subset only. As expected, each of the experts demonstrates the best results on its corresponding data subset, whereas the urban expert shows a slightly better generalization due to a higher diversity of traffic scenes in its training data. The performance of the DeepLabv3+-based MoE with the separate feature encoders (as shown in Fig. 2) surpasses that of the baseline on the mixed dataset, whereas the FRRN-based MoE approaches the baseline.

MoE versus ensemble: We also evaluate an ensemble of experts as a competing aggregation approach. For this, the class-wise probabilities predicted by the pre-trained experts are combined either by taking the mean or the maximum over the experts. Both combination strategies consistently underperform the MoE with separate feature encoders on all data subsets, except for the FRRN-based architecture on highway data. Of the two strategies, the combination via the maximum leads to slightly better results.
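The two ensemble baselines can be sketched as follows; the function and tensor names are placeholders, not the original evaluation code.

    import torch


    def ensemble_predict(expert_logits, mode="mean"):
        """expert_logits: list of (B, C, H, W) logit tensors from the frozen experts."""
        probs = torch.stack([torch.softmax(y, dim=1) for y in expert_logits])  # (n, B, C, H, W)
        combined = probs.mean(dim=0) if mode == "mean" else probs.max(dim=0).values
        return combined.argmax(dim=1)                    # per-pixel class prediction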

MoE with a shared encoder: For the shared encoder architecture shown in Fig. 3, the differences in expert performance are less pronounced, whereas the MoE achieves a slightly lower mIoU than the MoE with separate encoders. It appears that the insufficient specialization of the experts negatively affects the performance of the MoE model. Although the shared encoder provides faster inference, a further drawback of this approach is that the joint training of the shared experts is much more time-consuming. This might limit the usability of the architecture, especially when a fast replacement of certain experts is desired.

Table 5 DeepLabv3+-based architecture: mIoU/fwIoU for experts and the MoE. For the MoE, the results refer to the simple gate with convolutional layer

5.3 Disagreement Analysis

We compare the pixel-wise predictions of each expert and of the overall MoE architecture. The percentage of pixels belonging to each disagreement case on the test dataset is reported in Tables 6 and 7. For both architectures, the experts and the overall MoE output the same prediction (perfect case) for the majority of pixels. Pixels belonging to the normal and critical cases make up to 5% of an image for highway and ambiguous data. For urban data, the MoE tends to rely heavily on the urban expert (up to 27% of pixels); according to the reported mIoU and fwIoU values, the urban expert is also more accurate on this data.

To facilitate the visual analysis of the disagreement regions, we show a disagreement mask for each input (see Fig. 5). Perfect case pixels are left transparent, normal case pixels are colored green or blue for the highway and urban expert, respectively, and critical case pixels are highlighted in red. The visual analysis is consistent with the semantics of the objects in the scene. Regions mapped to the normal case are those that can confidently be segmented by the corresponding expert, because they mostly occur in its training set and not in the training set of the other expert.

Table 6 FRRN-based architecture: percentage of pixels belonging to each disagreement case. The results refer to the architecture with an additional convolutional layer after the gate and the second \(\text {FRRU}_{384}\) as a feature extraction layer
Table 7 DeepLabv3+-based architecture: percentage of pixels belonging to each disagreement case. The results refer to the architecture with an additional convolutional layer after the gate

The critical cases cover only small image areas (up to 1% of pixels). Because we consider ambiguous traffic scenes as out-of-distribution in our setting, they provide the most interesting material for the visual analysis of the critical cases (see Fig. 6). The critical case areas are typically overexposed, blurred, ambiguous, or hardly visible regions of an image. Interestingly, sidewalk pixels are often assigned to the critical case in the ambiguous dataset.

Fig. 5 Disagreement masks and predictions. Perfect case pixels are shown as image pixels, normal case pixels are highlighted green (agreement with the highway expert) or blue (agreement with the urban expert), and critical case pixels are highlighted red. The results are for the DeepLabv3+ architecture with a simple gate and additional convolutional layer

Fig. 6 Examples of disagreement masks with critical case from the ambiguous test set. Perfect case pixels are shown as image pixels, normal case pixels are highlighted green (agreement with the highway expert) or blue (agreement with the urban expert), and critical case pixels are highlighted red. The results are for the DeepLabv3+ architecture with a simple gate and additional convolutional layer

6 Conclusions

A mixture of experts (MoE) is a network aggregation approach that combines the simplicity and intrinsic interpretability of ensembling methods with the possibility of constructing more flexible models via an additional gate module. In this chapter, we have studied how a mixture of experts can be used to aggregate deep neural networks for semantic segmentation in order to increase performance and gain additional interpretability.

Our experiments with two different expert architectures demonstrate that the MoE is able to reach baseline performance and additionally reveals image regions for which the model exhibits high uncertainty. In comparison to our previous experiments in [PHW+20], the models were trained on the full A2D2 label set, which led not only to decreased performance on rare classes, as expected, but also to an increased occurrence of disagreements between the overall architecture and the experts. Furthermore, we have evaluated sharing the parameters of the feature extraction layers of both experts. This leads to better cross-subset expert performance. However, the experts with a shared encoder are apparently no longer specialized enough, which leads to worse MoE performance compared to an MoE with standalone experts. Our evaluation has also shown that a mixture of experts, enhanced with a gating network, beats a simple combination of experts via ensembling.

Further research directions might include conditional execution within an MoE model, the combination of various input modalities, and further enhancing the interpretability and robustness of the overall model.