1 Introduction

In barely a decade, deep neural networks (DNNs) have revolutionized the field of machine learning by reaching unprecedented, sometimes superhuman, performance on a growing variety of tasks. Many of these neural models have found their way into consumer applications like smart speakers, machine translation engines, or content feeds. However, in safety-critical systems, where human life might be at risk, the use of recent DNNs is challenging as various model-inherent insufficiencies remain difficult to address.

This work summarizes promising lines of research on how to identify, address, and at least partly mitigate these DNN insufficiencies. While some of the reviewed works are theoretically grounded and foster the overall understanding of the training and predictive power of DNNs, others provide practical tools to adapt their development, training, or predictions. We refer to any such method as a safety mechanism if it addresses one or several safety concerns in a feasible manner. Their effectiveness in mitigating safety concerns is assessed by safety metrics [CNH+18, OOAG19, BGS+19, SS20a]. As most safety mechanisms target only a particular insufficiency, we conclude that a holistic safety argumentation [BGS+19, SSH20, SS20a, WSRA20] for complex DNN-based systems will in many cases rely on a variety of safety mechanisms.

We structure our review of these mechanisms as follows: Sect. 2 focuses on dataset optimization for network training and evaluation. It is motivated by the well-known fact that, in comparison to humans, DNNs perform poorly on data that is structurally different from training data. Apart from insufficient generalization capabilities of these models, the data acquisition process and distributional data shifts over time play vital roles. We survey potential counter-measures, e.g., augmentation strategies and outlier detection techniques.

Mechanisms that improve robustness are described in Sects. 3 and 4. They deserve attention as DNNs are generally not resilient to common perturbations and adversarial attacks.

Section 5 addresses incomprehensible network behavior and reviews mechanisms that aim at explainability, i.e., a more transparent functioning of DNNs. This is particularly important from a safety perspective as interpretability might allow for tracing back model failure cases, thus facilitating purposeful improvements.

Moreover, DNNs tend to overestimate their prediction confidence, especially on unseen data. Straightforward ways to estimate prediction confidence yield mostly unsatisfactory results. Among others, this observation fueled research on more sophisticated uncertainty estimation (see Sect. 6), redundancy mechanisms (see Sect. 7), and attempts to reach formal verification as addressed in Sect. 8.

Finally, many safety-critical applications require not only accurate but also near real-time decisions. This is covered by mechanisms on the DNN architectural level (see Sect. 9) and furthermore by compression and quantization methods (see Sect. 10).

We conclude this review of mechanism categories with an outlook on the steps to transfer a carefully arranged combination of safety mechanisms into an actual holistic safety argumentation.

2 Dataset Optimization

The performance of a trained model inherently relies on the nature of the underlying dataset. For instance, a dataset with poor variability will hardly result in a model ready for real-world applications. To approach such readiness, data selection processes, such as corner case selection and active learning, are of utmost importance. These approaches help to design datasets that contain the most important information while preventing this desired information from getting lost in an ocean of data. For a given dataset and active learning setup, data augmentation techniques are commonly applied to extract as much model performance from the dataset as possible.

On the other hand, safety arguments also require the analysis of how a model behaves on out-of-distribution data, i.e., data that contains concepts the model has not encountered during training. Such encounters are quite likely as our world is under constant change, in other words, exposed to a constantly growing domain shift. Therefore, these research fields have lately been gaining interest, also with respect to perception in automated driving.

In this section, we provide an overview of anomaly and outlier detection, active learning, domain shift, augmentation, and corner case detection. The highly relevant problem of obtaining statistically sufficient test data, even under the assumption of redundant and independent systems, will then be discussed in Chapter “Does Redundancy in AI Perception Systems Help to Test for Super-Human Automated Driving Performance?” [GRS22]. There it will be shown that neural networks trained on the same computer vision task show high correlation in error cases, even if training data and other design choices are kept independent. Using different sensor modalities, however, diminishes the problem to some extent.

2.1 Outlier/Anomaly Detection

The terms anomaly, outlier, and out-of-distribution (OoD) data detection are often used interchangeably in the literature and refer to the task of identifying data samples that are not representative of the training data distribution. Uncertainty evaluation (cf. Sect. 6) is closely tied to this field as self-evaluation of models is one of the active areas of research for OoD detection. In particular, for image classification problems it has been reported that neural networks often produce high-confidence predictions on OoD data [NYC15, HG17]. The detection of such OoD inputs can either be tackled by post-processing techniques that adjust the estimated confidence [LLS18, DT18] or by enforcing low confidence on OoD samples during training [HAB19, HMD19]. Under specific assumptions, it can even be guaranteed that neural networks produce low-confidence predictions for OoD samples (cf. [MH20b]). More precisely, this work utilizes Gaussian mixture models which, however, may struggle with high-dimensional data and require strong assumptions on the distribution parameters. Some approaches use generative models like GANs [SSW+17, AAAB18] and autoencoders [ZP17] for outlier detection. These models are trained to learn the in-distribution data manifold and produce a higher reconstruction loss for outliers.
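
A minimal sketch of the reconstruction-based idea is given below (PyTorch; architecture, threshold percentile, and the random tensors standing in for data are illustrative assumptions, not taken from any of the cited works): an autoencoder trained on in-distribution data is used to flag samples whose reconstruction error exceeds a threshold calibrated on held-out in-distribution data.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Small convolutional autoencoder for 3x32x32 inputs (assumed)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(model, x):
    """Per-sample mean squared reconstruction error."""
    with torch.no_grad():
        recon = model(x)
    return ((recon - x) ** 2).flatten(1).mean(dim=1)

# Calibrate a threshold on held-out in-distribution data (here: random placeholders),
# then flag test samples whose score exceeds it as potential outliers.
model = ConvAutoencoder().eval()   # assumed to be trained on in-distribution data
val_scores = reconstruction_scores(model, torch.rand(64, 3, 32, 32))
threshold = torch.quantile(val_scores, 0.99)
test_scores = reconstruction_scores(model, torch.rand(8, 3, 32, 32))
is_outlier = test_scores > threshold
```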

For OoD detection in semantic segmentation, only a few works have been presented so far. Angus et al. [ACS19] present a comparative study of common OoD detection methods, which mostly deal with image-level classification. In addition, they provide a novel setup of relevant OoD datasets for this task. Another work trains a fully convolutional binary classifier that distinguishes image patches from a known set of classes from image patches stemming from an unknown class [BKOŠ18]. The classifier output, applied at every pixel, yields the per-pixel confidence value for an OoD object. Both of these works operate at the pixel level without any sophisticated feature generation methods specifically tailored to the detection of entire OoD instances. Up to now, outlier detection has not been studied extensively for object detection tasks. In [GBA+19], two CNNs are used to perform object detection and binary classification (benign or anomaly) in a sequential fashion, where the second CNN takes the localized object within the image as input. From a safety standpoint, detecting outliers or OoD samples is extremely important and beneficial as training data cannot realistically be large enough to capture all situations. Research in this area is heavily entwined with progress in uncertainty estimation (cf. Sect. 6) and domain adaptation (cf. Sect. 2.3). Extending these works to segmentation and object detection tasks would be particularly beneficial for automated driving research. In addition to safety, OoD detection can be beneficial in other aspects, e.g., when using local expert models. Here, an expert model for segmentation of urban driving scenes and another expert model for segmentation of highway driving scenes can be deployed in parallel, where an additional OoD detector could act as a trigger for switching between the models. We extend this discussion in Chapter “Detecting and Learning the Unknown in Semantic Segmentation” [CURG22], where we investigate the handling of unknown objects in semantic segmentation. In the scope of semantic segmentation, we detect anomalous objects via high-entropy responses and perform a statistical analysis over these detections to suggest new semantic categories.

With respect to the approaches presented above, uncertainty-based and generative-model-based OoD detection methods are currently promising directions of research. However, it remains an open question whether they can realize their full potential on segmentation and object detection tasks.

2.2 Active Learning

It is widely known that, as a rule of thumb, for the training of any kind of artificial neural network, an increase of training data leads to increased performance. Obtaining labeled training data, however, is often very costly and time-consuming. Active learning provides one possible remedy to this problem: instead of labeling every data point, active learning utilizes a query strategy to request from a teacher (an oracle) only those labels that improve the model performance the most. The survey paper by Settles [Set10] provides a broad overview of query strategies for active learning methods. However, except for uncertainty sampling and query by committee, most of them seem to be infeasible in deep learning applications up to now. Hence, most of the research activities in active deep learning focus on these two query strategies, as we outline in the following.

It has been shown for image classification [GIG17, RKG18] that labels corresponding to uncertain samples can improve the network’s performance significantly and that a combination with semi-supervised learning is promising. In both works, the uncertainty of unlabeled samples is estimated via Monte Carlo (MC) dropout inference. MC dropout inference and a chosen number of training epochs are executed in alternation; after performing MC dropout inference, the unlabeled samples’ uncertainties are assessed by means of sample-wise dispersion measures. Samples for which the DNN model is very uncertain about its prediction are presented to an oracle and labeled.
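
The following sketch (PyTorch; number of forward passes, dispersion measure, and query budget are assumptions for illustration) shows the core of such an uncertainty-sampling query: several stochastic forward passes with dropout kept active, predictive entropy as the dispersion measure, and selection of the most uncertain samples for the oracle.

```python
import torch
import torch.nn.functional as F

def mc_dropout_entropy(model, x, T=20):
    """Predictive entropy from T MC dropout forward passes."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])  # (T, N, C)
    mean_probs = probs.mean(dim=0)                                           # (N, C)
    return -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=1)          # (N,)

def select_queries(model, unlabeled_batch, budget=32):
    """Return indices of the `budget` most uncertain samples to send to the oracle."""
    entropy = mc_dropout_entropy(model, unlabeled_batch)
    return torch.topk(entropy, k=budget).indices
```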

With respect to object detection, a moderate number of active learning methods have been introduced [KLSL18, RUN18, DCG+19, BKD19]. These approaches include uncertainty sampling [KLSL18, BKD19] and query-by-committee methods [RUN18]. In [KLSL18, DCG+19], additional algorithmic features specifically tailored to object detection networks are presented, i.e., separate treatment of the localization and classification loss [KLSL18], as well as weak and strong supervision schemes [DCG+19]. For semantic segmentation, an uncertainty-sampling-based approach has been presented [MLG+18], which queries polygon masks for image sections of a fixed size (\(128 \times 128\)). Queries are performed by means of accumulated entropy in combination with a cost estimation for each candidate image section. Recently, new methods for estimating the quality of a prediction [DT18, RCH+20] as well as new uncertainty quantification approaches, e.g., gradient-based ones [ORG18], have been proposed. It remains an open question whether they are suitable for active learning. Since most of the conducted studies are rather academic in nature, their applicability to real-life data acquisition is also not yet demonstrated sufficiently. In particular, it is not clear whether the proposed active learning schemes, including the label acquisition, for instance, in semantic segmentation, are suitable to be performed by human labelers. Therefore, label acquisition schemes that account for both the labelers’ convenience and the suitability for active learning are a promising direction for research and development.

2.3 Domains

The classical assumption in machine learning is that the training and testing datasets are drawn from the same distribution, implying that the model is deployed under the same conditions as it was trained under. However, as mentioned in [MTRA+12, JDCR12], for real-world applications this assumption is often violated in the sense that the training and the testing set stem from different domains having different distributions. This poses difficulties for statistical models, and their performance will mostly degrade when they are deployed on a domain \(\mathcal {D}^{\mathrm {test}}\) having a different distribution than the training dataset (i.e., generalizing from the training to the testing domain is not possible). This makes the study of domains not only relevant from the machine learning perspective, but also from a safety point of view.

More formally, there are differing notions of a “domain” in the literature. For [Csu17, MD18], a domain \(\mathcal {D} = \{\mathcal {X}, P(\mathbf {x}) \}\) consists of a feature space \(\mathcal {X} \subset \mathbb {R}^d\) together with a marginal probability distribution \(P(\mathbf {x})\) with \(\mathbf {x}\in \mathcal {X}\). In [BCK+07, BDBC+10], a domain is a pair consisting of a distribution over the inputs together with a labeling function. However, instead of a sharp labeling function, it is also widely accepted to define a (training) domain \(\mathcal {D}^{\mathrm {train}} = \{(\mathbf {x}_i,\overline{y}_i)\}_{i=1}^n\) to consist of n (labeled) samples that are sampled from a joint distribution \(P(\mathbf {x},\overline{y})\) (cf. [LCWJ18]).

The reasons for distributional shift are diverse, as are the names used to indicate such a shift. For example, if the rate of (class) images of interest differs between training and testing set, this can lead to a domain gap and result in differing overall error rates. Moreover, as Chen et al. [CLS+18] mention, changing weather conditions and camera setups in cars lead to a domain mismatch in applications of autonomous driving. In biomedical image analysis, different imaging protocols and diverse anatomical structures can hinder generalization of trained models (cf. [KBL+17, DCO+19]). Common terms to indicate distributional shift are domain shift, dataset shift, covariate shift, concept drift, domain divergence, data fracture, changing environments, or dataset bias. References [Sto08, MTRA+12] provide an overview. Methods and measures to overcome the problem of domain mismatch between one or more (cf. [ZZW+18]) source domains and target domain(s) and the resulting poor model performance are studied in the field of transfer learning and, in particular, its subtopic domain adaptation (cf. [MD18]). For instance, adapting a model that is trained on synthetically generated data to work on real data is one of the core challenges, as can be seen in [CLS+18, LZG+19, VJB+19]. Furthermore, detecting when samples are out-of-domain or out-of-distribution is an active field of research (cf. [LLLS18] and Sect. 2.1 as well as Sect. 8.2 for further reference). This is particularly relevant for machine learning models that operate in the real world: if an automated vehicle encounters a situation that deviates strongly from what was seen during training (e.g., due to some special event like a biking competition, carnival, etc.), this can lead to wrong predictions and thereby potential safety issues if not detected in time.

In Chapter “Analysis and Comparison of Datasets by Leveraging Data Distributions in Latent Spaces” [SRL+22], a new technique to automatically assess the discrepancy of the domains of different datasets is proposed, including domain shift within one target application. The aptitude of encodings generated by different machine-learned models on a variety of automotive datasets is considered. In particular, loss variants of the variational autoencoder that enforce disentangled latent space representations yield promising results in this respect.

2.4 Augmentation

Given the need for large amounts of data to train neural networks, one often runs into a situation where data is lacking. This can lead to insufficient generalization and overfitting to the training data. An overview of different techniques to tackle this challenge can be found in [KGC17]. One approach to overcome this issue is the augmentation of data. It aims at optimizing available data and increasing its amount, curating a dataset that represents a wide variety of possible inputs during deployment. Augmentation can also help when working with a heavily imbalanced dataset by creating more samples of underrepresented classes. A broad survey on data augmentation is provided by [SK19]. They distinguish between two general approaches: first, data warping augmentations, which take existing data and transform it in a way that does not affect the labels; second, oversampling augmentations, which create synthetic data to increase the size of the dataset.

Examples of some of the most basic augmentations are flipping, cropping, rotating, translating, shearing, and zooming. These affect the geometric properties of the image and are easily implemented [SK19]. The machine learning toolkit Keras, for example, provides an easy way of applying them to data using its ImageDataGenerator class [C+15]. Other simple methods include adaptations in color space that affect properties such as lighting, contrast, and tints, which are common variations within image data. Filters can be used to increase blur or sharpness [SK19]. In [ZZK+20], random erasing is introduced as a method with a similar effect as cropping, aiming at gaining robustness against occlusions. An example of mixing images together as an augmentation technique can be found in [Ino18].
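
A minimal sketch of such on-the-fly geometric and photometric augmentation with the mentioned ImageDataGenerator class is shown below (tf.keras; all parameter values and the placeholder data are assumptions chosen for illustration).

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,           # random rotation in degrees
    width_shift_range=0.1,       # horizontal translation (fraction of image width)
    height_shift_range=0.1,      # vertical translation (fraction of image height)
    shear_range=10.0,            # shear angle in degrees
    zoom_range=0.2,              # random zoom
    horizontal_flip=True,        # random horizontal flip
    brightness_range=(0.8, 1.2)  # simple color-space variation
)

x_train = np.random.rand(100, 64, 64, 3)   # placeholder images
y_train = np.random.randint(0, 10, 100)    # placeholder labels

# Typical usage with an (assumed) compiled Keras model:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)
```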

The abovementioned methods have in common that they work on the input data, but there are also approaches that make use of deep learning for augmentation. An example of augmentation in feature space using autoencoders can be found in [DT17]. They use the representation generated by the encoder and create new samples by interpolation and extrapolation between existing samples of a class. The lack of interpretability of augmentations in feature space, in combination with the tendency to perform worse than augmentations in image space, presents open challenges for these types of augmentations [WGSM17, SK19]. Adversarial training is another method that can be used for augmentation. Its goal is to discover cases that would lead to wrong predictions. That means the augmented images will not necessarily represent samples that could occur during deployment but can help in achieving more robust decision boundaries [SK19]. An example of such an approach can be found in [LCPB18]. Generative modeling can be used to generate synthetic samples that enlarge the dataset in a useful way; GANs, variational autoencoders, and combinations of both are important tools in this area [SK19]. Examples of data augmentation in a medical context using a CycleGAN [ZPIE17] can be found in [SYPS19] and using a progressively growing GAN [KALL18] in [BCG+18]. Besides neural style transfer [GEB15], which can be used to change the style of an image to a target style, AutoAugment [CZM+19] and population-based augmentation [HLS+19] are two further notable publications. In both, the idea is to search a predefined search space of augmentations to find the best selection.
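
The feature-space idea of [DT17] can be sketched as follows (NumPy; the encoder/decoder, the same-class neighbor pairing, and the interpolation factor are assumptions standing in for a trained model): new latent codes are obtained by interpolating or extrapolating between codes of same-class samples and then decoded back to input space or used directly as training features.

```python
import numpy as np

def augment_in_feature_space(codes_a, codes_b, lam=0.5):
    """Interpolate (0 < lam < 1) or extrapolate (lam > 1) between same-class latent codes.

    codes_a, codes_b: arrays of shape (N, d) holding latent codes of paired samples.
    """
    return codes_b + lam * (codes_a - codes_b)

rng = np.random.default_rng(0)
codes_a = rng.normal(size=(16, 128))   # stand-in for encoder(x_a)
codes_b = rng.normal(size=(16, 128))   # stand-in for codes of nearest same-class neighbors
new_codes = augment_in_feature_space(codes_a, codes_b, lam=1.5)   # extrapolation
# new_samples = decoder(new_codes)  # decode back to input space, or train on the codes directly
```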

The field of augmenting datasets with purely synthetic images, including related work, is addressed in Chapter “Optimized Data Synthesis for DNN Training and Validation by Sensor Artifact Simulation” [HG22], where a novel approach to apply realistic sensor artifacts to given synthetic data is proposed. The improved overall quality is demonstrated via established per-image metrics and a domain distance measure comparing entire datasets. Exploiting this measure as an optimization criterion leads to an increase in performance for the DeeplabV3+ model, as demonstrated on the Cityscapes dataset.

2.5 Corner Case Detection

Ensuring that AI-based applications behave correctly and predictably even in unexpected or rare situations is a major concern that gains importance especially in safety-critical applications such as autonomous driving. In the pursuit of more robust AI corner cases play an important role. The meaning of the term corner case varies in the literature. Some consider mere erroneous or incorrect behavior as corner cases [PCYJ19, TPJR18, ZHML20]. For example, in [BBLFs19] corner cases are referred to as situations in which an object detector fails to detect relevant objects at relevant locations. Others characterize corner cases mainly as rare combinations of input parameter values [KKB19, HDHH20]. This project adopts the first definition: inputs that result in unexpected or incorrect behavior of the AI function are defined as corner cases. Contingent on the hardware, the AI architecture and the training data, the search space of corner cases quickly becomes incomprehensibly large. While manual creation of corner cases (e.g., constructing or re-enacting scenarios) might be more controllable, approaches that scale better and allow for a broader and more systematic search for corner cases require extensive automation.

One approach to automatic corner case detection is based on transforming the input data. The DeepTest framework [TPJR18] uses three types of image transformations: linear, affine, and convolutional transformations. In addition to these transformations, metamorphic relations help detect undesirable behaviors of deep learning systems. They allow changing the input while asserting some characteristics of the result [XHM+11]. For example, changing the contrast of input frames should not affect the steering angle of a car [TPJR18]. Input–output pairs that violate those metamorphic relations can be considered as corner cases.
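
A minimal sketch of such a metamorphic check is given below (PyTorch; the steering model, the contrast transformation, and the tolerance value are assumptions for illustration): the relation demands that a contrast change leaves the predicted steering angle almost unchanged, and violating inputs are flagged as corner case candidates.

```python
import torch

def adjust_contrast(images, factor):
    """Simple per-image contrast change for images in [0, 1] with shape (N, C, H, W)."""
    mean = images.mean(dim=(1, 2, 3), keepdim=True)
    return ((images - mean) * factor + mean).clamp(0.0, 1.0)

def find_corner_cases(steering_model, images, factor=1.5, tolerance=0.05):
    """Return a boolean mask of inputs violating the metamorphic relation."""
    with torch.no_grad():
        angle_orig = steering_model(images)                              # (N, 1) steering angles
        angle_transformed = steering_model(adjust_contrast(images, factor))
    violation = (angle_orig - angle_transformed).abs() > tolerance
    return violation.squeeze(1)                                          # True = corner case candidate
```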

Among other things, the white-box testing framework DeepXplore [PCYJ19] applies a method called gradient ascent to find corner cases (cf. Sect. 8.1). In the experimental evaluation of the framework, three variants of deep learning architectures were used to classify the same input image. The input image was then changed according to the gradient ascent of an objective function that reflected the difference in the resulting class probabilities of the three model variants. When the changed (now artificial) input resulted in different class label predictions by the model variants, the input was considered as a corner case.

In [BBLFs19], corner cases are detected on video sequences by comparing predicted frames with actual frames. The detector has three components: the first component, semantic segmentation, is used to detect and locate objects in the input frame. As the second component, an image predictor trained on frame sequences predicts the actual frame based on the sequence preceding that frame. An error is determined by comparing the actual with the predicted (i.e., expected) frame, following the idea that only situations that are unexpected for AI-based perception functions may be potentially dangerous and therefore a corner case. Both the segmentation and the prediction error are then fed into the third component of the detector, which determines a corner case score that reflects the extent to which unexpected relevant objects are at relevant locations.

In [HDHH20], a corner case detector based on simulations in a CARLA environment [DRC+17] is presented. In the simulated world, AI agents control the vehicles. During simulations, state information of both the environment and the AI agents are fed into the corner case detector. While the environment provides the real vehicle states, the AI agents provide estimated and perceived state information. Both sources are then compared to detect conflicts (e.g., collisions). These conflicts are recorded for analysis. Several ways of automatically generating and detecting corner cases exist. However, corner case detection is a task with challenges of its own: depending on the operational design domain including its boundaries, the space of possible inputs can be very large. Also, some types of corner cases are specific to the AI architecture, e.g., the network type or the network layout used. Thus, corner case detection has to assume a holistic point of view on both model and input, adding further complexity and reducing transferability of previous insights.

Although it can be argued that rarity does not necessarily characterize corner cases, rare input data might have the potential of challenging the AI functionality (cf. Sect. 2.1). Another research direction could investigate whether structuring the input space in a way suitable for the AI functionality supports the detection of corner cases. Provided that the operational design domain is conceptualized as an ontology, ontology-based testing [BMM18] may support automatic detection. A properly adapted generator may specifically select promising combinations of extreme parameter values and, thus, provide valuable input for synthetic test data generation.

3 Robust Training

Recent works [FF15, RSFD16, BRW18, AW19, HD19, ETTS19, BHSFs19] have shown that state-of-the-art deep neural networks (DNNs) performing a wide variety of computer vision tasks, such as image classification [KSH12, HZRS15, MGR+18], object detection [Gir15, RDGF15, HGDG17], and semantic segmentation [CPSA17, ZSR+19, WSC+20, LBS+19], are not robust to small changes in the input.

Robustness of neural networks is an active and open research field that can be considered highly relevant for achieving safety in automated driving. Currently, most of the research is directed toward either improving adversarial robustness [SZS+14] (robustness against carefully designed perturbations that aim at causing misclassifications with high confidence) or improving corruption robustness [HD19] (robustness against commonly occurring corruptions such as weather changes, addition of Gaussian noise, photometric changes, etc.). While adversarial robustness might be more of a security issue than a safety issue, corruption robustness, on the other hand, can be considered highly safety-relevant.

Equipped with these definitions, we broadly term robust training here as methods or mechanisms that aim at improving either adversarial or corruption robustness of a DNN, by incorporating modifications into the architecture or into the training mechanism itself.

In this section, we cover three widespread techniques for fostering robustness during model training: hyperparameter optimization, modification of loss, and domain generalization. Additionally, in Chapter “Improved DNN Robustness by Multi-Task Training With an Auxiliary Self-Supervised Task” [KFs22], an approach for robustification via multi-task training is presented. Semantic segmentation is combined with the additional target of depth estimation and is shown to exhibit increased robustness on the Cityscapes and KITTI datasets.

A useful metric to assess the training regimen and final quality of a neural network is presented in Chapter “The Good and the Bad: Using Neuron Coverage as a DNN Validation Technique” [GAHW22]. The use of different forms of neuron coverage is discussed and juxtaposed with pair-wise coverage on a tractable example that is being developed.

3.1 Hyperparameter Optimization

The final performance of a neural network highly depends on the learning process. The process includes the actual optimization and may additionally introduce training methods such as dropout, regularization, or parametrization of a multi-task loss.

These methods adapt their behavior according to predefined parameters. Hence, their optimal configuration is a priori unknown. We refer to these parameters as hyperparameters. Important hyperparameters comprise, for instance, the initial learning rate, steps for learning-rate reduction, learning-rate decay, momentum, batch size, dropout rate, and number of iterations. Their configuration has to be determined according to the architecture and task of the CNN [FH19]. The search for an optimal hyperparameter configuration is called hyperparameter optimization (HO).

HO is usually described as an optimization problem [FH19]. Thereby, the combined configuration space is defined as \(\Lambda = \Lambda _1 \times \Lambda _2 \times \cdots \times \Lambda _N\), where \(\Lambda _n\) denotes the domain of the n-th hyperparameter. These individual domains can be continuous, discrete, categorical, or binary.

Hence, one aims to find an optimal hyperparameter configuration \(\boldsymbol{\lambda }^{\star }\) by minimizing an objective function \(\mathcal {O}\left( \cdot \right) \), which evaluates a model \(\mathbf {F}\) with parameters \(\boldsymbol{\theta }\), trained on \(\mathcal {D}^{\text {train}}\), on the validation dataset \(\mathcal {D}^{\text {val}}\) with the loss J:

$$\begin{aligned} \boldsymbol{\lambda }^{\star } = \mathop {\mathrm {arg \,min}}_{\boldsymbol{\lambda } \in \boldsymbol{\Lambda }} \; \mathcal {O} \left( J, \mathbf {F}, \boldsymbol{\theta }, \mathcal {D}^{\text {train}}, \mathcal {D}^{\text {val}} \right) . \end{aligned}$$
(1)
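
As a simple baseline instantiation of Eq. (1), the sketch below performs random search over a small configuration space (plain Python; the search space, the budget, and the placeholder routine train_and_validate, which is assumed to train \(\mathbf {F}\) with a given configuration and return the validation loss, are illustrative assumptions).

```python
import random

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
    "dropout_rate": [0.0, 0.3, 0.5],
}

def sample_configuration(space, rng):
    """Draw one configuration lambda from the combined space Lambda."""
    return {name: rng.choice(values) for name, values in space.items()}

def random_search(train_and_validate, space, budget=20, seed=0):
    """Return the configuration with the lowest validation loss among `budget` trials."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(budget):
        config = sample_configuration(space, rng)
        val_loss = train_and_validate(**config)   # trains the model, returns J on D^val
        if val_loss < best_loss:
            best_config, best_loss = config, val_loss
    return best_config, best_loss
```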

This problem statement is widely regarded in traditional machine learning and primarily based on Bayesian optimization (BO) in combination with Gaussian processes. However, a straightforward application to deep neural networks encounters problems due to a lack of scalability, flexibility, and robustness [FKH18, ZCY+19]. To exploit the benefits of BO, many authors proposed different combinations with other approaches. Hyperband [LJD+18] in combination with BO (BOHB) [FKH18] frames the optimization as “[...] a pure exploration non-stochastic infinite-armed bandit problem [...]”. The method of BO for iterative learning (BOIL) [NSO20] iteratively internalizes collected information about the learning curve and the learning algorithm itself. The authors of [WTPFW19] introduce the trace-aware knowledge gradient (taKG) as an acquisition function for BO (BO-taKG), which “leverages both trace information and multiple fidelity controls”. Thereby, BOIL and BO-taKG achieve state-of-the-art performance for CNNs, outperforming Hyperband.

Other approaches such as the orthogonal array tuning method (OATM) [ZCY+19] or HO by reinforcement learning (Hyp-RL) [JGST19] turn away from the Bayesian approaches and offer new research directions.

Finally, it should be emphasized that many authors include kernel sizes as well as the number of kernels and layers in their hyperparameter configuration. More work should be spent on the distinct integration of HO into the performance estimation strategy of neural architecture search (cf. Sect. 9.3).

3.2 Modification of Loss

There exist many approaches that aim at directly modifying the loss function with the objective of improving either adversarial or corruption robustness [PDZ18, KS18, LYZZ19, XCKW19, SLC19, WSC+20, SR20]. One of the earliest approaches for improving corruption robustness was introduced by Zheng et al. [ZSLG16] and is called stability training, where a regularization term penalizes the deviation between the network predictions for a clean and an augmented image. However, their approach does not scale to many augmentations at the same time. Janocha et al. [JC17] then provided a detailed analysis of the influence of multiple loss functions on model performance as well as robustness and suggested that expectation-based losses tend to work better with noisy data and squared-hinge losses tend to work better for clean data. Other well-known approaches are mainly based on variations of data augmentation [CZM+19, CZSL19, ZCG+19, LYP+19], which can be computationally quite expensive.
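
A minimal sketch of a stability-training-style objective in the spirit of [ZSLG16] is shown below (PyTorch; the noise perturbation, the KL divergence as distance term, and the weighting factor alpha are assumptions chosen for illustration, not the exact formulation of the original paper).

```python
import torch
import torch.nn.functional as F

def stability_loss(model, x_clean, y, alpha=0.01, noise_std=0.05):
    """Task loss on the clean image plus a penalty on the divergence between
    the predictions for the clean image and a perturbed copy."""
    x_perturbed = (x_clean + noise_std * torch.randn_like(x_clean)).clamp(0.0, 1.0)
    logits_clean = model(x_clean)
    logits_perturbed = model(x_perturbed)
    task_loss = F.cross_entropy(logits_clean, y)
    # KL divergence between the two predictive distributions as the stability term
    stability = F.kl_div(F.log_softmax(logits_perturbed, dim=1),
                         F.softmax(logits_clean, dim=1),
                         reduction="batchmean")
    return task_loss + alpha * stability
```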

In contrast to corruption robustness, there exist many more approaches based on adversarial examples. We highlight some of the most interesting and relevant ones here. Mustafa et al. [CWL+19] propose to add a loss term that maximally separates class-wise feature map representations, hence increasing the distance from data points to the corresponding decision boundaries. Similarly, Pang et al. [PXD+20] proposed the Max-Mahalanobis center (MMC) loss to learn more structured representations and induce high-density regions in the feature space. Chen et al. [CBLR18] proposed a variation of the well-known cross-entropy (CE) loss that not only maximizes the model probabilities of the correct class but also minimizes the model probabilities of incorrect classes. Cisse et al. [CBG+17] constrain the Lipschitz constant of different layers to be less than one, which restricts the error propagation introduced by adversarial perturbations to a DNN. Moosavi-Dezfooli et al. [MDFUF19] proposed to minimize the curvature of the loss surface locally around data points. They emphasize that there exists a strong correlation between locally small curvature and correspondingly high adversarial robustness.

All of these methods highlighted above are evaluated mostly for image classification tasks on smaller datasets, namely, CIFAR-10 [Kri09], CIFAR-100 [Kri09], SVHN [NWC+11], and only sometimes on ImageNet [KSH12]. Very few approaches have been tested rigorously on complex safety-relevant tasks, such as object detection, semantic segmentation, etc. Moreover, methods that improve adversarial robustness are only tested on a small subset of attack types under differing attack specifications. This makes comparing multiple methods difficult.

In addition, methods that improve corruption robustness are evaluated over a standard dataset of various corruption types, which may or may not be relevant to the application domain. In order to assess multiple methods for their effect on safety-related aspects, a thorough robustness evaluation methodology is needed, which is largely missing in the current literature. Such an evaluation would need to take into account relevant disturbance/corruption types present in the real world (application domain) and assess robustness to such changes in a rigorous manner. Without such an evaluation, we run the risk of being overconfident in our network, thereby harming safety.

3.3 Domain Generalization

Domain generalization (DG) can be seen as an extreme case of domain adaptation (DA). The latter is a type of transfer learning, where the source and target tasks are the same (e.g., shared class labels) but the source and target domains are different (e.g., another image acquisition protocol or a different background) [Csu17, WYKN20]. DA can be either supervised (SDA), where there is little available labeled data in the target domain, or unsupervised (UDA), where data in the target domain is not labeled. DG goes one step further by assuming that the target domain is entirely unknown. Thus, it seeks to solve the train-test domain shift in general. While DA is already an established line of research in the machine learning community, DG is relatively new [MBS13], though with an extensive list of papers in the last few years.

Probably the first intuitive solution that one may think of to implement DG is neutralizing the domain-specific features. It was shown in [WHLX19] that the gray-level co-occurrence matrices (GLCM) tend to perform poorly in semantic classification (e.g., digit recognition) but yield good accuracy in textural classification compared to other feature sets, such as speeded up robust features (SURF) and local binary patterns (LBP). DG was thus implemented by decorrelating the model’s decision from the GLCM features of the input image, even without the need for domain labels.

Besides the aforementioned intensity-based statistics of an input image, it is known that characterizing image style can be done based on the correlations between the filter responses of a DNN layer [GEB16] (neural style transfer). In [SMK20], the training images are enriched with stylized versions, where a style is defined either by an external style (e.g., cartoon or art) or by an image from another domain. Here, DG is addressed as a data augmentation problem.

Some approaches [LTG+18, MH20a] try to learn generalizable latent representations by a kind of adversarial training. This is done by a generator or an encoder, which is trained to generate a hidden feature space that maximizes the error of a domain discriminator but at the same time minimizes the classification error of the task of concern. Another variant of adversarial training can be seen in [LPWK18], where an adversarial autoencoder [MSJ+16] is trained to generate features, which a discriminator cannot distinguish from random samples drawn from a prior Laplace distribution. This regularization prevents the hidden space from overfitting to the source domains, in a similar spirit to how variational autoencoders do not leave gaps in the latent space. In [MH20a], it is argued that the domain labels needed in such approaches are not always well-defined or easily available. Therefore, they assume unknown latent domains which are learned by clustering in a space similar to the style-transfer features mentioned above. The pseudo-labels resulting from clustering are then used in the adversarial training.
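
The adversarial feature alignment underlying such approaches can be sketched as follows (PyTorch; the gradient reversal layer, the network decomposition into encoder, classifier, and domain discriminator, and the weighting lam are assumptions illustrating the general scheme rather than any single cited method): the encoder minimizes the task loss while, through the reversed gradient, maximizing the error of the domain discriminator on its features.

```python
import torch
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def dg_adversarial_loss(encoder, classifier, discriminator, x, y, domain_labels, lam=1.0):
    features = encoder(x)
    task_loss = F.cross_entropy(classifier(features), y)
    # The reversed gradient makes the encoder fool the domain discriminator
    domain_logits = discriminator(GradientReversal.apply(features, lam))
    domain_loss = F.cross_entropy(domain_logits, domain_labels)
    return task_loss + domain_loss
```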

Autoencoders have been employed for DG not only in an adversarial setup, but also in the sense of multi-task learning nets [Car97], where the classification task in such nets is replaced by a reconstruction one. In [GKZB15], an autoencoder is trained to reconstruct not only the input image but also the corresponding images in the other domains.

At the core of both DA and DG, we are confronted with a distribution matching problem. However, estimating the probability density in high-dimensional spaces is intractable. Density-based metrics, such as the Kullback–Leibler divergence, are thus not directly applicable. In statistics, so-called two-sample tests are usually employed to measure the distance between two distributions in a point-wise manner, i.e., without density estimation. For deep learning applications, these metrics need to be not only point-wise but also differentiable. Two-sample tests were approached in the machine learning literature using (differentiable) K-NNs [DK17], classifier two-sample tests (C2ST) [LO17], or based on the theory of kernel methods [SGSS07]. More specifically, the maximum mean discrepancy (MMD) [GBR+06, GBR+12], which belongs to the kernel methods, is widely used for DA [GKZ14, LZWJ17, YDL+17, YLW+20] but also for DG [LPWK18]. Using the MMD, the distance between two samples is estimated based on pair-wise kernel evaluations, e.g., with the radial basis function (RBF) kernel.
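
The following sketch shows a simple (biased) estimate of the squared MMD with an RBF kernel (PyTorch; the single fixed bandwidth sigma is an assumption, whereas practical implementations often use a mixture of bandwidths).

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    """Pair-wise RBF kernel evaluations between rows of a and b."""
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    """x: (n, d) features from one domain, y: (m, d) features from another domain."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Being differentiable, such a term can be added to the training loss to align
# the feature distributions of different domains.
```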

While the DG approaches generalize to domains from which zero shots are available, the so-called zero-shot learning (ZSL) approaches generalize to tasks (e.g., new classes in the same source domains) for which zero shots are available. Typically, the input in ZSL is mapped to a semantic vector per class instead of a simple class label. This can be, for instance, a vector of visual attributes [LNH14] or a word embedding of the class name [KXG17]. A task (with zero shots at training time) can be then described by a vector in this space. In [MARC20], there is an attempt to combine ZSL and DG in the same framework in order to generalize to new domains as well as new tasks, which is also referred to as heterogeneous domain generalization.

Note that most discussed approaches for DG require non-standard handling, i.e., modifications to models, data, and/or the optimization procedure. This issue poses a serious challenge as it limits the practical applicability of these approaches. There is a line of research which tries to address this point by linking DG to other machine learning paradigms, especially the model-agnostic meta-learning (MAML) [FAL17] algorithm, in an attempt to apply DG in a model-agnostic way. Loosely speaking, a model can be exposed to simulated train-test domain shift by training on a small support set to minimize the classification error on a small validation set. This can be seen as an instance of a few-shot learning (FSL) problem [WYKN20]. Moreover, the procedure can be repeated on other (but related) FSL tasks (e.g., different classes) in what is known as episodic training. The model transfers its knowledge from one task to another task and learns how to learn fast for new tasks. Thus, this can be seen as a meta-learning objective [HAMS20] (in a FSL setup). Since the goal of DG is to adapt to new domains rather than new tasks, several model-agnostic approaches [LYSH18, BSC18, LZY+19, DdCKG19] try to recast this procedure in a DG setup.

4 Adversarial Attacks

Over the last few years, deep neural networks (DNNs) consistently showed state-of-the-art performance across several vision-related tasks. While their superior performance on clean data is indisputable, they show a lack of robustness to certain input patterns, denoted as adversarial examples [SZS+14]. In general, an algorithm for creating adversarial examples is referred to as an adversarial attack and aims at fooling an underlying DNN, such that the output changes in a desired and malicious way. This can be carried out without any knowledge about the DNN to be attacked (black-box attack) [MDFF16, PMG+17], or with full knowledge about the parameters, architecture, or even training data of the respective DNN (white-box attack) [GSS15, CW17a, MMS+18]. While initially being applied on simple classification tasks, some approaches aim at finding more realistic attacks [TVRG19, JLS+20], which particularly pose a threat to safety-critical applications, such as DNN-based environment perception systems in autonomous vehicles. Altogether, this motivated the research in finding ways of defending against such adversarial attacks [GSS15, GRCvdM18, MDFUF19, XZZ+19]. In this section, we introduce the current state of research regarding adversarial attacks in general, more realistic adversarial attacks closely related to the task of environment perception for autonomous driving, and strategies for detecting or defending against adversarial attacks. We conclude each subsection by clarifying current challenges and research directions.

4.1 Adversarial Attacks and Defenses

The term adversarial example was first introduced by Szegedy et al. [SZS+14]. From there on, many researchers tried to find new ways of crafting adversarial examples more effectively. Here, the fast gradient sign method (FGSM) [GSS15], DeepFool [MDFF16], least-likely class method (LLCM) [KGB17a, KGB17b], C&W [CW17b], momentum iterative fast gradient sign method (MI-FGSM) [DLP+18], and projected gradient descent (PGD) [MMS+18] are a few of the most famous attacks so far. In general, these attacks can be executed in an iterative fashion, where the underlying adversarial perturbation is usually bounded by some norm and is following additional optimization criteria, e.g., minimizing the number of changed pixels.
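
As a representative single-step white-box attack, the sketch below implements the fast gradient sign method of [GSS15] (PyTorch; the epsilon value and the assumption of inputs in [0, 1] are illustrative choices).

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8.0 / 255.0):
    """Craft FGSM adversarial examples for a batch (x, y) under an L-infinity bound epsilon."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # One step in the direction of the loss gradient's sign, bounded by epsilon
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```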

The mentioned attacks can be further categorized as image-specific attacks, where a new perturbation needs to be computed for each image. On the other hand, image-agnostic attacks aim at finding a perturbation which is able to fool an underlying DNN on a whole set of images. Such a perturbation is also referred to as a universal adversarial perturbation (UAP). Here, the eponymous algorithm UAP [MDFFF17], fast feature fool (FFF) [MGB17], and prior-driven uncertainty approximation (PD-UA) [LJL+19] are a few honorable mentions. Although the creation process of a universal adversarial perturbation typically relies on a white-box setting, such perturbations show a high transferability across models [HAMFs22]. This allows for black-box attacks, where one model is used to create a universal adversarial perturbation and another model is attacked with the beforehand-created perturbation. Universal adversarial perturbations are investigated in more detail in Chapter “Improving Transferability of Generated Universal Adversarial Perturbations for Image Classification and Segmentation” [HAMFs22], where a way to construct these perturbations for the tasks of image classification and segmentation is presented. In detail, the low-level feature extraction layer is attacked via an additional loss term to generate universal adversarial perturbations more effectively. Another way of designing black-box attacks is to create a surrogate DNN, which mimics the respective DNN to be attacked and thus can be used in the process of adversarial example creation [PMG+17]. In contrast, some research has been done to create completely incoherent images (based on evolutionary algorithms or gradient ascent) that nevertheless fool an underlying DNN [NYC15]. Differently, another line of work proposes to alter only a few pixels in an image to attack a respective model. Here, [NK17] used optimization approaches to perturb a few pixels in an image to produce targeted attacks, aiming at a specific class output, or non-targeted attacks, aiming at outputting a class different from the network output or the ground truth. This can be taken to the extreme of perturbing only a single pixel to generate adversarial images [NK17, SVS19]. The authors of [BF17, SBMC17, PKGB18] proposed to train generative models to generate adversarial examples. Given an input image and the target label, a generative model is trained to produce adversarial examples for DNNs. However, while the produced adversarial examples look rather unrealistic to a human, they are able to completely deceive a DNN.

The existence of adversarial examples not only motivated research in finding new attacks, but also in finding strategies to effectively defend against these attacks. Especially for safety-critical applications, such as DNN-based environment perception for autonomous driving, the existence of adversarial examples needs to be handled accordingly. Similar to adversarial attacks, one can categorize defense strategies into two types: model-specific defense strategies and model-agnostic defense strategies. The former refers to defense strategies where the model of interest is modified in certain ways. The modification can be done on the architecture, training procedure, training data, or model weights. On the other hand, model-agnostic defense strategies consider the model to be a black box. Here, only the input or the output is accessible. Some well-known model-specific defense strategies include adversarial training [GSS15, MMS+18], the inclusion of robustness-oriented loss functions during training [CLC+19, MDFUF19, KYL+20], removing adversarial patterns in features by denoising layers [HRF19, MKH+19, XWvdM+19], and redundant teacher-student frameworks [BHSFs19, BKV+20]. The majority of model-agnostic defense strategies primarily focus on various kinds of (gradient masking) pre-processing strategies [GRCvdM18, BFW+19, GR19, JWCF19, LLL+19, RSFM19, TCBZ19]. The idea is to remove the adversary from the respective image, such that the image is transformed from the adversarial space back into the clean space.
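
A minimal sketch of adversarial training in the style of [MMS+18] is given below (PyTorch; the PGD step size, number of steps, epsilon bound, and the assumption of inputs in [0, 1] are illustrative choices): each batch is replaced by adversarial examples crafted with projected gradient descent before the usual parameter update.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8/255, alpha=2/255, steps=7):
    """Craft PGD adversarial examples within an L-infinity ball of radius epsilon."""
    x_adv = x + torch.empty_like(x).uniform_(-epsilon, epsilon)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = x.detach() + (x_adv - x).clamp(-epsilon, epsilon)   # project back into the epsilon-ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One training step on adversarial examples only."""
    model.train()
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()      # clears gradients accumulated during attack crafting
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```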

Nonetheless, Athalye et al. [ACW18] showed that gradient masking alone is not a sufficient criterion for a reliable defense strategy. In addition, detection and out-of-distribution techniques have also been proposed as model-agnostic defense strategies against adversarial attacks. Here, the Mahalanobis distance [LLLS18] or the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR) [HG17] are used to detect adversarial examples. The authors of [HG17, LLLS17, MCKBF17], on the other hand, proposed to train networks to detect whether the input image is out-of-distribution or not.

Moreover, Feinman et al. [FCSG17] showed that adversarial attacks usually produce high uncertainty on the output of the DNN. As a consequence, they proposed to use the dropout technique to estimate uncertainty on the output to identify a possible adversarial attack. Regarding adversarial attacks, the majority of the listed attacks are designed for image classification. Only a few adversarial attacks consider tasks that are closely related to autonomous driving, such as bounding box detection, semantic segmentation, instance segmentation, or even panoptic segmentation. Also, the majority of the adversarial attacks rely on a white-box setting, which is usually not present for a potential attacker. Especially universal adversarial perturbations have to be considered as a real threat due to their high model transferability. Generally speaking, the existence of adversarial examples has not been thoroughly studied yet. An analytical interpretation is still missing, but could help in designing more mature defense strategies.

Regarding defense strategies, adversarial training is still considered one of the most effective ways of increasing the robustness of a DNN. Nonetheless, while adversarial training is indeed effective, it is rather inefficient in terms of training time. In addition, model-agnostic defenses should be favored as, once designed, they can be easily transferred to different models. Moreover, as most model-agnostic defense strategies rely on gradient masking and it has been shown that gradient masking is not a sufficient property for a defense strategy, new ways of designing model-agnostic defenses should be taken into account. Furthermore, out-of-distribution and adversarial attack detection or even correction methods have become a new trend for identifying attacks. However, as the environment perception system of an autonomous driving vehicle could rely on various information sources, including LiDAR, optical flow, or depth from a stereo camera, techniques of information fusion should be further investigated to mitigate or even eliminate the effect of adversarial examples.

4.2 More Realistic Attacks

We consider the following two categories of realistic adversarial attacks: (1) image-level attacks, which not only fool a neural network but also pose a provable threat to autonomous vehicles, and (2) attacks which have been applied in the real world or in a simulation environment, such as car learning to act (CARLA) [DRC+17].

Some notable examples in the first category of attacks include attacks on semantic segmentation [MCKBF17] or person detection [TVRG19].

In the second group of approaches, the attacks are specifically designed to survive real-world distortions, including different distances, weather and lighting conditions, as well as camera angles. For this, adversarial perturbations are usually concentrated in a specific image area, called adversarial patch. Crafting an adversarial patch involves specifying a patch region in each training image, applying transformations to the patch, and iteratively changing the pixel values within this region to maximize the network prediction error. The latter step typically relies on an algorithm, proposed for standard adversarial attacks, which aim at crafting invisible perturbations while misleading neural networks, e.g., C&W [CW17b], Jacobian-based saliency map attack (JSMA) [PMG+17], and PGD [MMS+18].
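
The sketch below illustrates the general patch crafting loop just described (PyTorch; the patch size, the targeted loss that drives predictions toward an attacker-chosen class, random placement as the only sampled transformation, an assumed classification model and data loader, and image values in [0, 1] are simplifying assumptions in the spirit of expectation over transformations).

```python
import torch
import torch.nn.functional as F

def apply_patch(images, patch, top, left):
    """Paste the patch into a cloned batch at the given position."""
    patched = images.clone()
    patched[:, :, top:top + patch.shape[-2], left:left + patch.shape[-1]] = patch
    return patched

def train_patch(model, loader, target_class, patch_size=32, steps=500, lr=0.05):
    """Optimize a square patch so that patched images are classified as target_class."""
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=lr)
    for _, (x, _) in zip(range(steps), loader):
        top = torch.randint(0, x.shape[-2] - patch_size, (1,)).item()    # random placement
        left = torch.randint(0, x.shape[-1] - patch_size, (1,)).item()   # as a simple EOT
        logits = model(apply_patch(x, patch.clamp(0.0, 1.0), top, left))
        target = torch.full((x.shape[0],), target_class, dtype=torch.long)
        loss = F.cross_entropy(logits, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return patch.detach().clamp(0.0, 1.0)
```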

The first printable adversarial patch for image classification was described by Brown et al. [BMR+17]. Expectation over transformations (EOT) [AEIK18] is one of the influential updates to the original algorithm: it allows patch-based attacks to be robustified against distortions and affine transformations. Localized and visible adversarial noise (LaVAN) [KZG18] is a further method to generate much smaller patches (up to 2% of the pixels in the image). In general, fooling image classification with a patch is a comparatively simple task, because adversarial noise can mimic an instance of another class and thus lower the prediction probability for the true class.

Recently, patch-based attacks for the trickier task of object detection have been described [LYL+19, TVRG19]. Also, Lee and Kolter [LK19] generate a patch using PGD [MMS+18], followed by EOT applied to the patch. With this approach, all detections in an image can be successfully suppressed, even without any overlap of the patch with bounding boxes. Furthermore, several approaches for generating an adversarial T-shirt have been proposed, including [XZL+20, WLDG20].

DeepBillboard [ZLZ+20] is the first attempt to attack end-to-end driving models with adversarial patches. The authors propose to generate a single patch for a sequence of input images to mislead four steering models, including DAVE-2 in a drive-by scenario.

Apart from physical feasibility, inconspicuousness is crucial for a realistic attack. Whereas adversarial patches usually look like regions of noise, several works have explored attacks with an inconspicuous patch. In particular, Eykholt et al. [EEF+18] demonstrate the vulnerability of road sign classification to adversarial perturbations in the form of only black and white stickers. In [BHG+19], an end-to-end driving model is attacked in CARLA by painting black lines on the road. Also, Kong and Liu [KGLL20] use a generative adversarial network to obtain a realistic billboard to attack an end-to-end driving model in a drive-by scenario. In [DMW+20], a method to hide visible adversarial perturbations with customized styles is proposed, which leads to adversarial traffic signs that look unsuspicious to a human. Current research mostly focuses on attacking the image-based perception of an autonomous vehicle. Adversarial vulnerability of further components of an autonomous vehicle, e.g., LiDAR-based perception, optical flow, and depth estimation, has only recently gained attention. Furthermore, most attacks consider only a single component of an autonomous driving pipeline; the question of whether the existing attacks are able to propagate to further pipeline stages has not been studied yet. The first work in this direction [JLS+20] describes an attack on object detection and tracking. The evaluation is, however, limited to a few clips, and no experiments in the real world have been performed. Overall, the research on realistic adversarial attacks, especially combined with physical tests, is currently in its starting phase.

5 Interpretability

Neural networks are, by their nature, black boxes and therefore intrinsically hard to interpret [Tay06]. Due to their unrivaled performance, they still remain the first choice for advanced systems even in many safety-critical areas, such as level 4 automated driving. This is why the research community has invested considerable effort into opening up this black-box character and making deep neural networks more transparent.

We can observe three strategies that provide different viewpoints toward this goal in the state of the art. The first is the most direct approach of opening up the black box and looking at intermediate representations. Being able to interpret individual layers of the system facilitates interpretation of the whole. The second approach tries to provide interpretability by explaining the network’s decisions with pixel attributions. Aggregated explanations of decisions can then lead to interpretability of the system itself. The third is the idea of approximating the network with interpretable proxies to benefit from the deep neural network’s performance while allowing interpretation via surrogate models. Underlying all aspects here is the area of visual analytics.

There exists earlier research in the medical domain to help human experts understand machine learning decisions and convince them of their validity [CLG+15]. Legal requirements in the finance industry gave rise to interpretable systems that can justify their decisions. An additional driver for interpretability research was the concern about Clever Hans predictors [LWB+19].

5.1 Visual Analytics

Traditional data science has developed a huge tool set of automated analysis processes conducted by computers, which are applied to problems that are well defined in the sense that the dimensionality of input and output as well as the size of the dataset they rely on is manageable. For problems that are more complex in comparison, the automation of the analysis process is limited and/or might not lead to the desired outcome. This is especially the case with unstructured data like image or video data, in which the underlying information cannot directly be expressed by numbers. Rather, it needs to be transformed into some structured form to enable computers to perform analysis tasks. Additionally, with an ever-increasing amount of various types of data being collected, this “information overload” cannot solely be analyzed by automatic methods [KAF+08, KMT09].

Visual analytics addresses this challenge as “the science of analytical reasoning facilitated by interactive visual interfaces” [TC05]. Visual analytics therefore does not only focus on either computationally processing data or visualizing results but couples both tightly with interactive techniques. Thus, it enables an integration of the human expert into the iterative visual analytics process: through visual understanding and human reasoning, the knowledge of the human expert can be incorporated to effectively refine the analysis. This is of particular importance where a stringent safety argumentation for complex models is required. With the help of visual analytics, the line of argumentation can be built upon arguments that are understandable for humans. To include the human analyst efficiently into this process, a possible guideline is the visual analytics mantra by Keim: “Analyze first, show the important, zoom, filter and analyze further, details on demand” [KAF+08].

The core concepts of visual analytics therefore rely on well-designed interactive visualizations, which support the analyst in the tasks of, e.g., reviewing, understanding, comparing, and inferring not only the initial phenomenon or data but also the computational model and its results, with the goal of enhancing the analytical process. Driven by various fields of application, visual analytics is a multidisciplinary field with a wide variety of task-oriented development and research. Recent work has been done in several areas:

  • depending on the task, there exist different pipeline approaches to create whole visual analytics systems [WZM+16];

  • the injection of human expert knowledge into the process of determining trends and patterns from data is the focus of predictive visual analytics [LCM+17, LGH+17];

  • enabling the human to explore high-dimensional data [LMW+17] interactively and visually (e.g., via dimensionality reduction [SZS+17]) is a major technique to enhance the understandability of complex models such as neural networks;

  • the iterative improvement and understanding of machine learning models is addressed by using interactive visualizations in general machine learning [LWLZ17], and, the other way round, machine learning is used to improve visualizations and guidance based on user interactions [ERT+17].

Even more focused on the loop of simultaneously developing and refining machine learning models is the area of interactive machine learning, where the topics of interface design [DK18] and the importance of users [ACKK14, SSZ+17] are discussed. One of the current research directions is using visual analytics in the area of deep learning [GTC+18, HKPC18, CL18]. However, due to the interdisciplinarity of visual analytics, there are still open directions and ongoing research opportunities.

Especially in the domain of neural networks and deep learning, visual analytics is a relatively new approach to tackling the challenge of explainability and interpretability of these often so-called black boxes. To enable the human to better interact with the models, research is done on enhancing the understandability of complex deep learning models and their outputs with the use of proper visualizations. Other research directions attempt to improve the trustworthiness of the models by giving the opportunity to inspect, diagnose, and refine them. Further possible areas of research are online training processes and the development of interactive systems covering the whole process of training, enhancing, and monitoring machine learning models. Here, the approach of mixed guidance, where system-initiated guidance is combined with user-initiated guidance, is also discussed in the visual analytics community. Another challenge and open question is creating ways of comparing models in order to examine which model performs better in specific situations, and of selecting or combining the best models with the goal of increasing performance and overall safety.

5.2 Intermediate Representations

In general, representation learning [BCV13] aims to extract lower dimensional features in latent space from higher dimensional inputs. These features are then used as an effective representation for regression, classification, object detection, and other machine learning tasks. Preferably, latent features should be disentangled, meaning that they represent separate factors found in the data that are statistically independent. Due to their importance in machine learning, finding meaningful intermediate representations has long been a primary research goal. Disentangled representations can be interpreted more easily by humans and can, for example, be used to explain the reasoning of neural networks [HDR18].

Among the longer known methods for extracting disentangled representations are principal component analysis (PCA) [FP78, JC16], independent component analysis [HO00], and nonnegative matrix factorization [BBL+07]. PCA is highly sensitive to outliers and noise in the data; therefore, more robust algorithms were proposed. In [SBS12], a small neural network was already used as an encoder, and the algorithm proposed in [FXY12] can deal with high-dimensional data. Some robust PCA algorithms come with analytical performance guarantees [XCS10, RA17, RL19].

A popular method for representation learning with deep networks is the variational autoencoder (VAE) [KW14]. An important generalization of the method is the \( \beta \)-VAE variant [HMP+17], which improved the disentanglement capability [FAA18]. Later analysis added to the theoretical understanding of \( \beta \)-VAE [BHP+18, SZYP19, KP20]. Compared to standard autoencoders, VAEs map inputs to a distribution, instead of mapping them to a fixed vector. This allows for additional regularization of the training to avoid overfitting and ensure good representations. In \( \beta \)-VAEs, the trade-off between reconstruction quality and disentanglement can be fine-tuned by the hyperparameter \( \beta \).
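To make the trade-off concrete, the following minimal PyTorch sketch (all names are placeholders, not taken from the cited works) shows a \( \beta \)-VAE training objective in which the reconstruction term is balanced against the KL divergence of the approximate posterior by the hyperparameter \( \beta \); setting \( \beta = 1 \) recovers the standard VAE.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Minimal beta-VAE objective: reconstruction + beta * KL divergence.

    x, x_recon : input and reconstruction, values in [0, 1]
    mu, log_var: parameters of the approximate posterior q(z|x)
    beta       : trade-off between reconstruction quality and disentanglement
    """
    # Reconstruction term (here: pixel-wise binary cross-entropy)
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl

def reparameterize(mu, log_var):
    """Sample z ~ q(z|x) with the reparameterization trick."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)
```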

Different regularization schemes have been suggested to improve the VAE method. Among them are Wasserstein autoencoders [TBGS19, XW19], attribute regularization [PL20], and relational regularization [XLH+20]. Recently, a connection between VAEs and nonlinear independent component analysis was established [KKMH20] and then expanded [SRK20].

Besides VAEs, deep generative adversarial networks can be used to construct latent features [SLY15, CDH+16, MSJ+16]. Other works suggest centroid encoders [GK20] or conditional learning of Gaussian distributions [SYZ+21] as alternatives to VAEs. In [KWG+18], concept activation vectors are defined as being orthogonal to the decision boundary of a classifier. Beyond these established deep learning approaches, entirely new architectures, such as capsule networks [SFH17], might be used to disassemble inputs.

While many different approaches for disentangling exist, the feasibility of the task is not yet clear and a better theoretical understanding is needed. Disentangling performance is hard to quantify; a reliable evaluation is only feasible with information about the latent ground truth [EW18]. Models that rely too strongly on single directions, single neurons in fully connected networks, or single feature maps in CNNs have a tendency to overfit [MBRB18]. According to [LBL+19], unsupervised learning does not produce good disentangling, and even small latent spaces do not reduce the sample complexity for simple tasks. This is in direct contrast to newer findings that show a decreased sample complexity for more complex visual downstream tasks [vLSB20]. So far, it is unclear whether disentangling improves the performance of machine learning tasks.

In order to be interpretable, latent disentangled representations need to be aligned with human understandable concepts. In [EIS+19], training with adversarial examples was used and the learned representations were shown to be more aligned with human perception. For explainable AI, disentangling alone might not be enough to generate interpretable output, and additional regularization could be needed.

In Chapter “Invertible Neural Networks for Understanding Semantics of Invariances of CNN Representations” [REBO22], a new approach is presented, which aims at extracting the invariant representations that are present in a trained network. To this end, invertible neural networks are deployed to recover these invariances and map them to accessible semantic concepts. These concepts allow both for interpreting inner model representations and for carefully manipulating them to examine the effect on the prediction.

5.3 Pixel Attribution

The non-linearity and complexity of DNNs allow them to solve perception problems, such as detecting a pedestrian, that cannot be specified in detail. At the same time, the automatic extraction of features from an input image and their mapping to the respective prediction is counterintuitive and incomprehensible for humans, which makes it hard to argue safety for a neural-network-based perception task. Feature importance techniques are currently predominantly used to diagnose the causes of incorrect model behavior [BXS+20]. So-called attribution maps are a visual technique to express the relationship between relevant pixels in the input image and the network’s prediction. Regions in an image that contain relevant features are highlighted accordingly. Attribution approaches mostly fall into one of three categories.

Gradient-based and activation-based approaches (such as [SVZ14, SDBR14, BBM+15, MLB+17, SCD+20, STK+17, SGK19], among others) rely on the gradient of the prediction with respect to the input; regions that were most relevant for the prediction are highlighted. Activation-based approaches relate the feature maps of the last convolutional layer to the output classes.
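As an illustration of the gradient-based family, the following sketch (assuming a differentiable PyTorch image classifier; all names are placeholders) computes a vanilla gradient saliency map, one of the simplest gradient-based attributions:

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla gradient attribution: |d score / d input| per pixel.

    model        : differentiable classifier returning class logits
    image        : input tensor of shape (1, C, H, W)
    target_class : index of the class whose score is attributed
    """
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)
    score = logits[0, target_class]
    score.backward()                           # gradient of the class score w.r.t. the input
    saliency = image.grad.abs().max(dim=1)[0]  # aggregate over color channels
    return saliency                            # shape (1, H, W); higher = more relevant
```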

Perturbation-based approaches [ZF14, FV17, ZCAW17, HMK+19] manipulate the input. If the prediction changes significantly, the perturbed region at least holds a possible explanation.
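A minimal perturbation-based counterpart is occlusion analysis: slide a gray patch over the image and record how much the target score drops. The sketch below (placeholder names; a generic simplification, not the method of any cited work) illustrates the idea:

```python
import torch

def occlusion_map(model, image, target_class, patch=16, stride=16, fill=0.5):
    """Occlusion attribution: score drop when a patch of the input is masked."""
    model.eval()
    with torch.no_grad():
        base_score = model(image)[0, target_class].item()
        _, _, h, w = image.shape
        ys = list(range(0, h - patch + 1, stride))
        xs = list(range(0, w - patch + 1, stride))
        heat = torch.zeros(len(ys), len(xs))
        for i, y in enumerate(ys):
            for j, x in enumerate(xs):
                occluded = image.clone()
                occluded[:, :, y:y + patch, x:x + patch] = fill
                heat[i, j] = base_score - model(occluded)[0, target_class].item()
    return heat  # large values: masking this region hurts the prediction most
```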

While gradient-based approaches are oftentimes faster to compute, perturbation-based approaches are much easier to interpret. As many studies have shown [STY17, AGM+20], there is still a lot of research to be done before attribution methods are able to robustly provide explanations for model predictions, in particular for erroneous behavior. One key difficulty is the lack of an agreed-upon definition of a good attribution map, including its important properties. Even between humans, it is hard to agree on what a good explanation is due to its subjective nature. This lack of ground truth makes it hard or even impossible to quantitatively evaluate an explanation method. Instead, this evaluation is done only implicitly. One typical way to do this is the axiomatic approach: a set of desiderata for an attribution method is defined, against which different attribution methods are then evaluated. Alternatively, different attribution methods may be compared by perturbing the input features, starting with the ones deemed most important, and measuring the drop in accuracy of the perturbed models; the best method results in the greatest overall loss in accuracy as more and more input features are omitted [ACÖG17]. Moreover, for gradient-based methods it is hard to assess whether an unexpected attribution is caused by a poorly performing network or a poorly performing attribution method [FV17]. How to cope with negative evidence, i.e., an object that was predicted because a contrary clue was missing in the input image, is an open research question. Additionally, most methods have so far only been demonstrated on classification tasks; it remains to be seen how they can be transferred to object detection and semantic segmentation tasks. In the case of perturbation-based methods, the high computation time and the single-image analysis inhibit widespread application.

5.4 Interpretable Proxies

Neural networks are capable of capturing complicated logical (cor)relations. However, this knowledge is encoded on a sub-symbolic level in the form of learned weights and biases, meaning that the reasoning behind the processing chain cannot be directly read out or interpreted by humans [CCO98]. To explain the sub-symbolic processing, one can either use attribution methods (cf. Sect. 5.3) or lift this sub-symbolic representation to a symbolic, i.e., more interpretable, one [GBY+18]. Interpretable proxies or surrogate models try to achieve the latter: the DNN behavior is approximated by a model that uses symbolic knowledge representations. Symbolic representations can be linear models like local interpretable model-agnostic explanations (LIME) [RSG16] (proportionality), decision trees (if-then chains) [GBY+18], or loose sets of logical rules. Logical connectors can simply be AND and OR but also more general ones like at-least-M-of-N [CCO98]. The expressiveness of an approach refers to the logic that is used: Boolean-only versus first-order logic, and binary versus fuzzy logic truth values [TAGD98]. In contrast to attribution methods (cf. Sect. 5.3), these representations can capture combinations of features and (spatial) relations of objects and attributes. As an example, consider “eyes are closed” as an explanation for “person asleep”: attribution methods could only mark the location of the eyes, dismissing the relations of the attributes [RSS18]. All mentioned surrogate model types (linear models, sets of rules) require interpretable input features in order to be interpretable themselves. These features must either be directly obtained from the DNN input or (intermediate) output, or be automatically extracted from the DNN representation. Examples of such extraction are the super-pixeling used in LIME for input feature detection, or concept activation vectors [KWG+18] for decoding DNN representations.
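To illustrate the idea of a local surrogate, the following sketch (not the original LIME implementation; the names, the binary feature encoding, and the use of scikit-learn are assumptions) fits a sparse linear model to the black-box predictions in the neighborhood of one sample described by interpretable binary features:

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_linear_surrogate(predict_fn, z, n_samples=1000, sigma=0.25):
    """Fit a local linear proxy around one sample.

    predict_fn : black-box function mapping binary feature vectors to a score
    z          : numpy array, 1 = interpretable feature present, shape (d,)
    Returns the per-feature weights of the local explanation.
    """
    d = len(z)
    # Perturb the sample by randomly switching interpretable features off
    masks = np.random.binomial(1, 0.5, size=(n_samples, d))
    preds = np.array([predict_fn(z * m) for m in masks])
    # Weight perturbations by their proximity to the original sample
    distances = 1.0 - masks.mean(axis=1)          # fraction of features removed
    weights = np.exp(-(distances ** 2) / sigma ** 2)
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, preds, sample_weight=weights)
    return surrogate.coef_                        # contribution of each feature
```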

Quality criteria and goals for interpretable proxies are [TAGD98]: accuracy of the standalone surrogate model on unseen examples, fidelity of the approximation by the proxy, consistency with respect to different training sessions, and comprehensibility measured by the complexity of the rule set (number of rules, number of hierarchical dependencies). These criteria are usually in conflict and need to be balanced: better accuracy may require a more complex and thus less comprehensible set of rules.

Approaches for interpretable proxies differ in the validity range of the representations: some aim for surrogates that are only valid locally around specific samples, as in LIME [RSG16] or in [RSS18] via inductive logic programming, while other approaches try to approximate aspects of the model behavior more globally. Another categorization is defined by whether full access (white box), some access (gray box), or no access (black box) to the DNN internals is needed. One can further differentiate between post hoc approaches that are applied to a trained model and approaches that try to integrate or enforce symbolic representations during training. Post hoc methods cover the wide field of rule extraction techniques for DNNs; the interested reader may refer to [AK12, Hai16]. Most white- and gray-box methods try to turn the DNN connections into if-then rules that are then simplified, as done in DeepRED [ZLMJ16]. A black-box example is validity interval analysis [Thr95], which refines or generalizes rules on input intervals, either starting from one sample or from a general set of rules. Enforcement of symbolic representations can be achieved by enforcing an output structure that provides insights into the decision logic, such as textual explanations or a rich output structure allowing the investigation of correlations [XLZ+18]. An older discipline for enforcing symbolic representations is the field of neural-symbolic learning [SSZ19]; the idea is based on a hybrid learning cycle in which a symbolic learner and a DNN iteratively update each other via rule insertion and extraction. The comprehensibility of global surrogate models suffers from the complexity and size of current DNNs; thus, stronger rule simplification methods are required [Hai16]. The alternative direction of local approximations mostly concentrates on linear models instead of more expressive rules [Thr95, RSS18]. Furthermore, balancing the quality objectives is hard since the available indicators for interpretability may not be ideal. And lastly, applicability is heavily constrained by the requirement of interpretable input features, which are usually not readily available from the input (often pixel level) or the DNN output. Supervised extraction approaches vary in their fidelity, and unsupervised ones, such as the super-pixel clusters of LIME, are not guaranteed to yield meaningful or interpretable results.

6 Uncertainty

Uncertainty refers to the view that a neural network is not conceived as a deterministic function but as a probabilistic function or estimator, delivering a distribution for each input point. Ideally, the mean value of this distribution should be as close as possible to the ground-truth value of the function being approximated by the neural network, and the uncertainty of the neural network refers to its variance when considered as a random variable, thus allowing a confidence with respect to the mean value to be derived. Regarding safety, the variance provides an estimate of the confidence associated with a specific network output and opens the option of discarding network outputs with insufficient confidence.

There are roughly two broad approaches for training neural networks as probabilistic functions: on the one hand, parametric approaches [KG17] and Bayesian neural networks such as [BCKW15], where the transitions along the network edges are modeled as probability distributions; on the other hand, ensemble-based approaches [LPB17, SOF+19], where multiple networks are trained and considered as samples of a common output distribution. Apart from training as a probabilistic function, uncertainty measures have also been derived from single, standard neural networks by post-processing the trained network logits, leading, for example, to calibration measures (cf., e.g., [SOF+19]).

6.1 Generative Models

Generative models belong to the class of unsupervised machine learning models. From a theoretical perspective, these are particularly interesting, because they offer a way to analyze and model the density of data. Given a finite dataset \(\mathcal {D}\) independently distributed according to some distribution \(p(\mathbf {x}), \mathbf {x}\in \mathcal {D}\), generative models aim to estimate or enable sampling from the underlying density \(p(\mathbf {x})\) in a model \(\mathbf {F}(\mathbf {x},\boldsymbol{\theta })\). The resulting model can be used for data indexing [Wes04], data retrieval [ML11], for visual recognition [KSH12], speech recognition and generation [HDY+12], language processing [KM02, CE17], and robotics [Thr02]. Following [OE18], we can group generative models into two main classes:

  • Cost function-based models, such as autoencoders [KW14, Doe16], deep belief networks [Hin09], and generative adversarial networks [GPAM+14, RMC15, Goo17].

  • Energy-based models [LCH+07, SH09], where the joint probability density is modeled by an energy function.

Besides these deep learning approaches, generative models have been studied in machine learning in general for quite some time (cf. [Fry77, Wer78, Sil86, JMS96, She04, Sco15, Gra18]). A very prominent example are Gaussian processes [WR96, Mac98, WB98, Ras03] and their deep learning extensions [DL13, BHLHL+16] as generative models.

An example of a generative model being employed for image segmentation uncertainty estimation is the probabilistic U-Net [KRPM+18]. Here, a variational autoencoder (VAE) conditioned on the image is trained to model uncertainties. Samples from the VAE are fed into a segmentation U-Net, which can thus give different results for the same image. This was tested in the context of medical images, where inter-rater disagreements lead to uncertain segmentation results, and on Cityscapes segmentation. For the Cityscapes segmentation, the investigated use case was label ambiguity (e.g., is a BMW X7 a car or a van) using artificially created, controlled ambiguities. Results showed that the probabilistic U-Net could reproduce the segmentation ambiguity modes more reliably than competing methods, such as a dropout U-Net, which is based on techniques elaborated in the next section.

6.2 Monte Carlo Dropout

A widely used technique to estimate model uncertainty is Monte Carlo (MC) dropout [GG16] that offers a Bayesian motivation, conceptual simplicity, and scalability to application-size networks. This combination distinguishes MC dropout from competing Bayesian neural network (BNN) approximations (e.g., [RBB18, BCKW15], see Sect. 6.3). However, these approaches and MC dropout share the same goal: to equip neural networks with a self-assessment mechanism that detects unknown input concepts and thus potential model insufficiencies.

On a technical level, MC dropout assumes prior distributions on network activations, usually independent and identically distributed (i.i.d.) Bernoulli distributions. Model training with iteratively drawn Bernoulli samples, the so-called dropout masks, then yields a data-conditioned posterior distribution within the chosen parametric family. It is interesting to note that this training scheme was used earlier—independent from an uncertainty context—for better model generalization [SHK+14]. At inference, sampling provides estimates of the input-dependent output distributions. The spread of these distributions is then interpreted as the prediction uncertainty that originates from limited knowledge of model parameters. Borrowing “frequentist” terms, MC dropout can be considered as an implicit network ensemble, i.e., as a set of networks that share (most of) their parameters.

In practice, MC dropout requires only a minor modification of the optimization objective during training and multiple, trivially parallelizable forward passes during inference. The loss modification is largely agnostic to network architecture and does not cause substantial overhead. This is in contrast to the sampling-based inference, which increases the computational effort massively, by estimated factors of 20–100 compared to networks without MC dropout. A common practice is therefore the use of last-layer dropout [SOF+19], which reduces the computational overhead to estimated factors of 2–10. Alternatively, analytical moment propagation allows sampling-free MC dropout inference at the price of additional approximations (e.g., [PFC+19]). Further extensions of MC dropout target the integration of data-inherent (aleatoric) uncertainty [KG17] and tuned performance by learning layer-specific dropout rates using concrete relaxations [GHK17].
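A minimal inference sketch (assuming a PyTorch classifier that contains dropout layers; names are placeholders) illustrates the sampling step: dropout is kept active at test time and the spread of the sampled predictions is read as model uncertainty.

```python
import torch

def mc_dropout_predict(model, x, n_samples=30):
    """Monte Carlo dropout inference for a classifier."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()          # keep only the dropout layers stochastic at test time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )                      # shape (n_samples, batch, classes)
    mean = probs.mean(dim=0)   # predictive distribution
    std = probs.std(dim=0)     # spread across samples = uncertainty estimate
    return mean, std
```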

The quality of MC dropout uncertainties is typically evaluated using negative log-likelihood (NLL), expected calibration error (ECE) and its variants (cf. Sect. 6.6) and by considering correlations between uncertainty estimates and model errors (e.g., AUSE [ICG+18]). Moreover, it is common to study how useful uncertainty estimates are for solving auxiliary tasks like out-of-distribution classification [LPB17] or robustness w.r.t. adversarial attacks.

MC dropout is a workhorse of safe ML, being used with various networks and for a multitude of applications (e.g., [BFS18]). However, several authors have pointed out shortcomings and limitations of the method: MC dropout bears the risk of overconfident false predictions [Osb16], offers less diverse uncertainty estimates compared to the (equally simple and scalable) deep ensembles ([LPB17], see Sect. 7.1), and provides only rudimentary approximations of the true posterior.

Relaxing these modeling assumptions and strengthening the Bayesian motivation of MC dropout is therefore an important research avenue. Further directions for future work are the development of semantic uncertainty mechanisms (e.g., [KRPM+18]), improved local uncertainty calibrations, and a better understanding of the outlined sampling-free schemes to uncertainty estimation.

6.3 Bayesian Neural Networks

As the name suggests, Bayesian neural networks (BNNs) are inspired by a Bayesian interpretation of probability (for an introduction cf. [Mac03]). In essence, it rests on Bayes’ theorem,

$$\begin{aligned} p(a|b) p(b) = p(a,b) = p(b | a) p(a) \quad \Rightarrow \quad p(a| b) = \frac{ p(b | a) p(a)}{ p(b) } \,, \end{aligned}$$
(2)

stating that the conditional probability density function (PDF) p(a|b) for a given b may be expressed in terms of the inverted conditional PDF p(b|a). For machine learning, where one intends to make predictions y for unknown \(\mathbf {x}\) given some training data \(\mathcal {D}\), this can be reformulated into

$$\begin{aligned} y = {\text {NN}}(\mathbf {x}|\boldsymbol{\theta }) \quad \text {with}\quad p(\boldsymbol{\theta } | \mathcal {D}) = \frac{p(\mathcal {D} | \boldsymbol{\theta }) p(\boldsymbol{\theta })}{p(\mathcal {D})} \,. \end{aligned}$$
(3)

Therein, NN denotes a conventional (deep) neural network (DNN) with model parameters \(\boldsymbol{\theta }\), e.g., the set of weights and biases. In contrast to a regular DNN, the weights are given in terms of a probability distribution \( p(\boldsymbol{\theta } | \mathcal {D}) \), which also turns the output y of a BNN into a distribution. This allows one to study the mean \(\mu =\langle y^1 \rangle \) of the DNN for a given \(\mathbf {x}\) as well as higher moments of the distribution; typically, the resulting variance \(\sigma ^2 = \langle (y-\mu )^2 \rangle \) is of interest, where

$$\begin{aligned} \left\langle y^k \right\rangle = \int \! {\text {NN}}(\mathbf {x}|\boldsymbol{\theta })^k p(\boldsymbol{\theta }|\mathcal {D})\,\mathrm {d}\boldsymbol{\theta }\,. \end{aligned}$$
(4)

While \(\mu \) yields the output of the network, the variance \(\sigma ^2\) is a measure of the model’s uncertainty about the prediction at the given point. Central to this approach is the probability of the data given the model, here denoted by \(p(\mathcal {D} | \boldsymbol{\theta }) \), as it is the key component connecting model and training data. Typically, the evidence \(p(\mathcal {D})\) is “ignored” as it only appears as a normalization constant within the averages, see (4). In cases where the data \(\mathcal {D}\) is itself a distribution due to inherent uncertainty, i.e., in the presence of aleatoric risk [KG17], such a concept seems natural. However, Bayesian approaches are also applicable in all other cases; in those, loosely speaking, the likelihood of \(\boldsymbol{\theta }\) is determined via the chosen loss function (for the connection between the two concepts cf. [Bis06]).
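In practice, the integral in (4) is intractable and is approximated by Monte Carlo sampling over weights drawn from (an approximation of) the posterior. The following minimal sketch assumes a hypothetical sample_weights() helper that is not part of the original text:

```python
import torch

def predictive_moments(nn_forward, sample_weights, x, n_samples=50):
    """Monte Carlo approximation of the moments in Eq. (4).

    nn_forward     : function (x, theta) -> prediction of the network NN
    sample_weights : function () -> one weight sample theta ~ p(theta | D)
    """
    ys = torch.stack([nn_forward(x, sample_weights()) for _ in range(n_samples)])
    mu = ys.mean(dim=0)                  # approximates <y^1>
    var = ((ys - mu) ** 2).mean(dim=0)   # approximates sigma^2 = <(y - mu)^2>
    return mu, var
```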

On this general level, Bayesian approaches are broadly accepted and also find use for many other model classes besides neural networks. However, the loss surfaces of DNNs are known for their high dimensionality and strong non-convexity. Typically, there are abundant parameter combinations \(\boldsymbol{\theta }\) that lead to (almost) equally good approximations of the training data \(\mathcal {D}\) with respect to a chosen loss. This makes an evaluation of \(p(\boldsymbol{\theta }|\mathcal {D})\) for DNNs close to impossible in full generality; at least, no (exact) solutions for this case exist at the moment. Finding suitable approximations to the posterior distribution \(p(\boldsymbol{\theta }|\mathcal {D})\) is an ongoing challenge in the construction of BNNs. At this point, we only summarize two major research directions in the field. One approach is to assume that the distribution factorizes. While the full solution would be a joint distribution implying correlations between different weights, possibly even across layers, this approximation takes each element of \(\boldsymbol{\theta }\) to be independent from the others. Although this is a strong assumption, it is often made; in this case, the parameters of the respective distribution of each element can be learned via training (cf. [BCKW15]). The second class of approaches focuses on the region of the loss surface around the minimum chosen for the DNN. As discussed, the loss relates to the likelihood, and quantities such as the curvature at the minimum are therefore directly connected to the distribution of \(\boldsymbol{\theta }\). Unfortunately, already the use of such quantities requires further approximations [RBB18]. Alternatively, the convergence of the training process may be altered to sample networks close to the minimum [WT11]. While this approach captures information about correlations among the weights, it is usually restricted to a specific minimum. For a non-Bayesian approach taking into account several minima, see deep ensembles in Sect. 7.1. BNNs also touch other concepts such as MC dropout (cf. [GG16] and Sect. 6.2), or prior networks, which are based on a Bayesian interpretation but use conventional DNNs with an additional (learned) \(\sigma \) output [MG18].

6.4 Uncertainty Metrics for DNNs in Frequentist Inference

Classical uncertainty quantification methods in frequentist inference are mostly based on the outputs of statistical models. Their uncertainty is quantified and assessed, for instance, via dispersion measures in classification (such as entropy, probability margin, or variation ratio), or confidence intervals in regression. However, the nature of DNN architectures [RDGF15, CPK+18] and the cutting-edge applications tackled by them (e.g., semantic segmentation, cf. [COR+16]) open the way toward more elaborate uncertainty quantification methods. Besides the mentioned classical approaches, intermediate feature representations within a DNN (cf. [CZYS19, OSM19]) or gradients that represent re-learning stress under self-affirmation (see [ORG18]) reveal additional information. In addition, in the case of semantic segmentation, the geometry of a prediction may give access to further information, cf. [MRG20, RS19, RCH+20]. By computing statistics of these quantities as well as low-dimensional representations thereof, we obtain more elaborate uncertainty quantification methods specifically designed for DNNs that can help to detect misclassifications and out-of-distribution objects (cf. [HG17]). Features gathered during a forward pass of a data point \(\mathbf {x}\) through a DNN \(\mathbf {F}\) can be considered layer-wise, i.e., \(\mathbf {F}_\ell (\mathbf {x})\) after the \(\ell \)-th layer. These can be condensed into a handful of quantities per layer [OSM19] or further processed by another DNN that aims at detecting errors [CZYS19]. While, in particular, [OSM19] presents a proof of concept on small-scale classification problems, the applicability to large-scale datasets and problems, such as semantic segmentation and object detection, remains open.

The development of gradient-based uncertainty quantification methods [ORG18] is guided by one central question: if the present prediction were true, how much re-learning would this require? The corresponding hypothesis is that wrong predictions would be more in conflict with the knowledge encoded in the deep neural network than correct ones, therefore causing increased re-learning stress. Given a predicted class

$$\begin{aligned} \hat{y} = \mathop {\mathrm {arg\,max}}_{s\in \mathcal {S}} \mathbf {F}(\mathbf {x}) \end{aligned}$$
(5)

we compute the gradient of layer \(\ell \) corresponding to the predicted label. That is, given a loss function J, we compute

$$\begin{aligned} \nabla _{\ell } J( \hat{y}, \mathbf {x}, \boldsymbol{\theta }) \end{aligned}$$
(6)

via backpropagation. The obtained quantities can be treated similarly to the forward-pass features. While this concept seems to be prohibitively expensive for semantic segmentation (at least when calculating gradients for each pixel of the multidimensional output \(\hat{\mathbf {y}}\)), its application to object detection might be feasible, in particular with respect to offline applications. Gradients are also of special interest in active learning with query by expected model change (cf. [Set10]).
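The following sketch (PyTorch, placeholder names; a simplification of the idea, not the implementation of [ORG18]) illustrates the concept for classification: the loss is evaluated with the own prediction as if it were the ground truth, and the resulting gradient norm is used as an uncertainty score.

```python
import torch
import torch.nn.functional as F

def gradient_uncertainty(model, x):
    """Self-affirmation gradient score: re-learning stress for the own prediction."""
    model.eval()
    logits = model(x)                       # shape (1, n_classes)
    y_hat = logits.argmax(dim=-1)           # predicted label, cf. Eq. (5)
    loss = F.cross_entropy(logits, y_hat)   # loss w.r.t. the own prediction
    model.zero_grad()
    loss.backward()                         # gradients as in Eq. (6)
    # Aggregate, e.g., the gradient norm of the last parameter tensor
    last_param = list(model.parameters())[-1]
    return last_param.grad.norm().item()    # larger norm = higher uncertainty
```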

In the context of semantic segmentation, geometrical information on segments’ shapes as well as neighborhood relations of predicted segments can be taken into account alongside dispersion measures. It has been demonstrated [RS19, MRG20, RCH+20] that the detection of errors in an in-distribution setting strongly benefits from geometrical information. Recently, this has also been considered in scenarios under moderate domain shift [ORF20]. However, its applicability to out-of-distribution examples and to sensors other than the camera remains subject to further research.

The problem of realistically quantifying uncertainty will be taken up in Chapter “Uncertainty Quantification for Object Detection: Output- and Gradient-based Approaches” [RSKR22]. There, output-based and learning-gradient-based uncertainty metrics for object detection will be examined, showing that they are uncorrelated. Thus, a combination of both paradigms leads, as is demonstrated, to a better object detection uncertainty estimate and, by extension, to a better overall detection accuracy.

6.5 Markov Random Fields

Although deep neural networks are currently the state of the art for almost all computer vision tasks, Markov random fields (MRF) remain one of the fundamental techniques for many of these tasks, specifically image segmentation [LWZ09, KK11]. The power of MRFs lies in their ability to model dependencies between the pixels of an image: using energy functions, MRFs combine unary and pair-wise potentials over pixels into a joint model [WKP13]. Given the model, MRF inference seeks the optimal configuration yielding the lowest energy, mainly via maximum a posteriori (MAP) techniques. Several MAP inference approaches are used to obtain the optimal configuration, such as graph cuts [KRBT08] and belief propagation algorithms [FZ10]. However, as with neural networks, MAP inference techniques result in deterministic point estimates of the optimal configuration without any sense of uncertainty in the output. To obtain uncertainties from MRFs, most of the work is directed toward modeling MRFs with Gaussian distributions. Uncertainties can typically be obtained from such Gaussian MRFs in two ways: either approximate models of the posterior are inferred, from which sampling is easy or variances can even be estimated analytically, or approximate sampling from the posterior is used directly. Approximate models include those inferred using variational Bayesian (VB) methods, like mean-field approximations, and Gaussian process (GP) models enforcing a simplified prior model [Bis06, LUAD16]. Examples of approximate sampling methods include traditional Markov chain Monte Carlo (MCMC) methods like Gibbs sampling [GG84]. Some recent theoretical advances propose the perturb-and-MAP framework and a Gumbel perturbation model (GPM) [PY11, HMJ13] to sample exactly from MRF distributions. Another line of work uses MAP inference techniques to estimate the probability of the network output: using graph cuts, Kohli and Torr [KT08] estimate uncertainty via the min-marginals associated with the label assignments of a random field; this work was later extended to techniques other than graph cuts [TA12] and to uncertainties on multi-label marginal distributions [STP17]. A current research direction is the incorporation of MRFs into deep neural networks, along with providing uncertainties on the output [SU15, CPK+18]. This can also be extended to other forms of neural networks, such as recurrent neural networks, to provide uncertainties on the segmentation of video streams by extending the dependencies of pixels to previous frames [ZJRP+15, LLL+17].

6.6 Confidence Calibration

Neural network classifiers output a label \(\hat{Y} \in \mathcal {Y}\) for a given input \(\mathbf {x} \in \mathcal {D}\) with an associated confidence \(\hat{P}\). This confidence can be interpreted as the probability that the predicted label matches the ground-truth label \(Y \in \mathcal {Y}\). Therefore, these probabilities should reflect the “self-confidence” of the system. A model is called well calibrated if the empirical accuracy at any confidence level matches the predicted confidence. Accordingly, a classification model is perfectly calibrated if

$$\begin{aligned}&\underbrace{P(\hat{Y} = Y | \hat{P} = p)}_{\text {accuracy given } p} = \underbrace{p}_{\text {confidence}} \quad \forall p \in [0,1] \end{aligned}$$
(7)

is fulfilled [GPSW17]. For example, assume 100 predictions with confidence values of 0.9. We call the model well calibrated if 90 out of these 100 predictions are actually correct. However, recent work has shown that modern neural networks tend to be overconfident in their predictions [GPSW17]. The deviation of a model from perfect calibration can be measured by the expected calibration error (ECE) [NCH15]. It is possible to recalibrate models in a post-processing step after classification. One way to obtain a calibration mapping is to group all predictions into several bins by their confidence. Using such a binning scheme, it is possible to compute the empirical accuracy for certain confidence levels, as has been known for a long time from reconstructing confidence outputs for Viterbi decoding [HR90]. Common methods are histogram binning [ZE01], isotonic regression [ZE02], or more advanced methods like Bayesian binning into quantiles (BBQ) [NCH15] and ensembles of near-isotonic regression (ENIR) [NC16]. Another way to obtain a calibration mapping is to use scaling methods based on logistic regression like Platt scaling [Pla99], temperature scaling [GPSW17], and beta calibration [KSFF17].
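As a concrete illustration of the binning idea behind the ECE, the following sketch (numpy, placeholder names) computes the weighted gap between confidence and accuracy per bin:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between confidence and accuracy per bin.

    confidences : array of predicted confidences in [0, 1]
    correct     : binary array, 1 if the prediction matched the ground truth
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    n = len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in the bin
            conf = confidences[mask].mean()  # average confidence in the bin
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```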

In the setting of probabilistic regression, a model is calibrated if, e.g., 95% of the true target values are below or equal to the predicted credible level of 95% (so-called quantile-calibrated regression) [GBR07, KFE18, SDKF19]. A regression model is usually calibrated by fine-tuning its predicted CDF in a post-processing step to match the empirical frequency. Common approaches utilize isotonic regression [KFE18], logistic and beta calibration [SKF18], as well as Gaussian process models [SKF18, SDKF19] to build a calibration mapping. In contrast to quantile-calibrated regression, [SDKF19] have recently introduced the concept of distribution calibration, where calibration is applied on the distribution level and naturally leads to calibrated quantiles. Recent work has shown that miscalibration in the scope of object detection also depends on the position and scale of a detected object [KKSH20]. The additional box regression output is denoted by \(\hat{\mathbf {R}}\), with A being the size of the used box encoding. Furthermore, if we have no knowledge about all anchors of a model (which is a common case in many applications), it is not possible to determine the accuracy. Therefore, Küppers et al. [KKSH20] use the precision as a surrogate for accuracy and propose that an object detection model is perfectly calibrated if

$$\begin{aligned}&\underbrace{P(M=1 | \hat{P} = p, \hat{Y} = y, \hat{\mathbf {R}} = \mathbf {r})}_{\text {precision given } p, y, \mathbf {r}} = \underbrace{p}_{\text {confidence}} \quad \forall p \in [0,1], y \in \mathcal {Y}, \mathbf {r} \in \mathbb {R}^A \end{aligned}$$
(8)

is fulfilled, where \(M=1\) denotes a correct prediction that matches a ground-truth object with a chosen IoU threshold and \(M=0\) denotes a mismatch, respectively. The authors propose the detection-expected calibration error (D-ECE) as the extension of the ECE to object detection tasks in order to measure miscalibration also as a function of the position and scale of detected objects. Other approaches try to fine-tune the regression output in order to obtain more reliable object proposals [JLM+18, RTG+19] or add a regularization term to the training objective such that training yields models that are both well performing and well calibrated [PTC+17, SSH19].

In Chapter “Confidence Calibration for Object Detection and Segmentation” [KHKS22], an extension of the ECE for general multivariate learning problems, such as object detection or image segmentation, will be discussed. In fact, the proposed multivariate confidence calibration is designed to take additional information into account, such as bounding boxes or shape descriptors. It is shown that this extended calibration reduces the calibration error as expected but also has a positive bearing on the quality of segmentation masks.

7 Aggregation

From a high-level perspective, a neural network processes inputs and comes to some output conclusion, e.g., mapping incoming image data onto class labels. Aggregation or collection of non-independent information on either the input or output side of this network function can be used as a tool to improve its performance and reliability. Starting with the input, any additional “dimension” along which data can be added may be of use. For example, in the context of automated vehicles, this might be input from any further sensor measuring the same scene as the original one, e.g., stereo cameras or LiDAR. Combining such sensor sets for prediction is commonly referred to as sensor fusion [CBSW19]. Staying with the example, the scene is monitored consecutively, providing a whole (temporally ordered) stream of input information. This may be used either by adjusting the network for this kind of input [KLX+17] or in a post-processing step, in which the predictions are aggregated by some measure of temporal consistency.

Another more implicit form of aggregation is training the neural network on several “independent” tasks, e.g., segmentation and depth regression. Although the individual task is executed on the same input, the overall performance can still benefit from the correlation among all given tasks. We refer to the discussion on multi-task networks in Sect. 9.2. By extension, solving the same task in multiple different ways can be beneficial for performance and provide a measure of redundancy. In this survey, we focus on single-task systems and discuss ensemble methods in the next section and the use of temporal consistency in the one thereafter.

7.1 Ensemble Methods

Training a neural network means optimizing its parameters to fit a given training dataset. The commonly used gradient-based optimization schemes cause convergence to a “nearby” local minimum. As the loss landscapes of neural networks are notoriously non-convex [CHM+15], various locally optimal model parameter sets exist. These local optima differ in their degree of optimality (“deepness”), their qualitative characteristics (“optimal for different parts of the training data”), and their generalizability to unseen data (commonly referred to by the geometrical terms of “sharpness” and “flatness” of minima [KMN+17]).

A single trained network corresponds to one local minimum of such a loss landscape and thus captures only a small part of a potentially diverse set of solutions. Network ensembles are collections of models and therefore better suited to reflect this multi-modality. Various modeling choices shape a loss landscape: the selected model class and its meta-parameters (like architecture and layer width), the training data and the optimization objective. Accordingly, approaches to diversify ensemble components range from combinations of different model classes over varying training data (bagging) to methods that train and weight ensemble components to make up for the flaws of other ensemble members (boosting) [Bis06].

Given the millions of parameters of application-size networks, ensembles of NNs are resource-demanding w.r.t. computational load, storage, and runtime during training and inference. For naïve ensembling, this complexity increases linearly with the ensemble size. Several approaches have been put forward to reduce some dimensions of this complexity: snapshot ensembles [HLP+17] require only one model optimization with a cyclical learning-rate schedule, leading to an optimized training runtime. The resulting training trajectory passes through several local minima, and the corresponding models compose the ensemble. Model distillation [HVD14], on the contrary, tackles runtime at inference: it “squeezes” an NN ensemble into a single model that is optimized to capture the gist of the model set. However, such a compression goes along with reduced performance compared to the original ensemble.
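As an illustration of the snapshot idea, the following sketch (PyTorch, placeholder names; a simplification of [HLP+17], not its reference implementation) stores one ensemble member at the end of each cosine learning-rate cycle of a single training run:

```python
import copy
import torch

def train_snapshot_ensemble(model, loader, loss_fn,
                            n_cycles=5, epochs_per_cycle=40, lr_max=0.1):
    """Collect ensemble members from the local minima visited during cyclic-LR training."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_max, momentum=0.9)
    snapshots = []
    for _ in range(n_cycles):
        for group in optimizer.param_groups:
            group["lr"] = lr_max                      # warm restart of the learning rate
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=epochs_per_cycle, eta_min=0.0)
        for _ in range(epochs_per_cycle):
            for x, y in loader:
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
            scheduler.step()                          # anneal once per epoch
        # The end of a cycle corresponds to a local minimum: store a snapshot
        snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots
```

At inference, the snapshots are loaded into copies of the model and their predictions are averaged, just as for an ensemble of independently trained networks.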

Several hybrids of single model and model ensemble exist: multi-head networks [AJD18] share a backbone network that provides inputs to multiple prediction networks. Another variant is the mixture-of-experts model, which utilizes a gating network to assign inputs to specialized expert networks [SMM+17]. Multi-task networks (cf. Sect. 9.2) and Bayesian approximations of NNs (cf. Sects. 6.2 and 6.3) can be seen as implicit ensembles.

NN ensembles (or deep ensembles) are not only used to boost model quality. They constitute the frequentist approach to estimating NN uncertainties and are state of the art in this regard [LPB17, SOF+19]. The emerging field of federated learning is concerned with the integration of decentrally trained ensemble components [MMR+17], and safety-relevant applications of ensembling range from automated driving [Zha12] to medical diagnostics [RRMH17]. Taking this safe-ML perspective, promising research directions comprise a more principled and efficient composition of model ensembles, e.g., by application-driven diversification, as well as improved techniques to miniaturize ensembles and a better understanding of methods like model distillation. In the long run, better designed, more powerful learning systems might partially reduce the need for combining weaker models in a network ensemble.

In Chapter “Evaluating Mixture-of-Expert Architectures for Network Aggregation” [PHW22], the advantages and drawbacks of mixture-of-experts architectures with respect to robustness, interpretability, and overall test performance will be discussed. Two state-of-the-art architectures are investigated, reaching, and sometimes surpassing, baseline performance. What is more, the models exhibit improved awareness of OoD data.

7.2 Temporal Consistency

The focus of previous DNN development for semantic segmentation has been on single-image prediction. This means that the final and intermediate results of the DNN are discarded after each image. However, the application of a computer vision model often involves the processing of images in a sequence, i.e., there is a temporal consistency in the image content between consecutive frames (for a metric, cf., e.g., [VBB+20]). This consistency has been exploited in previous work to increase quality and reduce computing effort. Furthermore, this approach offers the potential to improve the robustness of DNN prediction by incorporating this consistency as a priori knowledge into DNN development. The relevant work in the field of video prediction can be divided into two major approaches:

  1. DNNs are specially designed for video prediction. This usually requires training from scratch and the availability of training data in a sequence.

  2. A transformation from single-prediction DNNs to video-prediction DNNs takes place. Usually no training is required, i.e., the existing weights of the model can be used unaltered.

The first set of approaches often involves conditional random fields (CRF) and their variants. CRFs are known for their use as a post-processing step in the prediction of semantic segmentation, in which their parameters are learned separately or jointly with the DNN [ZJRP+15]. Another way to use spatiotemporal features is to include 3D convolutions, which add an additional dimension to the conventional 2D convolutional layer. Tran et al. [TBF+15] use 3D convolution layers for video recognition tasks, such as action and object recognition. A further approach to exploit the spatial and temporal characteristics of the input data is to integrate long short-term memory (LSTM) [HS97], a variant of the recurrent neural network (RNN). Fayyaz et al. [FSS+16] integrate LSTM layers between the encoder and decoder of their convolutional neural network for semantic segmentation. The significantly higher GPU memory requirements and computational effort are disadvantages of this method. More recently, Nilsson and Sminchisescu [NS18] deployed gated recurrent units, which generally require significantly less memory. An approach to improve the temporal consistency of automatic speech recognition outputs is known as the posterior-in-posterior-out (PIPO) LSTM “sequence enhancer”, a postfilter that could be applicable to video processing as well [LSFs19]. A disadvantage of the described methods is that sequential training data must be available, which may be limited or lack diversity.

The second class of approaches has the advantage that it is mostly model-independent. Shelhamer et al. [SRHD16] found that the deep feature maps within the network change only slightly with temporal changes in video content. Accordingly, [GJG17] calculate the optical flow of the input images from time steps \(t=0\) and \(t={-1}\) and convert it into the so-called transform flow, which is used to warp the feature maps of time step \(t={-1}\) into a representation aligned with the feature map at \(t=0\). Sämann et al. [SAMG19] use a confidence-based combination of feature maps from previous time steps based on the calculated optical flow.
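A minimal sketch of such flow-based feature reuse (PyTorch; a generic grid_sample-based warping with placeholder names, assuming the flow is given at feature-map resolution and is not the transform flow of [GJG17]):

```python
import torch
import torch.nn.functional as F

def warp_features(feat_prev, flow):
    """Warp the previous frame's feature map to the current frame using optical flow.

    feat_prev : features of frame t-1, shape (N, C, H, W)
    flow      : flow from frame t to frame t-1 in pixels, shape (N, 2, H, W)
    """
    n, _, h, w = feat_prev.shape
    # Build a base sampling grid of pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(h, device=flow.device),
                            torch.arange(w, device=flow.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    coords = base + flow                                # where to sample in frame t-1
    # Normalize coordinates to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)    # shape (N, H, W, 2)
    return F.grid_sample(feat_prev, grid, align_corners=True)
```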

8 Verification and Validation

Verification and validation is an integral part of the safety assurance for any safety-critical system. According to the functional safety standard for automotive systems [ISO18], verification means determining whether given requirements, such as performance goals, are met [ISO18, 3.180]. Validation, on the other hand, tries to assess whether the given requirements are sufficient and adequate to guarantee safety [ISO18, 3.148], e.g., whether certain types of failures or interactions were simply overlooked. The latter is usually achieved via extensive testing in real operation conditions of the integrated product. This differs from the notion of validation used in the machine learning community, where it usually refers to simple performance tests on a selected dataset. In this section, we concentrate on general verification aspects for deep neural networks.

Verification in the safety domain encompasses (manual) inspection and analysis activities as well as testing. However, the contribution of single processing steps within a neural network to the final behavior can hardly be assessed manually (cf. the problem of interpretability in Sect. 5). Therefore, we will concentrate here on different approaches to verification testing, i.e., formal testing and black-box methods.

The discussion on how to construct an assurance case and how to derive suitable acceptance criteria for ML models in the scope of autonomous driving, as well as how to determine and acquire sufficient evidence for such an assurance case, is provided in Chapter “Safety Assurance of Machine Learning for Perception Functions” [BHH+22]. The authors look into several use cases on how to systematically obtain evidence and incorporate it into a structured safety argumentation.

In Chapter “A Variational Deep Synthesis Approach for Perception Validation” [GHS22], a novel data generation framework is introduced, aiming at validating perception functions, in particular, in the context of autonomous driving. The framework allows for effectively defining and testing common traffic scenes. Experiments are performed for pedestrian detection and segmentation in versatile urban scenarios.

8.1 Formal Testing

Formal testing refers to testing methods that include formalized and formally verifiable steps, e.g., for test data acquisition or for verification in the local vicinity of test samples. For image data, local testing around valid samples is usually more practical than fully formal verification: (safety) properties are not expected to hold on the complete input space but only on the much smaller, unknown, lower dimensional manifold of real images [WGH19]. Such samples can either be real or generated using an input space formalization or a trained generative model.

Coverage criteria for the data samples are commonly used for two purposes: (a) deciding when to stop testing or (b) identifying missing tests. For CNNs, there are at least three different approaches toward coverage: (1) approaches that establish coverage based on a model with semantic features of the input space [GHHW20], (2) approaches trying to semantically cover the latent feature space of the neural network or of a proxy network (e.g., an autoencoder) [SS20a], and (3) approaches trying to cover neurons and their interactions, inspired by classical software white-box analysis [PCYJ19, SHK+19].

Typical types of properties to verify are simple test performance, local stability (robustness), a specific structure of the latent spaces like embedding of semantic concepts [SS20b], and more complex logical constraints on inputs and outputs, which can be used for testing when fuzzified [RDG18]. Most of these properties require in-depth semantic information about the DNN inner workings, which is often only available via interpreting intermediate representations [KWG+18], or interpretable proxies/surrogates (cf. Sect. 5.4), which do not guarantee fidelity.

There exist different testing and formal verification methods from classical software engineering that have already been applied to CNNs. Differential testing, as used by DeepXplore [PCYJ19], trains n different CNNs for the same task using independent datasets and compares the individual prediction results on a test set. This allows identifying inconsistencies between the CNNs but no common weak spots. Data augmentation techniques start from a given dataset and generate additional transformed data. Generic data augmentation for images, like rotations and translations, is state of the art for training but may also be used for testing. Concolic testing approaches incrementally grow test suites with respect to a coverage model to finally achieve completeness. Sun et al. [SWR+18] use an adversarial input model based on some norm (cf. Sect. 4.1), e.g., an \(L_p\)-norm, for generating additional images around a given image using concolic testing. Fuzzing generates new test data constrained by an input model and tries to identify interesting test cases, e.g., by optimizing the white-box coverage mentioned above [OOAG19]. Fuzzing techniques may also be combined with the differential testing approach discussed above [GJZ+18]. In all these cases, it needs to be ensured that the image as well as its meta-data remains valid for testing after transformation. Finally, proving methods, surveyed by Liu et al. [LAL+20], try to formally prove properties of a trained neural network, e.g., based on satisfiability modulo theories (SMT). These approaches require a formal characterization of an input space and of the property to be checked, which is hard for non-trivial properties like the contents of an image.
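A minimal sketch of the differential-testing idea (placeholder names; not the DeepXplore implementation) queries several independently trained models on the same test set and flags inputs on which they disagree:

```python
import torch

def differential_test(models, test_loader):
    """Collect test inputs on which independently trained models disagree."""
    disagreements = []
    with torch.no_grad():
        for batch_idx, (x, _) in enumerate(test_loader):
            preds = [m(x).argmax(dim=-1) for m in models]
            # An inconsistency is any input where not all models predict the same label
            agree = torch.stack(preds).eq(preds[0]).all(dim=0)
            for i in torch.nonzero(~agree).flatten().tolist():
                disagreements.append((batch_idx, i))   # batch index and position in batch
    return disagreements
```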

Existing formal testing approaches can be quite costly to integrate into testing workflows: differential testing and data augmentation require several inferences per initial test sample; concolic and fuzz testing apply an optimization to each given test sample, while convergence toward the coverage goals is not guaranteed; the iterative approaches also need tight integration into the testing workflow; and lastly, proving methods usually have to balance computational efficiency against the precision or completeness of the result [LAL+20]. Another challenge of formal testing is that machine learning applications usually solve problems for which no (formal) specification is possible. This makes it hard to find useful requirements for testing [ZHML20] and properties that can be formally verified. Even partial requirements, such as specifications of useful input perturbations, of corner cases, and of valuable coverage goals, are typically difficult to identify [SS20a, BTLFs20].

8.2 Black-Box Methods

In the machine learning literature, neural networks are often referred to as black boxes due to the fact that their internal operations and their decision-making are not completely understood [SZT17], hinting at a lack of interpretability and transparency. However, in this survey, we consider a black box to be a machine learning model to which we only have oracle (query) access [TZJ+16, PMG+17]. That means we can query the model to get input–output pairs, but we do not have access to the specific architecture (or weights, in the case of neural networks). As [OAFS18] describes, black boxes are increasingly widespread, e.g., in health care, autonomous driving, or ML as a service in general, due to proprietary, privacy, or security reasons. As deploying black boxes gains popularity, so do methods that aim to extract internal information, such as architecture and parameters, or to find out whether a sample belongs to the training dataset. These include model extraction attacks [TZJ+16, KTP+20], membership inference attacks [SSSS17], general attempts to reverse-engineer the model [OAFS18], or adversarial attacks against it [PMG+17]. Protection and counter-measures are also actively researched: Kesarwani et al. [KMAM18] propose a warning system that estimates how much information an attacker could have gained from queries. Adi et al. [ABC+18] use watermarks for models to prevent illegal re-distribution and to identify intellectual property.

Many papers in these fields make use of so-called surrogate, avatar, or proxy models that are trained on input–output pairs of the black box. In case the black-box output is available in soft form (e.g., logits), distillation, as first proposed by [HVD14], can be applied to train the surrogate (student) model. Any white-box analysis can then be performed on the surrogate (cf., e.g., [PMG+17]) to craft adversarial attacks targeted at the black box. More generally, (local) surrogates, as, for example, in [RSG16], can be used to (locally) explain the black box's decision-making. Moreover, these techniques are also of interest if one wants to compare or test black-box models (cf. Sect. 8.1). This is the case, among others, in ML marketplaces, where one wishes to buy a pre-trained model [ABC+18], or when one wants to verify or audit that a third-party black-box model obeys regulatory rules (cf. [CH19]).
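
A minimal sketch of how such a surrogate could be distilled from a black box with soft (logit) outputs is given below. Here, `query_black_box` is a hypothetical oracle function, and the transfer set and student architecture are our own choices, not prescribed by the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_surrogate(query_black_box, student: nn.Module, loader,
                      temperature: float = 4.0, epochs: int = 5, lr: float = 1e-3):
    """Train a white-box student on (input, black-box logits) pairs.

    `query_black_box(x)` is assumed to return the black box's logits for a
    batch x (oracle access only). The resulting student can then be analyzed
    or attacked with any white-box method.
    """
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                      # labels of the transfer set are not needed
            with torch.no_grad():
                teacher_logits = query_black_box(x)
            student_logits = student(x)
            # standard distillation loss: soft cross-entropy at temperature T
            loss = F.kl_div(
                F.log_softmax(student_logits / temperature, dim=1),
                F.softmax(teacher_logits / temperature, dim=1),
                reduction="batchmean",
            ) * temperature ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```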

Another topic of active research is that of so-called observers. The concept of an observer is to evaluate the interface of a black-box module in order to determine whether it behaves as expected within a given set of parameters. The approaches can be divided into model-explaining and anomaly-detecting observers. First, model explanation methods answer the question of which input characteristics are responsible for changes at the output. The observer is able to alter the inputs for this purpose. If the input of the model under test changes only slightly but the output changes drastically, this can be a signal that the neural network is misled, which is also strongly related to adversarial examples (cf. Sect. 4). Hence, tracing the reason for changes in the classification result back to the input can be very important. In order to figure out in which region of an input image the main reason for the classification is located, [FV17] “delete” information from the image by replacing regions with generated patches until the output changes. The replaced region is then likely responsible for the decision of the neural network. Building upon this, [UEKH19] adapt the approach to medical images and generate “deleted” regions by a variational autoencoder (VAE). Second, anomaly-detecting observers register input and output anomalies, either examining input and output independently or as an input–output pair, and predict the black-box performance in the current situation. In contrast to model-explaining approaches, this set of approaches has high potential to be used in an online scenario since it does not need to modify the model input. The maximum mean discrepancy (MMD) [BGR+06] measures the domain gap between two data distributions independently of the application and can be used to raise a warning if input or output distributions during inference deviate too strongly from their respective training distributions. Another approach uses a GAN-based autoencoder [LFK+20] to perform a domain shift estimation, where the Wasserstein distance is used as the domain mismatch metric. This metric can also be evaluated using a causal time-variant aggregation of distributions during inference time.
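
As a sketch of such an anomaly-detecting observer, the following computes a (biased) Gaussian-kernel MMD estimate between a reference batch of features (e.g., taken from training data) and a batch observed during inference. The bandwidth heuristic and the idea of thresholding the estimate are our own illustrative choices.

```python
import torch

def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, bandwidth=None) -> float:
    """Biased MMD^2 estimate between two feature batches x (n, d) and y (m, d)."""
    xy = torch.cat([x, y], dim=0)
    d2 = torch.cdist(xy, xy).pow(2)                    # pairwise squared distances
    if bandwidth is None:
        bandwidth = d2[d2 > 0].median().sqrt().item()  # simple median heuristic
    k = torch.exp(-d2 / (2 * bandwidth ** 2))
    n = x.size(0)
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    return (k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()).item()

# usage sketch: warn if the inference-time feature distribution drifts too far
# if gaussian_mmd(train_features, current_features) > threshold: raise_warning()
```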

9 Architecture

In order to solve a specific task, the architecture of a CNN and its building blocks play a significant role. Since the early days of using CNNs in image processing, when they were applied to handwriting recognition [LBD+89], and the later breakthrough in general image classification [KSH12], the architecture of the networks has changed radically. While the term deep learning initially implied a depth of approximately four layers for these first convolutional neural networks, their depth has increased significantly over the last years, and new techniques had to be developed to successfully train and utilize these networks [HZRS16]. In this context, new activation functions [RZL18] as well as new loss functions [LGG+17] have been designed and new optimization algorithms [KB15] have been investigated.

With regard to the layer architecture, the initially alternating repetition of convolution and pooling layers as well as their characteristics have changed significantly. The convolutional layers made the transition from a few layers with often large filters to many layers with small filters. A further trend was the definition of entire modules, which are used repeatedly within the overall architecture, following the so-called network-in-network principle [LCY14].

In areas such as automated driving, there is also a strong interest in the simultaneous execution of different tasks within one single convolutional neural network architecture. This kind of architecture is called multi-task learning (MTL) [Car97] and can be utilized to save computational resources and, at the same time, to increase the performance of each task [KTMFs20]. Within such multi-task networks, usually one shared feature extraction part is followed by one separate so-called head per task [TWZ+18].

In each of these architectures, manual design using expert knowledge plays the crucial role. In recent years, however, there have also been great efforts to automate the process of finding network architectures or, in the best case, to learn them. This is known as neural architecture search (NAS).

9.1 Building Blocks

Designing a convolutional neural network typically includes a number of design choices. The general architecture usually contains a number of convolutional and pooling layers, which are arranged in a certain pattern. Convolutional layers are commonly followed by a nonlinear activation function. The learning process is based on a loss function, which determines the current error, and an optimization function, which propagates the error back to the individual convolutional layers and their learnable parameters.

When CNNs became state of the art in computer vision [KSH12], they were usually built from a few alternating convolutional and pooling layers with a few fully connected layers at the end. It turned out that better results are achieved with deeper networks, and so the number of layers increased over the years [SZ15]. To deal with these deeper networks, new architectures had to be developed. In a first step, to reduce the number of parameters, convolutional layers with sometimes large filter kernels were replaced by several layers with small \(3 \times 3\) kernels. Today, most architectures are based on the network-in-network principle [LCY14], where more complex modules are used repeatedly. Examples of such modules are the inception module from GoogLeNet [SLJ+15] and the residual block from ResNet [HZRS16]. While the inception module consists of multiple parallel strings of layers, the residual block is based on the highway network [SGS15], which means that it can bypass the original information so that the layers in between only have to learn residuals. With ResNeXt [XGD+17] and Inception-ResNet [SIVA17], there already exist two networks that combine both approaches. For most tasks, it turned out that replacing the fully connected layers by convolutional layers is much more convenient, making the networks fully convolutional [LSD15]. These so-called fully convolutional networks (FCNs) are no longer bound to fixed input dimensions. Note that with the availability of convolutional long short-term memory (ConvLSTM) structures, fully convolutional recurrent neural networks (FCRNs) also became available for fully scalable sequence-based tasks [SCW+15, SDF+20].
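
For illustration, a minimal residual block in the spirit of [HZRS16] is sketched below. This is a simplified version; the original building blocks additionally use, e.g., a projection on the skip path when the number of channels changes.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: two 3x3 convolutions plus an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                 # bypass the original information
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)             # layers in between only learn the residual
```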

Inside CNNs, the rectified linear unit (ReLU) has been the most frequently used activation function for a long time. However, since this function suffers from problems related to the mapping of all negative values to zero, such as the vanishing gradient problem, new functions have been introduced in recent years. Examples are the exponential linear unit (ELU), swish [RZL18], and the non-parametric linearly scaled hyperbolic tangent (LiSHT) [RMDC19]. In order to be able to train a network consisting of these different building blocks, the loss function is the most crucial part. It determines how and what the network ultimately learns and how exactly the training data is used during the training process to make the network train faster or perform better. For example, the different classes in a classification network can be weighted with fixed values or by so-called \(\alpha \)-balancing according to their probability of occurrence. Another interesting approach is weighting training examples according to their easiness for the current network [LGG+17, WFZ19]. For multi-task learning, tasks can also be weighted based on their uncertainty [KGC18] or gradients [CBLR18], as further explained in Sect. 9.2. A closer look at how a modification of the loss function might affect safety-related aspects is given in Sect. 3, Modification of Loss.
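
As a sketch of such loss-weighting schemes, the following combines fixed per-class weights (α-balancing) with a focal-style down-weighting of easy examples in the spirit of [LGG+17]. The function signature and the parameter values are illustrative only, not a reference implementation of the cited works.

```python
import torch
import torch.nn.functional as F

def balanced_focal_loss(logits, targets, alpha, gamma: float = 2.0):
    """Cross-entropy with per-class weights `alpha` (α-balancing) and a focal
    term (1 - p_t)^gamma that down-weights examples that are already easy
    for the current network."""
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-sample CE, i.e., -log p_t
    p_t = torch.exp(-ce)                                      # probability of the true class
    class_weights = alpha[targets]                            # one fixed weight per class
    return (class_weights * (1.0 - p_t) ** gamma * ce).mean()

# usage sketch: alpha could, e.g., be derived from inverse class frequencies
# alpha = 1.0 / class_counts.float(); alpha = alpha / alpha.sum() * num_classes
```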

9.2 Multi-task Networks

Multi-task learning (MTL) in the context of neural networks describes the process of optimizing several tasks simultaneously by learning a unified feature representation [Car97, RBV18, GHL+20, KBFs20] and coupling the task-specific loss contributions, thereby enforcing cross-task consistency [CPMA19, LYW+19, KTMFs20].

A unified feature representation is usually implemented by sharing the parameters of the initial layers inside the encoder (also called the feature extractor). It not only improves the single tasks through more generalized learned features but also reduces the demand for computational resources at inference, since not an entirely new network has to be added for each task but only a task-specific decoder head. This is essential considering the growing number of visual perception tasks in automated driving, e.g., depth estimation, semantic segmentation, motion segmentation, and object detection. While the parameter sharing can be soft, as in cross-stitch [MSGH16] and sluice networks [RBAS18], or hard [Kok17, TWZ+18], meaning that the parameters are literally shared, the latter is usually preferred due to its straightforward implementation and lower computational complexity during training and inference.
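
A minimal sketch of hard parameter sharing is given below: one shared encoder feeds several task-specific heads. The concrete layer choices and task heads are placeholders picked for brevity, not an architecture proposed in the cited works.

```python
import torch.nn as nn

class HardSharedMultiTaskNet(nn.Module):
    """One shared feature extractor, one lightweight decoder head per task."""
    def __init__(self, num_classes: int = 19):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared by all tasks
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(64, num_classes, 1)      # semantic segmentation
        self.depth_head = nn.Conv2d(64, 1, 1)              # monocular depth estimation

    def forward(self, x):
        features = self.encoder(x)                         # computed only once per input
        return {"segmentation": self.seg_head(features),
                "depth": self.depth_head(features)}
```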

Compared to implicitly coupling tasks via a shared feature representation, cross-task losses offer a more direct way to optimize the tasks jointly. This is only possible because, during MTL, the network produces predictions for several tasks, which can be enforced to be consistent. As an example, sharp depth edges should only occur at class boundaries of the semantic segmentation prediction. Often, both approaches to MTL are applied simultaneously [YZS+18, CLLW19] to improve a neural network's performance as well as to reduce its computational complexity at inference.

While the theoretical expectations for MTL are quite clear, it is often challenging to find a good weighting strategy for all the different loss contributions, as there is no theoretical basis on which one could choose such a weighting; early approaches involved either heuristics or extensive hyperparameter tuning. The easiest way to balance the tasks is to use uniform weights across all tasks. However, the losses from different tasks usually have different scales, and uniformly averaging them suppresses the gradients from tasks with smaller losses. Addressing these problems, Kendall et al. [KGC18] propose to weight the loss functions by the homoscedastic uncertainty of each task. The weighting parameters of the loss functions do not need to be tuned by hand but are adapted automatically during the training process. Concurrently, Chen et al. [CBLR18] propose GradNorm, which does not explicitly weight the loss functions of different tasks but automatically adapts the gradient magnitudes coming from the task-specific network parts on the backward pass. Liu et al. [LJD19] propose dynamic weight average (DWA), which uses the evolution of the task losses over time to weight them.
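
A sketch of homoscedastic-uncertainty weighting in the spirit of [KGC18] is shown below. The log-variances are learned jointly with the network parameters; since the exact form of the regularizer differs between task types and formulations, this is an illustration under simplifying assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Combine task losses using learned homoscedastic task uncertainties."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))    # log(sigma_i^2), learned

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])             # 1 / sigma_i^2
            total = total + precision * loss + self.log_vars[i]  # penalty keeps sigma_i finite
        return total

# usage sketch: optimize model and weighting parameters jointly
# opt = torch.optim.Adam(list(model.parameters()) + list(weighting.parameters()))
```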

9.3 Neural Architecture Search

In the previous sections, we saw manually engineered modifications of existing CNN architectures, as proposed by ResNet [HZRS16] or Inception [SLJ+15]. They are the result of human design and have shown their ability to improve performance: ResNet introduces a skip connection in its building blocks, and Inception makes use of its specific inception module. In both cases, the intervention of an expert is crucial. The approach of neural architecture search (NAS) aims to automate this time-consuming manual design of neural network architectures.

NAS is closely related to hyperparameter optimization (HO), which is described in Sect. 3, Hyperparameter Optimization. Originally, both tasks were solved simultaneously; the kernel sizes or the number of filters were simply treated as additional hyperparameters. Nowadays, the distinction between HO and NAS should be stressed: the concatenation of complex building blocks or modules cannot be accurately described by single parameters, so this simplification is no longer suitable.

To describe the NAS process, the authors of [EMH19b] define three steps: (1) definition of search space, (2) search strategy, and (3) performance estimation strategy.

The majority of search strategies take advantage of the NASNet search space [ZVSL18], which arranges various operations, e.g., convolution and pooling, within a single cell. However, other spaces based on a chain or multi-branch structure are possible [EMH19b]. The search strategies comprise advanced methods from sequential model-based optimization (SMBO) [LZN+18], Bayesian optimization [KNS+18], evolutionary algorithms [RAHL19, EMH19a], reinforcement learning [ZVSL18, PGZ+18], and gradient descent [LSY19, SDW+19]. Finally, the performance estimation strategy describes approximation techniques that are needed due to the impracticability of multiple full evaluation runs. For a comprehensive survey of the NAS process, we refer to [EMH19b]. Recent research has shown that reinforcement learning approaches such as NASNet-A [ZVSL18] and ENAS [PGZ+18] are partly outperformed by evolutionary algorithms, e.g., AmoebaNet [RAHL19], and gradient-based approaches, e.g., DARTS [LSY19].
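
To make the three NAS steps concrete, the following toy sketch uses a tiny cell-based search space, plain random search as the search strategy, and a placeholder performance estimator. It does not correspond to any of the cited algorithms; the search space, the dummy scoring, and all identifiers are our own illustrative assumptions.

```python
import random

# toy cell-based search space: each cell chooses an operation and a kernel size
SEARCH_SPACE = {
    "op":     ["conv", "sep_conv", "max_pool", "skip"],
    "kernel": [3, 5],
    "cells":  [4, 6, 8],
}

def sample_architecture():
    """Search strategy (here: plain random search) over the toy search space."""
    n_cells = random.choice(SEARCH_SPACE["cells"])
    return [{"op": random.choice(SEARCH_SPACE["op"]),
             "kernel": random.choice(SEARCH_SPACE["kernel"])} for _ in range(n_cells)]

def estimate_performance(arch):
    """Placeholder performance estimation; in practice, proxies such as training
    for a few epochs on a data subset or weight sharing are used instead."""
    return -len(arch) + random.random()   # dummy score, NOT a real estimate

best_arch = max((sample_architecture() for _ in range(100)), key=estimate_performance)
```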

Each of these approaches focuses on different optimization aspects. Gradient-based methods are applied to a continuous search space and offer faster optimization. In contrast, the evolutionary approach LEMONADE [EMH19a] enables multi-objective optimization by considering the conjunction of resource consumption and performance as the two main objectives. Furthermore, single-path NAS [SDW+19] extends the multi-path approach of former gradient-based methods and proposes the integration of “over-parameterized superkernels”, which significantly reduces memory consumption.

The focus of NAS is the optimized combination of human-predefined CNN elements with respect to objectives such as resource consumption and performance. NAS offers automation; however, the realization of these objectives remains strongly limited by the potential of the underlying CNN elements.

10 Model Compression

Recent developments in CNNs have resulted in neural networks being the state of the art in computer vision tasks like image classification [KSH12, HZRS15, MGR+18], object detection [Gir15, RDGF15, HGDG17], and semantic segmentation [CPSA17, ZSR+19, LBS+19, WSC+20]. This is largely due to the increasing availability of hardware computational power and an increasing amount of training data. We also observe a general upward trend in the complexity of the neural networks along with their improvement in state-of-the-art performance. These CNNs are largely trained on back-end servers with significantly higher computing capabilities. The use of these CNNs in real-time applications is inhibited by restrictions on hardware, model size, inference time, and energy consumption. This led to the emergence of a new field in machine learning, commonly termed model compression. Model compression refers to reducing the memory requirements, inference time, and model size of DNNs to eventually enable the use of neural networks on edge devices. This is tackled by different approaches, such as network pruning (identifying weights or filters that are not critical for network performance), weight quantization (reducing the precision of the weights used in the network), knowledge distillation (a smaller network is trained with the knowledge gained by a bigger network), and low-rank factorization (decomposing a tensor into multiple smaller tensors). In this section, we introduce some of these methods for model compression and briefly discuss the current open challenges and possible research directions with respect to their use in automated driving applications.

In Chapter “Joint Optimization for DNN Model Compression and Corruption Robustness” [VHB+22], the concepts of both pruning and quantization are elucidated for the application of semantic segmentation in autonomous driving, while at the same time the model is robustified against corruptions in the input data. Improved performance and robustness are demonstrated for a state-of-the-art semantic segmentation model.

10.1 Pruning

Pruning has been used as a systematic tool to reduce the complexity of deep neural networks. Redundancy in DNNs may exist on various levels, such as individual weights, filters, and even layers, and the different pruning methods try to take advantage of these available redundancies. Two of the initial approaches proposed weight pruning in the 1990s as a way of systematically “damaging” neural networks [CDS90, Ree93]. As these weight pruning approaches do not aim at changing the structure of the neural network, they are called unstructured pruning. Although there is a reduction in the size of the network when it is saved in a sparse format, the acceleration depends on the availability of hardware that facilitates sparse multiplications. As pruning filters and complete layers aims at exploiting the available redundancy in the architecture or structure of neural networks, these pruning approaches are called structured pruning. Pruning approaches can also be broadly classified into data-dependent and data-independent methods. Data-dependent methods [LLS+17, LWL17, HZS17] make use of the training data to identify filters to prune. Theis et al. [TKTH18] and Molchanov et al. [MTK+17] propose a greedy pruning strategy that assesses the importance of feature maps one at a time and measures the effect of removing the corresponding filters on the training loss; filters whose feature maps have the least effect on the training loss are removed from the network. In data-independent methods [LKD+17, HKD+18, YLLW18, ZQF+18], the selection of CNN filters to be pruned is based on the statistics of the filter values. Li et al. [LKD+17] propose a straightforward method to rank the filters of a CNN: the selection is based on the \(L_1\)-norm, and the filters with the lowest norm are pruned away. He et al. [HZS17] employ a least absolute shrinkage and selection operator (LASSO) regression-based selection of filters to minimize the least squares reconstruction error.
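
A sketch of data-independent \(L_1\)-norm filter ranking in the spirit of [LKD+17] is given below. Only the selection step is shown; actually removing the filters, adapting the subsequent layer, and fine-tuning are omitted for brevity, and the pruning ratio is an illustrative choice.

```python
import torch
import torch.nn as nn

def l1_filter_ranking(conv: nn.Conv2d, prune_ratio: float = 0.3):
    """Return the indices of the filters with the smallest L1 norm (prune candidates)."""
    # conv.weight has shape (out_channels, in_channels, kH, kW)
    l1_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    num_prune = int(prune_ratio * conv.out_channels)
    return torch.argsort(l1_norms)[:num_prune]          # lowest-norm filters first

# usage sketch: zero out (or physically remove) the selected filters, then fine-tune
# idx = l1_filter_ranking(conv); conv.weight.data[idx] = 0
```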

Although the abovementioned approaches demonstrate that a neural network can be compressed without affecting the accuracy, the effect on robustness is largely unstudied. Dhillon et al. [DAL+18] propose pruning a subset of activations and scaling up the survivors to show improved adversarial robustness of a network. Lin et al. [LGH19] quantize the precision of the weights after controlling the Lipschitz constant of the layers, which restricts the error propagation of adversarial perturbations within the neural network. Ye et al. [YXL+19] evaluate the relationship between adversarial robustness and model compression in detail and show that naive compression has a negative effect on robustness. Gui et al. [GWY+19] co-optimize robustness and compression constraints during the training phase and demonstrate an improvement in robustness along with a reduction in model size. However, these approaches have mostly been tested on image classification tasks and on smaller datasets only. Their effectiveness on safety-relevant automated driving tasks, such as object detection and semantic segmentation, has not been studied and remains an open research challenge.

10.2 Quantization

Quantization of a random variable x with probability density function p(x) is the process of dividing the range of x into intervals, each of which is represented by a single value (also called the reconstruction value), such that the following reconstruction error is minimized:

$$\begin{aligned} \sum _{i=1}^{I} \int _{b_i}^{b_{i+1}} (q_i - x)^2 \, p(x) \, dx, \end{aligned}$$ (9)

where \(b_i\) is the left-side border of the i-th interval, \(q_i\) is its reconstruction value, and I is the number of intervals, e.g., \(I = 8\) for a 3-bit quantization. This definition can be extended to multiple dimensions as well.

Quantization of neural networks has been around since the 1990s [Guo18], however, with an early focus on improving the hardware implementations of these networks. In the deep learning literature, a remarkable application of quantization combined with unstructured pruning can be found in the deep compression approach [HMD16], where one-dimensional k-means is utilized to cluster the weights per layer and thus find the I cluster centers (the \(q_i\) values in (9)) iteratively. This procedure conforms to an implicit assumption that p(x) has the same spread inside all clusters. Deep compression can reduce the network size needed for image classification by a factor of 35 for AlexNet and a factor of 49 for VGG-16 without any loss in accuracy. However, as pointed out in [JKC+18], these networks from the early deep learning days are over-parameterized, and a less impressive compression factor is thus expected when the same technique is applied to lightweight architectures, such as MobileNet and SqueezeNet. For instance, considering SqueezeNet (50 times smaller than AlexNet), the compression factor of deep compression without accuracy loss drops to about 10.
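
A sketch of the per-layer weight clustering step is shown below, using one-dimensional k-means (here via scikit-learn for brevity). The pruning, codebook fine-tuning, and Huffman coding steps of the full deep compression pipeline [HMD16] are omitted, and the bit width is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_layer_weights(weights: np.ndarray, bits: int = 3):
    """Quantize one layer's weights to I = 2**bits shared values (cluster centers)."""
    n_clusters = 2 ** bits                               # I reconstruction values q_i
    flat = weights.reshape(-1, 1)                        # one-dimensional clustering
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    codebook = km.cluster_centers_.ravel()               # the q_i of Eq. (9)
    assignments = km.labels_.reshape(weights.shape)      # per-weight cluster index
    quantized = codebook[assignments]                    # reconstructed (dequantized) weights
    return quantized, codebook, assignments
```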

Compared to the scalar quantization used in deep compression, there have been attempts to exploit structural information by applying variants of vector quantization to the weights [GLYB14, CEL20, SJG+20]. Remarkably, in the latter (i.e., [SJG+20]), the reconstruction error of the activations (instead of the weights) is minimized in order to find an optimal codebook for the weights, as the ultimate goal of quantization is to approximate the network's output, not the network itself. This is performed in a layer-by-layer fashion (to prevent error accumulation) using activations generated from unlabeled data.

Other techniques [MDSN17, JKC+18] apply variants of so-called “linear” quantization, i.e., the quantization staircase has a fixed interval size. This paradigm conforms to an implicit assumption that p(x) in (9) is uniform and is thus also called uniform quantization. Uniform quantization is widely applied both in specialized software packages, such as the Texas Instruments Deep Learning Library (automotive boards) [MDS+18], and in general-purpose libraries, such as TensorFlow Lite. The linearity assumption enables practical implementations, as quantization and dequantization can be implemented using a scaling factor and an intercept, whereas no codebook needs to be stored. In many situations, the intercept can be omitted by employing a symmetric quantization mapping. Moreover, for power-of-2 ranges, the scaling ends up being a bit-wise shift operator, where quantization and dequantization differ only in the shift direction. It is also straightforward to apply this scheme dynamically, i.e., for each tensor separately using a tensor-specific multiplicative factor. This can easily be applied not only to filters (weight tensors) but also to activation tensors (see, for instance, [MDSN17]).
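
A sketch of symmetric linear (uniform) quantization with a tensor-specific scale factor is given below. This is our simplified illustration; production libraries additionally handle zero-points, per-channel scales, and true integer kernels.

```python
import torch

def linear_quantize(x: torch.Tensor, num_bits: int = 8):
    """Symmetric uniform quantization: map x to integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax                  # tensor-specific scale, no intercept needed
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q.to(torch.int8), scale                # int8 storage is valid for num_bits <= 8

def linear_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximation of the original tensor from integers and scale."""
    return q.float() * scale
```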

Unless the scale factor of the linear quantization is assumed constant by construction, it is computed based on the statistics of the relevant tensor and can thus be sensitive to outliers, which is known to result in low quantization precision. In order to mitigate this issue, the original range can be clipped and thus reduced to the most relevant part of the signal. Several approaches are proposed in the literature for finding an optimal clipping threshold: a simple percentile analysis of the original range (e.g., clipping 2% of the largest-magnitude values), minimizing the mean squared error between the quantized and original range in the spirit of (9) [BNS19], or minimizing the Kullback–Leibler divergence between the original and the quantized distributions [Mig17]. While the clipping methods trade off large quantization errors of outliers against small errors of inliers [WJZ+20], other methods tackle the outlier problem using a different trade-off, see, for instance, the outlier channel splitting approach in [ZHD+19].
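
As a small companion to the linear quantizer sketched above, the following computes the scale from a percentile-clipped range (e.g., discarding the 2% largest-magnitude values). The percentile value and function name are illustrative assumptions.

```python
import torch

def clipped_scale(x: torch.Tensor, num_bits: int = 8, clip_percentile: float = 98.0):
    """Compute a linear-quantization scale from a clipped range to reduce outlier influence."""
    qmax = 2 ** (num_bits - 1) - 1
    threshold = torch.quantile(x.abs().flatten(), clip_percentile / 100.0)
    return threshold / qmax        # values beyond the threshold saturate after clamping
```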

An essential point to consider when deciding on a quantization approach for a given problem is the allowed or intended interaction with the training procedure. The so-called post-training quantization, i.e., quantization of a pre-trained network, is attractive from a practical point of view: no access to training data is required, and the quantization and training toolsets can be independent of each other. On the other hand, training-aware quantization methods often yield higher inference accuracy and can shorten training times; the latter is a serious concern for large, complicated models, which may need weeks to train on modern GPU clusters. Training-aware quantization can be implemented by inserting fake quantization operators into the computational graph of the forward pass during training (simulated quantization), whereas the backward pass is done as usual in floating-point resolution [JKC+18]. Other approaches [ZWN+18, ZGY+19] go a step further by quantizing the gradients as well. This leads to a much lower training time, as the time of the often computationally expensive backward pass is reduced. The quantization of gradients, however, is not directly applicable, as it requires the derivative of the quantization function (staircase-like), which is zero almost everywhere. Luckily, this issue can be handled by employing a straight-through estimator [BLC13] (approximating the quantization function by an identity mapping). Other techniques to mitigate this problem have also been proposed recently [UMY+20, LM19].
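
A sketch of a fake-quantization operator with a straight-through estimator is given below. It is illustrative only; quantization-aware training frameworks provide their own, more elaborate operators, and the scale computation here reuses the simplified symmetric scheme from above.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize in the forward pass, pass gradients straight through in the backward pass."""

    @staticmethod
    def forward(ctx, x, num_bits: int = 8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max() / qmax
        # simulated quantization: round to the grid, then immediately dequantize
        return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # straight-through estimator: treat the staircase as an identity mapping
        return grad_output, None

# usage sketch inside a layer's forward pass: w_q = FakeQuantSTE.apply(self.weight)
```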

11 Discussion

We have presented an extensive overview of approaches to effectively handle the safety concerns accompanying deep learning: lack of generalization, robustness, explainability, plausibility, and efficiency. For each of the individual topics and the categories into which the presented methods fall, we have described which lines of research we deem prevalent, important, and promising.

The reviewed methods alone will not provide safe ML systems as such, and neither will their future extensions. This is due to the limitations of quantifying complex real-world contexts. A complete and plausible safety argumentation will thus require more than advances in methodology and in the theoretical understanding of neural network properties and training processes. Apart from methodological progress, it will be necessary to gain practical experience in using the presented methods, to gather evidence of overall safe behavior, to use this evidence to construct a tight safety argument, and to test its validity in various situations.

In particular, every autonomously acting robotic system with state-of-the-art deep-learning-based perception and non-negligible actuation may serve as an object of study and is, in fact, in need of this kind of systematic reasoning before being transferred to widespread use or even market entry. We strongly believe that novel scientific insights, the potential market volume, and public interest will drive the arrival of reliable and trustworthy AI technology.