The Fishyscapes Benchmark: Measuring Blind Spots in Semantic Segmentation

Deep learning has enabled impressive progress in the accuracy of semantic segmentation. Yet, the ability to estimate uncertainty and detect failure is key for safety-critical applications like autonomous driving. Existing uncertainty estimates have mostly been evaluated on simple tasks, and it is unclear whether these methods generalize to more complex scenarios. We present Fishyscapes, the first public benchmark for uncertainty estimation in a real-world task of semantic segmentation for urban driving. It evaluates pixel-wise uncertainty estimates towards the detection of anomalous objects in front of the vehicle. We adapt state-of-the-art methods to recent semantic segmentation models and compare approaches based on softmax confidence, Bayesian learning, and embedding density. Our results show that anomaly detection is far from solved even for ordinary situations, while our benchmark allows measuring advancements beyond the state-of-the-art.


Introduction
Deep learning has had a high impact on the precision of computer vision methods [9,30,21,63] and enabled semantic understanding in robotic applications [47,19,40]. However, while these algorithms are usually compared on closed-world datasets with a fixed set of classes [24,11], the real world is uncontrollable, and an incorrect reaction by an autonomous agent to an unexpected input can have disastrous consequences [6].
As such, to reach full autonomy while ensuring safety and reliability, decision-making systems need information about outliers and uncertain or ambiguous cases that might affect the quality of the perception output. As illustrated in Figure 1, deep convolutional neural networks (CNNs) react unpredictably to inputs that deviate from their training distribution. In the presence of outlier objects, the prediction is interpolated from the available classes with high confidence.

1 Autonomous Systems Lab, ETH Zürich; 2 Visual Geometry Group, ETH Zürich; 3 Microsoft Research. Correspondence: blumh@ethz.ch. This work was partially supported by the HILTI Group.

Figure 1 (panels: Input, Prediction, MC Dropout, Learned Embedding Density). When exposed to an object type unseen during training, a state-of-the-art semantic segmentation model [9] predicts familiar labels (streetsign, road) with high confidence. To detect such failures, we evaluate various methods that assign a pixel-wise out-of-distribution score, where higher values are darker. The blue outline is added for illustration.
Existing research to detect such behaviour is often labeled as out-of-distribution (OoD), anomaly, or novelty detection, and has so far focused on developing methods for image classification, evaluated on simple datasets like MNIST or CIFAR-10 [44,51,32,39,57,26,10,58,52]. How these methods generalize to more elaborate network architectures and pixel-wise uncertainty estimation has not been assessed in prior work. Motivated by these practical needs, we introduce 'Fishyscapes', a benchmark that evaluates uncertainty estimates for semantic segmentation. The benchmark measures how well methods detect potentially hazardous anomalies in driving scenes. Fishyscapes is based on data from Cityscapes [11], a popular benchmark for semantic segmentation in urban driving. Our benchmark consists of (i) Fishyscapes Web, where images from Cityscapes are overlaid with objects that are regularly crawled from the web in an open-world setup, and (ii) Fishyscapes Lost & Found, which builds on a road hazard dataset collected with the same setup as Cityscapes [53] and which we supplemented with labels.
To provide a broad overview, we adapt a variety of methods to semantic segmentation that were originally designed for image classification. Because segmentation networks are much more complex and have high computational costs, this adaptation is not trivial, and we suggest different approximations to overcome these challenges.
Our experiments show that the embeddings of intermediate layers hold important information for anomaly detection. Based on recent work on generative models, we develop a novel method using density estimation in the embedding space. However, we also show that varying visual appearance can mislead feature-based and other methods. None of the evaluated methods achieves the accuracy required for safety-critical applications. We conclude that these remain open problems, with our benchmark enabling the community to measure progress and build upon the best performing methods so far.
To summarize, our contributions are the following:
- We introduce the first public benchmark evaluating pixel-wise uncertainty estimates in semantic segmentation, with a dynamic, self-updating dataset for anomaly detection.
- We report an extensive evaluation with diverse state-of-the-art approaches to uncertainty estimation, adapted to the semantic segmentation task, and present a novel method for anomaly detection.
- We show a clear gap between the alleged capabilities of established methods and their performance on this real-world task, thereby confirming the necessity of our benchmark to support further research in this direction.

Related Work
Here we review the most relevant works in semantic segmentation and their benchmarks, and methods that aim at providing a confidence estimate of the output of deep networks.

Semantic Segmentation
State-of-the-art models are fully-convolutional deep networks trained with pixel-wise supervision. Most works [56,3,8,9] adopt an encoder-decoder architecture that initially reduces the spatial resolution of the feature maps, and subsequently upsamples them with learned transposed convolution, fixed bilinear interpolation, or unpooling. Additionally, dilated convolutions or spatial pyramid pooling enlarge the receptive field and improve the accuracy.
Popular benchmarks compare methods on the segmentation of objects [18] and urban scenes. In the latter case, Cityscapes [11] is a well-established dataset depicting street scenes in European cities with dense annotations for a limited set of classes. Efforts have been made to provide datasets with increased diversity, either in terms of environments, with WildDash [69], which incorporates data from numerous parts of the world, or with Mapillary [50], which adds many more classes. Recent data releases add multi-sensor and multi-modality recordings on top of that [64,25,7]. Like ours, some datasets are explicitly derived from Cityscapes, the most relevant being Foggy Cityscapes [61], which overlays synthetic fog onto the original dataset to evaluate more difficult driving conditions. The Robust Vision Challenge also assesses generalization of learned models across different datasets.
Robustness and reliability are only evaluated by these benchmarks through ranking methods according to their accuracy, without taking into account the uncertainty of their predictions. Additionally, despite the fact that one cannot assume that models trained with closed-world data will only encounter known classes, these scenarios are rarely quantitatively evaluated. To our knowledge, WildDash [69] is the only public benchmark that explicitly reports uncertainty w.r.t. OoD examples. However, as WildDash mainly focuses on accuracy, these examples are drawn from a very limited set of full-image outliers, whereas we introduce a diverse set of objects. Complementarily, the Dark Zurich dataset [62] allows for uncertainty-aware evaluation of semantic segmentation models with regard to degraded sensor inputs, i.e. evaluating aleatoric uncertainty.
Bevandic et al. [5] experiment with OoD objects for semantic segmentation by overlaying objects on Cityscapes images in a manner similar to ours. They however assume the availability of a large OoD dataset, which is not realistic in an open-world context, and thus mostly evaluate supervised methods. In contrast, we assess a wide range of methods that do not require OoD data. Mukhoti & Gal [48] introduce a new metric for uncertainty evaluation and are the first to quantitatively assess misclassification for segmentation. Yet they only compare a few methods on normal in-distribution (ID) data. The MVTec benchmark [4] compares a range of anomaly segmentation methods on images of single objects to find industrial production anomalies. It mostly compares methods that focus on low-power computing. Following our work, the CAOS benchmark [31] also compares anomaly segmentation methods in simulated and real-world driving scenes. While their results confirm our finding that most established methods scale poorly to semantic segmentation, their methodology lacks open-world testing, which we argue later is important for true anomaly detection.

Uncertainty estimation
There is a large body of work that aims at detecting OoD data or misclassification by defining uncertainty or confidence estimates.
Probabilistic modeling of a neural network's output is a straightforward approach in uncertainty estimation. The softmax score, i.e. the classification probability of the predicted class, was shown to be a first baseline [32], although sensitive to adversarial examples [28]. Its performance was improved by ODIN [41], which applies noise to the input with the Fast Gradient Sign Method (FGSM) [28] and calibrates the score with temperature scaling [29]. Probabilistic modelling has been extended further in Deep Belief Networks that propagate activation distributions throughout the network [20,43].
Bayesian deep learning [22,35] adopts a probabilistic view by designing deep models whose outputs and weights are probability distributions instead of point estimates. Uncertainties are then defined as dispersions of such distributions, and can be of several types. Epistemic uncertainty, or model uncertainty, corresponds to the uncertainty over the model parameters that best fit the training data for a given model architecture. As evaluating the posterior over the weights is intractable in deep non-linear networks, recent works perform Monte-Carlo (MC) sampling with dropout [23] or ensembles [37]. Aleatoric uncertainty, or data uncertainty, arises from the noise in the input data, such as sensor noise. Both have been applied to semantic segmentation [35], and subsequently evaluated for misclassification detection [48], but only on ID data and not for OoD detection. Malinin & Gales [44] later single out distributional uncertainty to represent model misspecification with respect to OoD inputs. Their approach however was only applied to image classification on toy datasets, and requires OoD data during the training stage. To address the latter constraint, Lee et al. [38] earlier proposed a Generative Adversarial Network (GAN) that generates OoD data as boundary samples. This is however very challenging to scale to complex and high-dimensional data like high-resolution images of urban scenes. Recently, Bayesian methods investigated the inductive bias of network structures beyond weights [67]. For example, [2] extracts meaningful uncertainties from an 'ensemble' of network activations at varying depth, and [68] employs a sampling scheme for architectures.
OoD and novelty detection is often tackled by non-Bayesian approaches. As such, feature introspection amounts to measuring discrepancies between distributions of deep features of training data and OoD samples, using either nearest neighbour (NN) statistics [51,46] or Gaussian approximations [39,65]. These methods have the benefit of working on any classification model without requiring specific training. Recently, connections between feature density and Bayesian uncertainties have been investigated [54]. On the other hand, approaches specifically tailored to perform OoD detection include one-class classification [57,26], which aims at creating discriminative embeddings, density estimation [10,49], which estimates the likelihood of samples w.r.t. the true data distribution, and generative reconstruction [58,52,27], which uses the quality of auto-encoder reconstructions to discriminate OoD samples. Richter et al. [55] apply the latter to simple real images recorded by a robotic car and successfully detect new environments.

Benchmark Design
Because it is not possible to produce ground truth for uncertainty values, evaluating estimators is not a straightforward task. We thus compare them on the proxy classification task [32] of detecting anomalous inputs. The uncertainty estimates are seen as scores of a binary classifier that compares the score against a threshold and whose performance reflects the suitability of the estimated uncertainty for anomaly detection.
Such an approach however introduces a major issue for the design of a public OoD detection benchmark. With publicly available ID training data A and OoD inputs B, it is not possible to distinguish between an uncertainty method that informs a classifier to discriminate A from any other input, and a classifier trained to discriminate A from B. The latter option clearly does not represent progress towards the goal of general uncertainty estimation, but rather overfitting.
To this end, we (i) only release a small validation set with associated ground truth masks, while keeping larger test sets hidden, (ii) continuously evaluate submitted methods against a dynamically changing, synthetic dataset, and (iii) compare the performance on the dynamic dataset with evaluations on real-world data. Additionally, all submissions to the benchmark must indicate whether any OoD data was used during training, which is cross-checked with linked publications.
Examples from all benchmark datasets are shown in Figure 2.

Does the method work in an open world?
The open world scenario describes the problem that an autonomous agent who is freely interacting with the world has to be able to deal with the unexpected at all times. To test perception methods in an open world scenario, a benchmark therefore needs to present truly unexpected inputs. We argue that this is never truly possible with a fixed dataset that by design has limited diversity, and over time may simply identify those methods that deal best with the kind of objects included in the dataset. Instead, we propose a dynamically changing dataset that samples diverse objects at every iteration.
In general, there are three options to generate such dynamic datasets: At every iteration, one may (i) capture new data in the wild and annotate, (ii) render new objects in simulation, or (iii) capture new objects in the wild, but blend them into already annotated scenes. While data from the wild is essential to test methods in realistic settings, annotation for semantic segmentation is very expensive and not a sustainable way to generate new datasets multiple times per year. Between (ii) and (iii) there is an essential trade-off. Rendering in 3D ensures physically viable object placement and consistent lighting. Images of diverse objects in the wild are much more readily available than textured 3D models and can be blended into real-world scenes. We acknowledge that there is an ongoing debate whether photorealistic rendering engines or modern blending techniques achieve more realistic images, which was touched upon by a response work to this benchmark [31]. In this work, we decided to base our dataset FS Web on approach (iii). In the following, we describe a blending-based reference dataset FS Static and the dynamically changing dataset FS Web.

Figure 2 (caption, partially recovered). The ground truth contains labels for ID (blue) and OoD (red) pixels, as well as ignored void pixels (black). We additionally show the output of the best method per dataset in column 4 and the best method without OoD training in the last column. We report the AP of each method output in its top right corner.
FS Static is based on the validation set of Cityscapes [11]. It has a limited visual diversity, which is important to make sure that it contains none of the overlaid objects. In addition, background pixels originally belonging to the void class (see footnote 2) are excluded from the evaluation, as they may be borderline OoD. Anomalous objects are extracted from the generic Pascal VOC [18] dataset using the associated segmentation masks. We only overlay objects from classes that cannot be found in Cityscapes: aeroplane, bird, boat, bottle, cat, chair, cow, dog, horse, sheep, sofa, tvmonitor. Objects cropped by the image borders or objects that are too small to be seen are filtered out. We randomly size and position the objects on the underlying image, making sure that none of the objects appear on the ego-vehicle. Objects from mammal classes have a higher probability of appearing in the lower half of the image, while classes like birds or airplanes have a higher probability for the upper half. The placement is not constrained further, to ensure that each pixel in the image, apart from the ego-vehicle, is comparably likely to be anomalous. To match the image characteristics of Cityscapes, we employ a series of postprocessing steps similar to those described in [1], without those steps that require 3D models of the objects to e.g. adapt shadows and lighting. To make the task of anomaly detection harder, we add synthetic fog [60,12] on the in-distribution pixels with a per-image probability. This prevents fraudulent methods from comparing the input against a fixed set of Cityscapes images.

Footnote 2: void in Cityscapes is defined as: forms of horizontal ground-level structures that do not match any class, things that might not be there anymore the next day/hour/minute (e.g. movable trash bin, buggy, bag, wheelchair, animal), clutter in the background that is not distinguishable, or any objects that do not match a class (e.g. visible parts of the ego vehicle, mountains, street lights, back side of signs).
The dataset is split into a minimal public validation set of 30 images and a hidden test set of 1000 images. It contains in total around 4.5e7 OoD and 1.8e9 ID pixels. The validation set only contains a small disjoint set of Pascal VOC objects to prevent few-shot learning on our data creation method.
FS Web is built similarly to FS Static, but with overlay objects crawled from the internet using a changing list of keywords. Our script searches for images with transparent background, uploaded in a recent timeframe, and filters out images that are too small. The only manual process is filtering out images that are not suitable, e.g. with decorative borders or watermarks. The dataset for March 2019 contains 4.9e7 OoD and 1.8e9 ID pixels. As the diversity of images and color distributions for the images from the web is much greater than those from Pascal VOC, we also adapt our overlay procedure. In total, we follow these steps, some of which were added from June 2019 onwards (marked with *):
- in case the image does not already have a smooth alpha channel, smooth the mask of the objects around the borders for a small transparency gradient
- adapt the brightness of the object towards the mean brightness of the overlaid pixels
- apply the inverse color histogram of the Cityscapes image to shift the color distribution towards the one found on the underlying image*
- radial motion blur*
- depth blur based on the position in the image*
- color noise
- glow effects to simulate overexposure*
Figure 3 shows an illustration of the blending results.
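The first two overlay steps (alpha transparency and brightness adaptation) can be illustrated with a minimal sketch. It operates on flattened grayscale pixel lists for brevity, and it is not the benchmark's actual pipeline: the function names and the `strength` parameter are hypothetical, and the remaining steps (histogram shift, blur, noise, glow) are omitted.

```python
def adapt_brightness(obj, background, strength=0.5):
    """Shift object pixels towards the mean brightness of the covered background.

    `strength` is a hypothetical parameter: 0 leaves the object unchanged,
    1 matches the background mean exactly.
    """
    obj_mean = sum(obj) / len(obj)
    bg_mean = sum(background) / len(background)
    shift = strength * (bg_mean - obj_mean)
    # Clamp to the valid 8-bit intensity range.
    return [min(255.0, max(0.0, v + shift)) for v in obj]


def alpha_blend(obj, background, alpha):
    """Per-pixel alpha compositing: alpha=1 shows the object, alpha=0 the background."""
    return [a * o + (1.0 - a) * b for o, b, a in zip(obj, background, alpha)]


obj = [200.0, 200.0]
bg = [100.0, 100.0]
adapted = adapt_brightness(obj, bg, strength=0.5)
# The second pixel lies on the smoothed object border, hence alpha 0.5.
blended = alpha_blend(adapted, bg, alpha=[1.0, 0.5])
```

A real implementation would apply the same operations per color channel on image arrays.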
As discussed, the blending process is part of a trade-off to make an open-world dataset feasible. To further ensure that methods do not overfit to any artifacts created by the blending process, but detect anomalies based on their semantics and appearance, we include a sample of ID objects in the blending dataset. For this, we create a database from objects in the Cityscapes training dataset (car, person, truck, bus, train, bike) where we manually filter out any occluded instances. We then decide at random for every image whether to blend an anomalous object or a Cityscapes object, where we skip random placement and histogram adaptation for the latter. This addition was introduced in FS Web Jan 2020. An example can be seen in Figure 2.
As indicated, the postprocessing was improved between iterations of the dataset. Because the purpose of the FS Web dataset is to measure any possible overfitting of the methods through a dynamically changing dataset, we will also continue to refine this image overlay procedure, updating our method with recent research results. Any update to the blending is also applied to the FS Static validation set, allowing submissions to validate the effect of blending improvements.

Does the method work on real images?
As discussed in Section 3.1, capturing and annotating driving scenes multiple times per year is not sustainable, which made it necessary to use synthetic data generation for the dynamic dataset. However, for safe deployment it is equally important to test methods under real-world conditions. This is the purpose of the FS Lost & Found dataset in our benchmark, which is based on the Lost & Found dataset [53]. However, the original dataset only includes annotations for the anomalous objects and a coarse annotation of the road. It does not allow for appropriate evaluation of anomaly detection, as objects and road are very distinct in texture and it is more challenging to evaluate the anomaly score of the objects compared to e.g. building structures. In order to make use of the full image, we add pixel-wise annotations that distinguish between objects (the anomalies), background (classes contained in Cityscapes) and void (anything not contained in Cityscapes classes that still appears in the training images). Additionally, we filter out those sequences where the 'road hazards' are children or bikes, because these are part of regular Cityscapes data and not anomalies. We subsample the repetitive sequences, labelling at least every sixth image, and remove images that do not contain objects. In total, we present a public validation set of 100 images and a test set of 275 images, based on disjoint sets of locations. While the Lost & Found images were captured with the same setup as Cityscapes, the distribution of street scenery is very different. The images were captured in small streets of housing areas, industrial areas, or on big parking lots. The anomalous objects are usually very small and are not equally distributed on the image. Nevertheless, the dataset allows testing on real images as opposed to synthetic data, therefore preventing any overfitting on synthetic image processing. This is especially important for parameter tuning on the validation set.

Metrics
We consider metrics associated with a binary classification task. Since the ID and OoD data is unbalanced, metrics based on the receiver operating characteristic (ROC) curve are not suitable [59]. We therefore base the ranking and primary evaluation on the average precision (AP). However, as the number of false positives in high-recall areas is particularly relevant for safety-critical applications, we additionally report the false positive rate at 95% recall (FPR95). This metric was also used in [32] and emphasizes safety.
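As a minimal sketch of the two metrics, the following computes AP and the false positive rate at 95% recall from per-pixel anomaly scores and binary labels (1 = OoD, 0 = ID). It assumes that higher scores mean "more anomalous", that both classes are present, and it ignores score ties; the function names are illustrative and this is not the benchmark's evaluation code.

```python
def _ranking(scores):
    """Indices sorted by decreasing anomaly score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def average_precision(scores, labels):
    """Mean of the precision values at every rank where an OoD pixel is retrieved."""
    n_pos = sum(labels)
    true_pos = 0
    ap = 0.0
    for rank, i in enumerate(_ranking(scores), start=1):
        if labels[i] == 1:
            true_pos += 1
            ap += true_pos / rank
    return ap / n_pos


def fpr_at_recall(scores, labels, recall=0.95):
    """False positive rate at the strictest threshold reaching the given recall."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    true_pos = false_pos = 0
    for i in _ranking(scores):
        if labels[i] == 1:
            true_pos += 1
        else:
            false_pos += 1
        if true_pos / n_pos >= recall:
            return false_pos / n_neg
    return 1.0


scores = [0.9, 0.8, 0.7, 0.6, 0.2]
labels = [1, 0, 1, 0, 0]
ap = average_precision(scores, labels)     # (1/1 + 2/3) / 2
fpr95 = fpr_at_recall(scores, labels)      # 1 of 3 ID pixels is a false positive
```

In this toy example, one ID pixel must be flagged before both OoD pixels are found, which is what FPR95 penalizes.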
Semantic classification is not the goal of our benchmark, but uncertainty estimation and outlier detection should not come at high cost of segmentation accuracy. We therefore additionally report the mean intersection over union (IoU) of the semantic segmentation on the Cityscapes validation set.
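For reference, mean IoU can be sketched as follows on flattened label lists. The `ignore_label` value of 255 and the choice to skip classes that never occur are assumptions of this sketch, not details given in the text.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean intersection over union across classes that occur in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled pixels do not count
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both involved classes.
            union[p] += 1
            union[t] += 1
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1]
target = [0, 1, 1, 255]
miou = mean_iou(pred, target, num_classes=2)
```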
For safety-critical systems, it is not only important to detect anomalies, but also to be fast enough to allow for a reaction. We therefore report the inference time of joint segmentation and anomaly detection per single frame. Times are measured over 500 images of the Cityscapes validation set on a GeForce 1080 Ti GPU.

Evaluated Methods
We now present the methods that are evaluated in Fishyscapes. In a first part, we describe the existing baselines and how we adapted them to the task of semantic segmentation. We then propose a novel method based on learned embedding density. Finally, we list those methods that were submitted to the public benchmark so far. All approaches are applied to the state-of-the-art semantic segmentation model DeepLab-v3+ [9]. Further implementation details are listed in the supplementary material.

Baselines
Softmax. The maximum softmax probability is a commonly used baseline and was evaluated in [32] for OoD detection. We apply the metric pixel-wise and additionally measure the softmax entropy, as proposed by [38], which captures more information from the softmax.
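Both softmax-based scores can be sketched for the logits of a single pixel as below. Using one minus the maximum probability as the anomaly score is a convention assumed here so that higher values indicate higher uncertainty; a real implementation would vectorize this over the whole logit map.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the class logits of one pixel."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_score(logits):
    """Anomaly score from the maximum softmax probability (higher = more uncertain)."""
    return 1.0 - max(softmax(logits))


def softmax_entropy(logits):
    """Entropy of the softmax distribution, which uses all class probabilities."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


confident = [10.0, 0.0, 0.0, 0.0]
ambiguous = [0.0, 0.0, 0.0, 0.0]
print(max_softmax_score(confident), softmax_entropy(confident))
print(max_softmax_score(ambiguous), softmax_entropy(ambiguous))
```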
OoD training. While we generally strive for methods that are not biased by data, learning confidence from data is an obvious baseline and was explored in [13]. As we are not supposed to know the true OoD distribution, we do not use Pascal VOC, but rather approximate unknown pixels with the Cityscapes void class. In our evaluation, we (i) train a model to maximise the softmax entropy for OoD pixels, or (ii) introduce void as an additional output class and train with it. The uncertainty is then measured as (i) the softmax entropy, or (ii) the score of the void class.
Bayesian DeepLab was introduced by Mukhoti & Gal [48], following Kendall & Gal [35], and is the only uncertainty estimate already applied to semantic segmentation in the literature. The epistemic uncertainty is modeled by adding Dropout layers to the encoder, and approximated by $T$ MC samples, while the aleatoric uncertainty corresponds to the spread of the categorical distribution. The total uncertainty is the predictive entropy of the distribution $y$,

$$\hat{H}[y \mid x] = -\sum_c \left(\frac{1}{T}\sum_t y_c^t\right) \log\left(\frac{1}{T}\sum_t y_c^t\right),$$

where $y_c^t$ is the probability of class $c$ for sample $t$. The epistemic uncertainty is measured as the mutual information (MI) between $y$ and the weights $w$,

$$\hat{I}[y, w \mid x] = \hat{H}[y \mid x] + \frac{1}{T}\sum_t \sum_c y_c^t \log y_c^t.$$

Dirichlet DeepLab. Prior Networks [44] extend the framework of [22] by considering the predicted logits $z$ as log concentration parameters $\alpha$ of a Dirichlet distribution, which is a prior of the predictive categorical distribution $y$. Intuitively, the spread of the Dirichlet prior should model the distributional uncertainty, and remain separate from the data uncertainty modelled by the spread of the categorical distribution. To this end, Malinin & Gales [44] advocate training the network with the objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{p_{\mathrm{in}}}\left[\mathrm{KL}\left[\mathrm{Dir}(\mu \mid \alpha_{\mathrm{in}}) \,\|\, p(\mu \mid x; \theta)\right]\right] + \mathbb{E}_{p_{\mathrm{out}}}\left[\mathrm{KL}\left[\mathrm{Dir}(\mu \mid \alpha_{\mathrm{out}}) \,\|\, p(\mu \mid x; \theta)\right]\right] + \mathrm{CrossEntropy}(y, z),$$

where $\mu$ denotes the parameters of the categorical distribution.
The first term forces ID samples to produce sharp priors with a high concentration $\alpha_{\mathrm{in}}$, computed as the product of smoothed labels and a fixed scale $\alpha_0$. The second term forces OoD samples to produce a flat prior with $\alpha_{\mathrm{out}} = 1$, effectively maximizing the Dirichlet entropy, while the last one helps the convergence of the predictive distribution to the ground truth. We model pixel-wise Dirichlet distributions, approximate OoD samples with void pixels, and measure the Dirichlet differential entropy.
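For the Bayesian DeepLab baseline, the two scores can be sketched for one pixel given the class distributions of T stochastic forward passes. The sketch uses the standard definitions, predictive entropy of the averaged distribution and mutual information as that entropy minus the average per-sample entropy; the function names are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mc_uncertainties(samples):
    """Return (predictive entropy, mutual information) for one pixel.

    `samples` holds T categorical distributions, one per MC dropout pass.
    """
    num_samples = len(samples)
    num_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / num_samples for c in range(num_classes)]
    total = entropy(mean)                                    # total uncertainty
    expected = sum(entropy(s) for s in samples) / num_samples
    return total, total - expected                           # MI = epistemic part


# All passes agree on an ambiguous prediction: uncertainty without model disagreement.
agree = mc_uncertainties([[0.5, 0.5], [0.5, 0.5]])
# Passes are individually confident but contradict each other: high MI.
disagree = mc_uncertainties([[1.0, 0.0], [0.0, 1.0]])
```

Both cases have the same predictive entropy, but only the second has non-zero mutual information, which is why the two scores are evaluated separately.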
kNN Embedding. Different works [51,46] estimate uncertainty using kNN statistics between inferred embedding vectors and their neighbors in the training set. They then compare the classes of the neighbors to the prediction, where discrepancies indicate uncertainty. In more detail, a given trained encoder maps a test image $x$ to an embedding $z_l = f_l(x)$ at layer $l$, and the training set $X$ to a set of neighbors $Z_l := f_l(X)$. Intuitively, if $x$ is OoD, then $z_l$ is also differently distributed and has e.g. neighbors with different classes. Adapting these methods to semantic segmentation faces two issues: (i) The embedding of an intermediate layer of DeepLab is actually a map of embeddings, resulting in more than 10,000 kNN queries for each layer, which is computationally infeasible. We follow [46] and pick only one layer, selected using the FS Lost & Found validation set. (ii) The embedding map has a lower resolution than the input and a given training embedding $z_l^{(i)}$ is therefore not associated with one, but with multiple output labels. As a baseline approximation, we link $z_l^{(i)}$ to all classes in the associated image patch. The relative density [46] then relates the similarity to those of the $k$ nearest neighbors that share the predicted class to the similarity to all $k$ nearest neighbors. Here, $c_i$ is the class of $z_l^{(i)}$ and $c$ is the class of $z_l$ in the downsampled prediction. In contrast to [46], we found that the cosine similarity from [51] works well without additional losses. Finally, we upsample the density of the feature map to the input size, assigning each pixel a density value. As the class association is unclear for encoder-decoder architectures, we also evaluate the density estimation with $k$ neighbors independent of the class, i.e. using all $k$ nearest neighbors regardless of their labels. This assumes that an OoD sample $x$, with a low density w.r.t. $X$, should translate into $z_l$ with a low density w.r.t. $Z_l$.
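The class-independent variant can be illustrated with a brute-force sketch for a single embedding vector. The kernel used here, the exponential of the cosine similarity summed over the k most similar training embeddings, is an assumption of this sketch and not necessarily the exact form used in the benchmark; embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_density(query, train_embeddings, k):
    """Class-independent density from the k most similar training embeddings.

    Assumed kernel: exp(cosine similarity). Low values suggest an OoD embedding.
    """
    sims = sorted((cosine_similarity(query, z) for z in train_embeddings), reverse=True)
    return sum(math.exp(s) for s in sims[:k])


train = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
density_id = knn_density([1.0, 0.0], train, k=2)    # close to the training embeddings
density_ood = knn_density([0.0, 1.0], train, k=2)   # far from the training embeddings
```

Running such a query for every location of an embedding map against the full training set is what makes the approach costly.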

Learned Embedding Density
We now introduce a novel approach that takes inspiration from density estimation methods while greatly improving their scalability and flexibility.
Density estimation using kNN has two weaknesses. First, the estimation is a very coarse isotropic approximation, while the distribution in feature space might be significantly more complex. Second, it requires storing the embeddings of the entire training set and running a large number of NN searches, both of which are costly, especially for large input images. On the other hand, recent works [10,49] on OoD detection leverage more complex generative models, such as normalizing flows [17,36,16], to directly estimate the density of the input sample $x$. This is however not directly applicable to our problem, as (i) learning generative models of images that can capture the entire complexity of e.g. urban scenes is still an open problem; and (ii) the pixel-wise density required here should be conditioned on a very (ideally infinitely) large context, which is computationally intractable.
Our approach mitigates these issues by learning the density of $z$. We start with a training set $X$ drawn from the unknown true distribution $x \sim p^*(x)$, and corresponding embeddings $Z_l$. A normalizing flow with parameters $\theta$ is trained to approximate $p^*(z_l)$ by minimizing the negative log-likelihood (NLL) over all training embeddings in $Z_l$:

$$\mathcal{L}(Z_l) = -\frac{1}{|Z_l|} \sum_{i} \log p_\theta\left(z_l^{(i)}\right).$$

The flow is composed of a bijective function $g_\theta$ that maps an embedding $z_l$ to a latent vector $\eta$ of identical dimensionality and with Gaussian prior $p(\eta) = \mathcal{N}(\eta; 0, I)$. Its log-likelihood is then expressed as

$$\log p_\theta(z_l) = \log p(\eta) + \log \left| \det \frac{\partial g_\theta}{\partial z_l} \right|,$$

and can be efficiently evaluated for some constrained $g_\theta$. At test time, we compute the embedding map of an input image, and estimate the NLL of each of its embeddings. In our experiments, we use the Real-NVP bijector [17], composed of a succession of affine coupling layers, batch normalizations, and random permutations. The benefits of this method are the following: (i) A normalizing flow can learn more complex distributions than the simple kNN kernel or mixture of Gaussians used by [39], where each embedding requires a class label, which is not available here; (ii) Features follow a simpler distribution than the input images, and can thus be correctly fit with simpler flows and shorter training times; (iii) The only hyperparameters are related to the architecture and the training of the flow, and can be cross-validated with the NLL of ID data without any OoD data; (iv) The training embeddings are efficiently summarized in the weights of the generative model with a very low memory footprint.
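To make the log-likelihood computation concrete, the sketch below evaluates the NLL of one embedding under a single affine coupling layer with a standard Gaussian prior. It is a toy illustration: `scale_fn` and `shift_fn` are hypothetical stand-ins for the learned conditioner networks, embeddings are assumed to have even dimension, and the batch normalization and permutation layers of Real-NVP are omitted.

```python
import math


def standard_normal_logpdf(eta):
    """Log-density of an isotropic unit Gaussian prior."""
    return -0.5 * sum(v * v for v in eta) - 0.5 * len(eta) * math.log(2.0 * math.pi)


def affine_coupling(z, scale_fn, shift_fn):
    """One coupling layer: the first half conditions an affine map of the second half.

    Returns the latent vector and the log-determinant of the Jacobian, which is
    simply the sum of the log-scales because the Jacobian is triangular.
    """
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_scale = scale_fn(z_a)
    shift = shift_fn(z_a)
    eta_b = [b * math.exp(s) + t for b, s, t in zip(z_b, log_scale, shift)]
    return z_a + eta_b, sum(log_scale)


def negative_log_likelihood(z, scale_fn, shift_fn):
    """NLL via change of variables: -(log p(eta) + log|det dg/dz|)."""
    eta, log_det = affine_coupling(z, scale_fn, shift_fn)
    return -(standard_normal_logpdf(eta) + log_det)


# Fixed toy conditioners instead of trained networks.
scale_fn = lambda z_a: [math.log(2.0)] * len(z_a)
shift_fn = lambda z_a: [0.0] * len(z_a)

nll_center = negative_log_likelihood([0.0, 0.0], scale_fn, shift_fn)
nll_far = negative_log_likelihood([3.0, 3.0], scale_fn, shift_fn)
```

At test time the same evaluation would be applied to every embedding of the map, with embeddings far from the learned density receiving a high NLL.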
Input preprocessing [41] can be trivially applied to our approach. Since the NLL estimator is an end-to-end network, we can compute the gradients of the average NLL w.r.t. the input image by backpropagating through the flow and the encoder.
A flow ensemble can be built by training separate density estimators over different layers of the segmentation model, similar to [39]. However, the resulting NLL estimates cannot be directly aggregated as is, because the different embedding distributions have varying dispersions and dimensions, and thus densities with very different scales. We propose to normalize the NLL $N(z_l)$ of a given embedding by the average NLL of the training features for that layer:

$$\bar{N}(z_l) = N(z_l) - \mathcal{L}(Z_l).$$

The average training NLL $\mathcal{L}(Z_l)$ is in fact a MC approximation of the differential entropy of the flow, which is intractable. In the ideal case of a multivariate Gaussian, $\bar{N}$ corresponds to the Mahalanobis distance used by [39]. We can then aggregate the normalized, resized scores over different layers. We experiment with two strategies: (i) Using the minimum detects a pixel as OoD only if it has low likelihood through all layers, thus accounting for areas in the feature space that are in-distribution but contain only few training points; (ii) Following [39], taking a weighted average, with weights given by a logistic regression fit on the FS Lost & Found validation set, captures the interaction between the layers.
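The two aggregation strategies can be sketched for a single pixel as follows. The normalization is taken here as subtracting the per-layer average training NLL, which is an assumption of this sketch, and the weights and bias are hypothetical values standing in for a fitted logistic regression.

```python
def normalize_nll(nll, train_mean_nll):
    """Center a layer's NLL by the average training NLL of that layer (assumed form)."""
    return nll - train_mean_nll


def aggregate_min(layer_scores):
    """OoD only if the likelihood is low in every layer: take the smallest score."""
    return min(layer_scores)


def aggregate_weighted(layer_scores, weights, bias=0.0):
    """Linear combination of layer scores, e.g. the logit of a logistic regression."""
    return bias + sum(w * s for w, s in zip(weights, layer_scores))


# NLL of one pixel in two layers, with each layer's average training NLL.
pixel_nll = [5.0, 12.0]
train_mean = [4.0, 10.0]
normalized = [normalize_nll(n, m) for n, m in zip(pixel_nll, train_mean)]

score_min = aggregate_min(normalized)
score_avg = aggregate_weighted(normalized, weights=[0.5, 0.5])
```

Without the normalization, the second layer would dominate both aggregates purely because of its larger scale.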

Submitted Methods
The following methods were submitted to our benchmark since it went online in August 2019. They were not implemented or trained by us, but we include an overview since they are part of the benchmark results.
An outlier head can be added in a multi-task fashion to many semantic segmentation architectures. [5] trains the head in a supervised fashion on both ID and OoD data samples. The training is executed simultaneously with the segmentation training. The outlier detection head then returns a pixel-wise anomaly score. Three variants of this method were submitted; their exact descriptions are in submission for publication.
Image Resynthesis uses reconstruction to estimate the fit of an input to the training data distribution of a generative model. While auto-encoders such as described in section 2 scale poorly to the level of detail in urban driving, good results have been achieved with generative adversarial networks [66,33] that synthesize driving scenes from semantic segmentation. [42] uses such a method to find outliers by comparing the original and resynthesized image, where they train the comparison on flipped semantic labels in the ID data and therefore do not require outliers in training. While the original work [42] experimented with lower resolution segmentation data, [14] submitted an adapted, scaled-up model.
Synboost is a modular approach that combines introspective uncertainties and input reconstruction into a pixel-wise dissimilarity score. Further details are described in [14].

Discussion of Results
We show in Table 1 the results of our benchmark as of December 2020 for the aforementioned datasets and methods. Qualitative examples of all methods are shown in figure 5.
Softmax Confidence. Confirming findings on simpler tasks [39], the softmax confidence is not a reliable score for anomaly detection. While training with OoD data clearly improves the softmax-based detection, it is not much better than Bayesian DeepLab, which does not require such data.
Difference between datasets. For most methods, there is a clear performance gap between the data from Lost & Found and the other datasets. We attribute this to two factors. First, the dataset contains a lot of images with only very small objects. This is indicated by the AP of the random classifier, which equals the fraction of anomalous pixels.
Table 1. Benchmark Results. The gray columns mark the primary metric of the benchmark. Methods are only evaluated on those FS Web datasets with object images appearing on the web after their submission date. For every metric and dataset, the best performance is marked bold and the best performance without OoD training is marked italic.
DeepLab or Softmax Entropy. We further investigate the results on FS Web over time in figure 4. While most methods follow overall trends that can be attributed to the difficulty of the individual objects or differences in data balance, it becomes clear that (i) embedding based methods were picking up blending artifacts in FS Web March 2019, and (ii) Dirichlet DeepLab is performing very inconsistently. (i) appears to be fixed with the advanced blending from June 2019, since the introduction of blended ID objects did not have any effect on embedding based methods. (ii) could indicate a degree of overfitting to specific object types, because Dirichlet DeepLab is trained on OoD data.
Semantic Segmentation Accuracy. The data in table 1 illustrates a tradeoff between anomaly detection and segmentation performance. Methods like Bayesian DeepLab or Outlier Head are consistently among the best methods on all datasets, but need to train with special losses that reduce the segmentation accuracy by up to 10%. If segmentation accuracy is important, methods that do not require any retraining are particularly interesting.
Supervision with OoD data appears to be important for good anomaly detection. On every dataset, the best method required OoD data and is at least 38% better than any 'unsupervised' method. While training with OoD data can in principle lead to overfitting to specific objects, the results on FS Web, which was designed specifically to resemble open-world settings, show that the Outlier Head or Dissimilarity Ensemble are very robust to diverse anomalies. However, we want to emphasize that anomaly detection and uncertainty estimation are very different principles. Our benchmark therefore serves the dual purpose of finding either the best anomaly segmentation method or well-scalable uncertainty estimates, which are simply tested on the proxy task of anomaly detection. Comparing Bayesian DeepLab and the void classifier shows that good uncertainty estimation methods can even compete with some supervised methods, but so far not with specifically designed anomaly segmentation methods.
Inference time differs significantly between methods. Methods can be broadly sorted into two categories: the first performs a single pass through a (sometimes modified) DeepLabv3+ architecture, and the second applies additional processing on top of this forward pass. Our measurements show that methods in the second category have up to two orders of magnitude higher inference time. The only exception is the single-layer embedding density, where inference time is comparable to single-pass methods. While nearly all methods 3 were executed as optimised TensorFlow graphs, measurements are still dependent on the implementation details, and possible parallelization is limited by GPU memory constraints. For example, the difference between softmax max-prob, softmax entropy, and Dirichlet entropy can only be explained by inefficiencies in the softmax entropy implementation that cause a difference of more than 0.2 s.
Challenges in Method Adaptation. The results reveal that some methods cannot be easily adapted to semantic segmentation. For example, retraining required by special losses can impair the segmentation performance, and we found that these losses (e.g. for Dirichlet DeepLab) were often unstable during training or did not converge. Other challenges arise from the complex network structures, which complicate the translation of class-based embedding methods such as deep k-nearest neighbor [51] to segmentation. This is illustrated by the performance of our simple implementation.

Conclusion
In this work, we introduced Fishyscapes, a benchmark for anomaly detection in semantic segmentation for urban driving. Comparing state-of-the-art methods on this complex task for the first time, we draw multiple conclusions:
- The softmax output from a standard classifier is a bad indicator for anomaly detection.
- Most of the better performing methods required special losses that reduce the semantic segmentation accuracy.
- Supervision of anomaly segmentation methods with OoD data consistently outperformed unsupervised methods even in open-world scenarios.
Overall, the methods compared in our benchmark so far leave a lot of room for improvement. To safely deploy semantic segmentation methods in autonomous cars, further research is required. As a public benchmark, Fishyscapes supports the evaluation of new methods on urban driving scenarios.

A. Misclassification Detection
In addition to anomaly detection, we test some methods on the detection of misclassifications from the semantic segmentation output. Misclassification detection is another proxy classification task that correlates with uncertainty. However, misclassification mixes uncertainty from
- noise in the input (aleatoric uncertainty)
- model uncertainty
- shifts in data balance (softmax classification implicitly learns a prior distribution of the classes over the training set)
Nevertheless, failure detection is an important problem for deployment on autonomous agents, e.g. as part of sensor fusion mechanisms, and misclassification detection is used in different related work [32,37,29,34] to benchmark uncertainty estimates.
Dataset. We test misclassification detection on a diverse mixture of different data sources that introduce sources of uncertainty in the input. From Foggy Driving [60], we select all images. From Foggy Zurich [12], we map classes sky and fence to void, as their labelling is not accurate and sometimes areas that are not visible due to fog are simply labelled sky. For WildDash [69], we use all images. For Mapillary Vistas [50], we sample 50 random images from the validation  set and apply the label mapping described in Table 2.
During evaluation all pixels labelled as void are ignored.
Evaluated Methods. Of the methods evaluated on anomaly detection, we note that the void classifier produces meaningless results for misclassification detection, since a high void output score produces the exact misclassification it is detecting. Furthermore, we did not evaluate the learned embedding density.
Results of our evaluation are presented in table 3. For Bayesian DeepLab, we find the predictive entropy to be a better indicator of misclassification, which was also observed by [35]. The kNN density shows results similar to the other methods, hinting that embedding-based methods cannot be entirely classified as OoD-specific, but may also be able to detect input noise that is very different from the training distribution. Overall, the experiments do not reveal a single method that performs significantly better than others.

B. Details on the Methods
In this section we provide implementation details on the evaluated methods to ease the reproducibility of the results presented in this paper.

B.1. Semantic Segmentation Model
We use the state-of-the-art model DeepLabv3+ [9] with Xception-71 backbone, image-level features, and dense prediction cell. When no retraining is required, we use the original model trained on Cityscapes 4 .

B.2. Softmax
ODIN [41] applies input preprocessing and temperature scaling to improve the OoD detection ability of the maximum softmax probability. Early experiments on Fishyscapes showed that (i) temperature scaling did not improve the results of this baseline much, and (ii) input preprocessing w.r.t. the softmax score is not possible due to the limited GPU memory and the large size of the DeepLab model. As the maximum probability is in any case not competitive with the other methods, we decided not to develop that baseline further.

B.3. Bayesian DeepLab
We reproduce the setup described by Mukhoti & Gal [48]. As such, we use the Xception-65 backbone pretrained on ImageNet, and insert dropout layers in its middle flow. We train for 90k iterations, with a batch size of 16, a crop size of 513 × 513, and a learning rate of $7 \cdot 10^{-3}$ with polynomial decay.
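For reference, a polynomial decay schedule with the stated base learning rate and iteration count can be sketched as below. The decay power of 0.9 and the final learning rate of zero are assumptions (they are common choices for DeepLab-style training) and are not stated in the text.

```python
def poly_learning_rate(step, base_lr=7e-3, max_steps=90_000, power=0.9):
    """Polynomial decay from base_lr at step 0 to zero at max_steps.

    power=0.9 is an assumed value, not one reported in the paper.
    """
    step = min(step, max_steps)
    return base_lr * (1.0 - step / max_steps) ** power

lr_start = poly_learning_rate(0)
lr_mid = poly_learning_rate(45_000)
lr_end = poly_learning_rate(90_000)
```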

B.4. Dirichlet DeepLab
Following Malinin & Gales [44], we interpret the output logits of DeepLab as log-concentration parameters $\alpha$ and train with the loss described by Equation (3) and implemented with the TensorFlow Probability [15] framework. For the first term, the target labels are smoothed with $\epsilon = 0.01$ and scaled by $\alpha_0 = 100$ to obtain target concentrations. To ensure convergence of the classifier, we found it necessary to downweight both the first and second terms by 0.1 and to initialize all but the last layer with the original DeepLab weights.
We also tried to replace the first term by the negative log-likelihood of the Dirichlet distribution but were unable to make the training converge.
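The construction of the target concentrations can be illustrated with a short sketch. It assumes one particular smoothing convention, in which the true class receives probability 1 − ε and the remaining ε is spread uniformly over the other classes; the text does not specify the exact convention. The class count of 19 corresponds to the Cityscapes evaluation classes, and the function name is hypothetical.

```python
def target_concentrations(label, num_classes=19, epsilon=0.01, alpha0=100.0):
    """Smooth a one-hot label and scale it to Dirichlet target concentrations.

    Assumed convention: true class gets 1 - epsilon, the others share epsilon.
    """
    off_value = epsilon / (num_classes - 1)
    probs = [off_value] * num_classes
    probs[label] = 1.0 - epsilon
    return [alpha0 * p for p in probs]

alphas = target_concentrations(label=3)
```

Under this convention the concentrations sum to α0 = 100, so the target Dirichlet is sharply peaked at the labelled class while keeping every concentration strictly positive.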

B.5. kNN Embedding
Layer of Embedding. As explained in Section 4.1, we had to restrict the kNN queries to one layer. A single layer of the network already has more than 10000 embedding vectors and we need to find k nearest neighbors for all of them. Querying over multiple layers therefore becomes infeasible. To select a layer of the network, we test multiple candidates on the FS Lost & Found validation set. We observed that our kNN fitting with hnswlib 5 [45] was not deterministic, so we report the average performance on the validation set over 3 different experiments. Additionally, we had to reduce the complexity of kNN fitting by randomly sampling 1000 images from Cityscapes instead of using the whole training set (2975 images).
For the kNN density, we provide the results for different layers in Table 4. For class-based embedding, we perform a similar search for the choice of layer. The result can be found in Table 5.
Table 5. Parameter search of the embedding layer for class based relative kNN density. The AP is computed on the validation set of FS Static. Based on these results, we use the layer xception 71/exit flow/block2 in all our experiments.
Number of Neighbors. We select k according to Tables 6  and 7. All values are measured with the same kNN fitting. As the computational time for each query grows with k, small values are preferable. Note that by definition, the relative class density needs a sufficiently high k such that not all neighbors are from the same class.

B.6. Learned Embedding Density
Flow architecture. The normalizing flow follows the simple architecture of Real-NVP. We stack 32 steps, each one composed of an affine coupling layer, a batch normalization layer, and a fixed random permutation. As recommended by [36], we initialize the weights of the coupling layers such that they initially perform identity transformations.
Table 6. Parameter search for the number of nearest neighbors for kNN embedding density. As computing time increases with k, we select k = 20.
Table 7. Parameter search for the number of nearest neighbors for the class based kNN relative density. As computing time increases with k, we select k = 100.
Flow training. For a given DeepLab layer, we export the embeddings computed on all the images of the Cityscapes training set. The number of such datapoints depends on the stride of the layer, and amounts to 22M for a stride of 16. We keep 2000 of them for validation and testing, and train on the remaining embeddings for 200k iterations, with a learning rate of $10^{-4}$, and the Adam optimizer. Note that we can compare flow models based on how well they fit the in-distribution embeddings, and thus do not require any OoD data for hyperparameter search.
Layer selection. OoD data is only required to select the layer at which the embeddings are extracted. The corresponding feature space should best separate OoD and ID data, such that OoD embeddings are assigned low likelihood. We found that it is critical to extract embeddings before ReLU activations, as some dimensions might be negative for all training points, thus making the training highly unstable. We show in Table 8 the AP on the FS Lost & Found validation set for different layers. We first observe that we did not achieve training convergence for those layers that showed best results in the kNN method. This may be due to the high dimensionality of these layers, and/or because the flow is not well suited to approximate these distributions. We also notice that overall layers in the encoder middle flow work best, while Mukhoti & Gal [48] insert dropout layers at this particular stage. While we do not know the reason behind their design decision, we hypothesize that they found these layers to best model the epistemic uncertainty.
Table 8. Cross-validation of the embedding layer for the learned density. The AP is computed on the validation set of FS Lost & Found. Based on these results, we use the layer decoder conv1 0 in all our experiments. We could not manage to make the training of the aspp features layer converge, most likely due to a very peaky distribution that induces numerical instabilities.
Effect of input preprocessing. As previously reported by [41,39], we observe that this simple input preprocessing brings substantial improvements to the detection score on the test set. We show in Table 9 the AP for different noise magnitudes $\epsilon$.
Table 9. Cross-validation of the input preprocessing for the learned density. Based on these results, we apply noise with magnitude $\epsilon = 0.25$ in all our experiments.