1 Introduction

Deep learning has greatly improved the accuracy of computer vision methods (Chen et al., 2018; He et al., 2017; Fu et al., 2018; Sun et al., 2018) and enabled semantic understanding in robotic applications (Mccormac et al., 2018; Florence et al., 2018; Liang et al., 2018). However, while these algorithms are usually compared on closed-world datasets with a fixed set of classes (Geiger et al., 2012; Cordts et al., 2016), the real world is uncontrollable, and an incorrect reaction of an autonomous agent to an unexpected input can have disastrous consequences (Bozhinoski et al., 2019).

As such, to reach full autonomy while ensuring safety and reliability, decision-making systems need information about outliers and uncertain or ambiguous cases that might affect the quality of the perception output. As illustrated in Fig. 1, deep convolutional neural networks (CNNs) react unpredictably to inputs that deviate from their training distribution: outlier objects are interpolated between the available classes and assigned high confidence. Existing research to detect such behaviour is often labeled as out-of-distribution (OoD), anomaly, or novelty detection, and has so far focused on developing methods for image classification, evaluated on simple datasets like MNIST or CIFAR-10 (Malinin and Gales, 2018; Papernot & McDaniel, 2018; Hendrycks & Gimpel, 2017; Lee et al., 2018; Ruff et al., 2018; Golan & El-Yaniv, 2018; Choi et al., 2018; Sabokrou et al., 2018; Pidhorskyi et al., 2018). How these methods generalize to more elaborate network architectures and pixel-wise uncertainty estimation has not been assessed in prior work.

Motivated by these practical needs, we introduce ‘Fishyscapes’, a benchmark that evaluates uncertainty estimates for semantic segmentation. The benchmark measures how well methods detect potentially hazardous anomalies in driving scenes. Fishyscapes is based on data from Cityscapes (Cordts et al., 2016), a popular benchmark for semantic segmentation in urban driving. Our benchmark consists of (i) Fishyscapes Web, where images from Cityscapes are overlaid with objects regularly crawled from the web in an open-world setup, and (ii) Fishyscapes Lost and Found, which builds on a road hazard dataset collected with the same setup as Cityscapes (Pinggera et al., 2016) and which we supplemented with labels.

To provide a broad overview, we adapt a variety of methods to semantic segmentation that were originally designed for image classification. Because segmentation networks are much more complex and have high computational costs, this adaptation is not trivial, and we suggest different approximations to overcome these challenges.

Fig. 1

When exposed to an object type unseen during training, a state-of-the-art semantic segmentation model (Chen et al., 2018) predicts familiar labels (street sign, road) with high confidence. To detect such failures, we evaluate various methods that assign a pixel-wise out-of-distribution score, where higher values are darker. The blue outline is added for illustration.

Our experiments show that the embeddings of intermediate layers hold important information for anomaly detection. Based on recent work on generative models, we develop a novel method using density estimation in the embedding space. However, we also show that varying visual appearance can mislead feature-based and other methods. None of the evaluated methods achieves the accuracy required for safety-critical applications. We conclude that these remain open problems, with our benchmark enabling the community to measure progress and build upon the best performing methods so far.

To summarize, our contributions are the following:

  • We introduce the first public benchmark evaluating pixel-wise uncertainty estimates in semantic segmentation, with a dynamic, self-updating dataset for anomaly detection.

  • We report an extensive evaluation with diverse state-of-the-art approaches to uncertainty estimation, adapted to the semantic segmentation task, and present a novel method for anomaly detection.

  • We show a clear gap between the alleged capabilities of established methods and their performance on this real-world task, thereby confirming the necessity of our benchmark to support further research in this direction.

2 Related Work

Here we review the most relevant works in semantic segmentation and their benchmarks, and methods that aim at providing a confidence estimate of the output of deep networks.

2.1 Semantic Segmentation

State-of-the-art models are fully-convolutional deep networks trained with pixel-wise supervision. Most works (Ronneberger et al., 2015; Badrinarayanan et al., 2017; Chen et al., 2016; Chen et al., 2018) adopt an encoder-decoder architecture that initially reduces the spatial resolution of the feature maps and subsequently upsamples them with learned transposed convolutions, fixed bilinear interpolation, or unpooling. Additionally, dilated convolutions or spatial pyramid pooling enlarge the receptive field and improve accuracy.

Popular benchmarks compare methods on the segmentation of objects (Everingham et al., 2010) and urban scenes. In the latter case, Cityscapes (Cordts et al., 2016) is a well-established dataset depicting street scenes in European cities with dense annotations for a limited set of classes. Efforts have been made to provide datasets with increased diversity, either in terms of environments, with WildDash (Zendel et al., 2018), which incorporates data from numerous parts of the world, or with Mapillary (Neuhold et al., 2017), which adds many more classes. Recent data releases add multi-sensor and multi-modality recordings on top of that (Sun et al., 2020; Geyer et al., 2020; Caesar et al., 2020). Like ours, some datasets are explicitly derived from Cityscapes, the most relevant being Foggy Cityscapes (Sakaridis et al., 2018), which overlays synthetic fog onto the original dataset to evaluate more difficult driving conditions. The Robust Vision Challenge also assesses the generalization of learned models across different datasets.

These benchmarks evaluate robustness and reliability only by ranking methods according to their accuracy, without taking into account the uncertainty of their predictions. Additionally, although one cannot assume that models trained with closed-world data will only encounter known classes, such scenarios are rarely quantitatively evaluated. To our knowledge, WildDash (Zendel et al., 2018) is the only public benchmark that explicitly reports uncertainty w.r.t. OoD examples. These are however drawn from a very limited set of full-image outliers, as WildDash mainly focuses on accuracy, while we introduce a diverse set of objects. Complementarily, the Dark Zurich dataset (Sakaridis et al., 2020) allows for uncertainty-aware evaluation of semantic segmentation models with regard to degraded sensor inputs, i.e. it evaluates aleatoric uncertainty.

Bevandic et al. (2019) experiment with OoD objects for semantic segmentation by overlaying objects on Cityscapes images in a manner similar to ours. They however assume the availability of a large OoD dataset, which is not realistic in an open-world context, and thus mostly evaluate supervised methods. In contrast, we assess a wide range of methods that do not require OoD data. Mukhoti and Gal (2018) introduce a new metric for uncertainty evaluation and are the first to quantitatively assess misclassification for segmentation, yet they only compare a few methods on normal ID data. The MVTec benchmark (Bergmann et al., 2019) compares a range of anomaly segmentation methods on images of single objects to find industrial production anomalies, and mostly compares methods that focus on low-power computing. Following our work, the CAOS benchmark (Hendrycks et al., 2019) also compares anomaly segmentation methods in simulated and real-world driving scenes. While their results confirm our finding that most established methods scale poorly to semantic segmentation, their methodology lacks open-world testing, which we argue later is important for true anomaly detection.

2.2 Uncertainty Estimation

There is a large body of work that aims at detecting OoD data or misclassification by defining uncertainty or confidence estimates.

Probabilistic modeling of a neural network’s output is a straightforward approach in uncertainty estimation. The softmax score, i.e. the classification probability of the predicted class, was shown to be a first baseline (Hendrycks & Gimpel, 2017), although sensitive to adversarial examples (Goodfellow et al., 2015). Its performance was improved by ODIN (Liang et al., 2018), which applies noise to the input with the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and calibrates the score with temperature scaling (Guo et al., 2017). Probabilistic modelling has been extended further in Deep Belief Networks that propagate activation distributions throughout the network (Frey and Hinton, 1999; Loquercio et al., 2020).

Bayesian deep learning (Gal, 2016; Kendall and Gal, 2017) adopts a probabilistic view by designing deep models whose outputs and weights are probability distributions instead of point estimates. Uncertainties are then defined as dispersions of such distributions, and can be of several types. Epistemic uncertainty, or model uncertainty, corresponds to the uncertainty over the model parameters that best fit the training data for a given model architecture. As evaluating the posterior over the weights is intractable in deep non-linear networks, recent works perform Monte Carlo (MC) sampling with dropout (Gal & Ghahramani, 2016) or ensembles (Lakshminarayanan et al., 2017). Aleatoric uncertainty, or data uncertainty, arises from the noise in the input data, such as sensor noise. Both have been applied to semantic segmentation (Kendall and Gal, 2017), and subsequently evaluated for misclassification detection (Mukhoti & Gal, 2018), but only on ID data and not for OoD detection. Malinin and Gales (2018) later single out distributional uncertainty to represent model misspecification with respect to OoD inputs. Their approach however was only applied to image classification on toy datasets, and requires OoD data during the training stage. To address the latter constraint, Lee et al. (2018) earlier proposed a Generative Adversarial Network (GAN) that generates OoD data as boundary samples. This is however very challenging to scale to complex and high-dimensional data like high-resolution images of urban scenes. Recently, Bayesian methods have investigated the inductive bias of network structures beyond weights (Wilson & Izmailov, 2020). For example, Antoran et al. (2020) extract meaningful uncertainties from an ‘ensemble’ of network activations at varying depths, and Yehezkel Rohekar et al. (2019) employ a sampling scheme over architectures.

OoD and novelty detection is often tackled by non-Bayesian approaches. Feature introspection amounts to measuring discrepancies between distributions of deep features of training data and OoD samples, using either nearest-neighbor (NN) statistics (Papernot & McDaniel, 2018; Mandelbaum & Weinshall, 2017) or Gaussian approximations (Lee et al., 2018; Amersfoort et al., 2020). These methods have the benefit of working on any classification model without requiring specific training. Recently, connections between feature density and Bayesian uncertainties have been investigated (Postels et al., 2020). On the other hand, approaches specifically tailored to OoD detection include one-class classification (Ruff et al., 2018; Golan & El-Yaniv, 2018), which aims at creating discriminative embeddings, density estimation (Choi et al., 2018; Nalisnick et al., 2019), which estimates the likelihood of samples w.r.t. the true data distribution, and generative reconstruction (Sabokrou et al., 2018; Pidhorskyi et al., 2018; Gong et al., 2019), which uses the quality of auto-encoder reconstructions to discriminate OoD samples. Richter and Roy (2017) apply the latter to simple real images recorded by a robotic car and successfully detect new environments.

3 Benchmark Design

Because it is not possible to produce ground truth for uncertainty values, evaluating estimators is not straightforward. We therefore compare them on the proxy classification task (Hendrycks & Gimpel, 2017) of detecting anomalous inputs. The uncertainty estimates are treated as scores of a binary classifier obtained by thresholding; the performance of this classifier reflects how suitable the estimated uncertainty is for anomaly detection.

Such an approach however introduces a major issue for the design of a public OoD detection benchmark. With publicly available ID training data A and OoD inputs B, it is not possible to distinguish between an uncertainty method that informs a classifier to discriminate A from any other input, and a classifier trained to discriminate A from B. The latter option clearly does not represent progress towards the goal of general uncertainty estimation, but rather overfitting.

To mitigate this, we (i) only release a small validation set with associated ground truth masks, while keeping larger test sets hidden, (ii) continuously evaluate submitted methods against a dynamically changing, synthetic dataset, and (iii) compare the performance on the dynamic dataset with evaluations on real-world data. Additionally, all submissions to the benchmark must indicate whether any OoD data was used during training, which is cross-checked against linked publications.

Examples from all benchmark datasets are shown in Fig. 2.

Fig. 2

Qualitative examples of Fishyscapes Static (rows 1–2), Fishyscapes Web (rows 3–5), and Fishyscapes Lost and Found (rows 6–8). The ground truth contains labels for ID (blue) and OoD (red) pixels, as well as ignored void pixels (black). We additionally show the output of the best method per dataset in column 4 and the best method without OoD training in the last column. We report the AP of each method output in its top right corner (Color figure online).

3.1 Does the Method Work in an Open World?

The open-world scenario captures the fact that an autonomous agent freely interacting with the world must be able to deal with the unexpected at all times. To test perception methods in an open-world scenario, a benchmark therefore needs to present truly unexpected inputs. We argue that this is never truly possible with a fixed dataset, which by design has limited diversity and over time may simply identify those methods that deal best with the kind of objects included in the dataset. Instead, we propose a dynamically changing dataset that samples diverse objects at every iteration.

In general, there are three options to generate such dynamic datasets: at every iteration, one may (i) capture new data in the wild and annotate it, (ii) render new objects in simulation, or (iii) capture new objects in the wild, but blend them into already annotated scenes. While data from the wild is essential to test methods in realistic settings, annotation for semantic segmentation is very expensive and not a sustainable way to generate new datasets multiple times per year. Between (ii) and (iii) there is an essential trade-off: rendering in 3D ensures physically viable object placement and consistent lighting, whereas images of diverse objects in the wild are much more readily available than textured 3D models and can be blended into real-world scenes. We acknowledge that there is an ongoing debate whether photorealistic rendering engines or modern blending techniques achieve more realistic images, which was touched upon by a follow-up work responding to this benchmark (Hendrycks et al., 2019). In this work, we base our dataset FS Web on approach (iii). In the following, we describe a blending-based reference dataset FS Static and the dynamically changing dataset FS Web.

FS Static is based on the validation set of Cityscapes (Cordts et al., 2016). It has limited visual diversity, which makes it easy to ensure that it contains none of the overlaid objects. In addition, background pixels originally belonging to the void class are excluded from the evaluation, as they may be borderline OoD. Anomalous objects are extracted from the generic Pascal VOC (Everingham et al., 2010) dataset using the associated segmentation masks. We only overlay objects from classes that cannot be found in Cityscapes: aeroplane, bird, boat, bottle, cat, chair, cow, dog, horse, sheep, sofa, tvmonitor. Objects cropped by the image borders or too small to be seen are filtered out. We randomly size and position the objects on the underlying image, making sure that none of them appear on the ego-vehicle. Objects from mammal classes have a higher probability of appearing in the lower half of the image, while classes like birds or airplanes have a higher probability in the upper half. The placement is not otherwise restricted, so that every pixel in the image, apart from the ego-vehicle, is comparably likely to be anomalous. To match the image characteristics of Cityscapes, we employ a series of postprocessing steps similar to those described in Abu Alhaija et al. (2018), without those steps that require 3D models of the objects, e.g. to adapt shadows and lighting.

To make the task of anomaly detection harder, we add synthetic fog (Sakaridis et al., 2018; Dai et al., 2020) to the in-distribution pixels with a per-image probability. This prevents fraudulent methods from comparing the input against a fixed set of Cityscapes images. The dataset is split into a minimal public validation set of 30 images and a hidden test set of 1000 images. It contains in total around 4.5e7 OoD and 1.8e9 ID pixels. The validation set only contains a small disjoint set of Pascal VOC objects to prevent few-shot learning on our data creation method.

Fig. 3

Illustration of the blending process and improvements (v2) applied in June 2019. While color adaptation to the predominantly gray Cityscapes images is visually most obvious, important improvements in v2 include depth and motion blur, as well as glow effects.

FS Web is built similarly to FS Static, but with overlay objects crawled from the internet using a changing list of keywords. Our script searches for images with transparent background, uploaded in a recent timeframe, and filters out images that are too small. The only manual process is filtering out images that are not suitable, e.g. with decorative borders or watermarks. The dataset for March 2019 contains 4.9e7 OoD and 1.8e9 ID pixels. As the diversity of images and color distributions for the images from the web is much greater than those from Pascal VOC, we also adapt our overlay procedure. In total, we follow these steps, some of which were added from June 2019 onwards (marked with *):

  • in case the image does not already have a smooth alpha channel, smooth the mask of the objects around the borders for a small transparency gradient

  • adapt the brightness of the object towards the mean brightness of the overlayed pixels

  • apply the inverse color histogram of the Cityscapes image to shift the color distribution towards the one found on the underlying image*

  • radial motion blur*

  • depth blur based on the position in the image*

  • color noise

  • glow effects to simulate overexposure*

Figure 3 shows an illustration of the blending results.
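For concreteness, the following is a minimal sketch of the core compositing step, assuming numpy and PIL; `blend_object` is a hypothetical helper, not the benchmark code, and the full pipeline additionally applies the starred refinements listed above.

```python
import numpy as np
from PIL import Image

def blend_object(scene_path, object_path, position, brightness_adapt=True):
    """Minimal alpha-compositing sketch: paste an RGBA object crop onto a
    Cityscapes frame and return the blended image plus the anomaly mask.
    Histogram shifts, motion/depth blur, colour noise and glow are omitted."""
    scene = np.asarray(Image.open(scene_path).convert("RGB"), dtype=np.float32)
    obj = np.asarray(Image.open(object_path).convert("RGBA"), dtype=np.float32)
    rgb, alpha = obj[..., :3], obj[..., 3:] / 255.0

    top, left = position
    h, w = rgb.shape[:2]
    patch = scene[top:top + h, left:left + w]

    if brightness_adapt:  # shift object brightness towards the covered pixels
        rgb = np.clip(rgb + (patch.mean() - rgb.mean()), 0, 255)

    blended = scene.copy()
    blended[top:top + h, left:left + w] = alpha * rgb + (1 - alpha) * patch

    mask = np.zeros(scene.shape[:2], dtype=np.uint8)      # 1 = anomalous pixel
    mask[top:top + h, left:left + w] = (alpha[..., 0] > 0.5).astype(np.uint8)
    return blended.astype(np.uint8), mask
```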

As discussed, the blending process is part of a trade-off to make an open-world dataset feasible. To further ensure that methods do not overfit to any artifacts created by the blending process, but detect anomalies based on their semantics and appearance, we include a sample of ID objects in the blending dataset. For this, we create a database from objects in the Cityscapes training dataset (car, person, truck, bus, train, bike) where we manually filter out any occluded instances. We then decide at random for every image whether to blend an anomalous object or a Cityscapes object, where we skip random placement and histogram adaptation for the latter. This addition was introduced in FS Web Jan 2020. An example can be seen in Fig. 2.

As indicated, the postprocessing was improved between iterations of the dataset. Because the purpose of the FS Web dataset is to measure any possible overfitting of the methods through a dynamically changing dataset, we will also continue to refine this image overlay procedure, updating our method with recent research results. Any update to the blending is also applied to the FS Static validation set, allowing submissions to validate the effect of blending improvements.

3.2 Does the Method Work on Real Images?

As discussed in Sect. 3.1, capturing and annotating driving scenes multiple times per year is not sustainable, which made it necessary to use synthetic data generation for the dynamic dataset. However, for safe deployment it is equally important to test methods under real-world conditions. This is the purpose of the FS Lost and Found dataset in our benchmark.

FS Lost and Found is based on the original Lost and Found dataset (Pinggera et al., 2016). However, the original dataset only includes annotations for the anomalous objects and a coarse annotation of the road. This does not allow for an appropriate evaluation of anomaly detection, as objects and road are very distinct in texture; distinguishing anomalous objects from e.g. building structures is more challenging. In order to make use of the full image, we add pixel-wise annotations that distinguish between objects (the anomalies), background (classes contained in Cityscapes) and void (anything not contained in the Cityscapes classes that still appears in the training images). Additionally, we filter out those sequences where the ‘road hazards’ are children or bikes, because these are part of regular Cityscapes data and not anomalies. We subsample the repetitive sequences, labelling at least every sixth image, and remove images that do not contain objects. In total, we present a public validation set of 100 images and a test set of 275 images, based on disjoint sets of locations.

While the Lost and Found images were captured with the same setup as Cityscapes, the distribution of street scenery is very different. The images were captured in small streets of residential areas, in industrial areas, or on big parking lots. The anomalous objects are usually very small and not equally distributed over the image. Nevertheless, the dataset allows testing on real images as opposed to synthetic data, therefore preventing any overfitting to synthetic image processing. This is especially important for parameter tuning on the validation set.

3.3 Metrics

We consider metrics associated with a binary classification task. Since the ID and OoD data are heavily unbalanced, metrics based on the receiver operating characteristic (ROC) are not suitable (Saito & Rehmsmeier, 2015). We therefore base the ranking and primary evaluation on the average precision (AP). However, as the number of false positives in high-recall regions is particularly relevant for safety-critical applications, we additionally report the false positive rate at 95% recall (\(\text {FPR}_\text {95}\)). This metric was also used in Hendrycks and Gimpel (2017) and emphasizes safety.
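Both metrics can be computed from flattened pixel-wise scores with standard tooling. The sketch below assumes scikit-learn and a hypothetical `anomaly_metrics` helper, with void pixels already removed from the arrays.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_curve

def anomaly_metrics(scores, labels):
    """Compute AP and FPR at 95% recall from pixel-wise anomaly scores.
    `scores`: higher = more anomalous; `labels`: 1 for OoD pixels, 0 for ID.
    Both are flat numpy arrays with void pixels already excluded."""
    ap = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]   # first operating point with >= 95% recall
    return ap, fpr95
```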

Semantic classification is not the goal of our benchmark, but uncertainty estimation and outlier detection should not come at a high cost in segmentation accuracy. We therefore additionally report the mean intersection over union (IoU) of the semantic segmentation on the Cityscapes validation set.

For safety-critical systems, it is not only important to detect anomalies, but also to be fast enough to allow for a reaction. We therefore report the inference time of joint segmentation and anomaly detection per single frame. Times are measured over 500 images of the Cityscapes validation set on a GeForce 1080 Ti GPU.

4 Evaluated Methods

We now present the methods that are evaluated in Fishyscapes. We first describe the existing baselines and how we adapted them to the task of semantic segmentation. We then propose a novel method based on learned embedding density. Finally, we list the methods that have been submitted to the public benchmark so far.

All approaches are applied to the state-of-the-art semantic segmentation model DeepLab-v3+ (Chen et al., 2018). Further implementation details are listed in the supplementary material.

4.1 Baselines

Softmax The maximum softmax probability is a commonly used baseline and was evaluated in Hendrycks and Gimpel (2017) for OoD detection. We apply the metric pixel-wise and additionally measure the softmax entropy, as proposed by Lee et al. (2018), which captures more information from the softmax.
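As an illustration, both baseline scores can be computed directly from the pixel-wise logits; the function below is a hypothetical numpy sketch, not the benchmark implementation.

```python
import numpy as np

def softmax_scores(logits):
    """Pixel-wise anomaly scores from segmentation logits of shape (H, W, C):
    1 - max softmax probability (Hendrycks & Gimpel, 2017) and the softmax
    entropy. Higher values indicate more anomalous pixels."""
    z = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    max_prob_score = 1.0 - probs.max(axis=-1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return max_prob_score, entropy
```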

OoD training While we generally strive for methods that are not biased by data, learning confidence from data is an obvious baseline and was explored in DeVries and Taylor (2018). As we are not supposed to know the true OoD distribution, we do not use Pascal VOC, but rather approximate unknown pixels with the Cityscapes void class. In our evaluation, we (i) train a model to maximise the softmax entropy for OoD pixels, or (ii) introduce void as an additional output class and train with it. The uncertainty is then measured as (i) the softmax entropy, or (ii) the score of the void class.
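A minimal sketch of variant (i) is given below, assuming a TensorFlow training setup; `void_id` and the weight `beta` are illustrative choices, not values used in the benchmark.

```python
import tensorflow as tf

def void_entropy_loss(logits, labels, void_id=255, beta=0.1):
    """Sketch of variant (i): cross-entropy on labelled pixels plus an
    entropy-maximisation term on void pixels used as OoD proxies."""
    is_void = tf.cast(tf.equal(labels, void_id), tf.float32)
    safe_labels = tf.where(tf.equal(labels, void_id), tf.zeros_like(labels), labels)
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=safe_labels, logits=logits)
    ce = tf.reduce_sum(ce * (1.0 - is_void)) / (tf.reduce_sum(1.0 - is_void) + 1e-6)

    probs = tf.nn.softmax(logits)
    entropy = -tf.reduce_sum(probs * tf.math.log(probs + 1e-12), axis=-1)
    void_entropy = tf.reduce_sum(entropy * is_void) / (tf.reduce_sum(is_void) + 1e-6)
    return ce - beta * void_entropy     # maximise softmax entropy on void pixels
```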

Bayesian DeepLab was introduced by Mukhoti and Gal (2018), following Kendall and Gal (2017), and is the only uncertainty estimate already applied to semantic segmentation in the literature. The epistemic uncertainty is modeled by adding dropout layers to the encoder and approximated with T Monte Carlo (MC) samples, while the aleatoric uncertainty corresponds to the spread of the categorical distribution. The total uncertainty is the predictive entropy of the distribution \(\mathbf {y}\),

$$\begin{aligned} \hat{\mathbb {H}}\left[ \mathbf {y}|\mathbf {x}\right] = -\sum _c\left( \frac{1}{T}\sum _t y_c^t\right) \log \left( \frac{1}{T}\sum _t y_c^t\right) , \end{aligned}$$
(1)

where \(y_c^t\) is the probability of class c for sample t. The epistemic uncertainty is measured as the mutual information (MI) between \(\mathbf {y}\) and the weights \(\mathbf {w}\),

$$\begin{aligned} \hat{\mathbb {I}}\left[ \mathbf {y}, \mathbf {w} | \mathbf {x}\right] = \hat{\mathbb {H}}\left[ \mathbf {y}|\mathbf {x}\right] + \frac{1}{T}\sum _{c, t} y_c^t\log y_c^t. \end{aligned}$$
(2)
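Both quantities can be computed from the T sampled softmax outputs. The following numpy sketch follows Eqs. (1) and (2), with `mc_probs` a hypothetical array of shape (T, H, W, C) holding the per-sample probabilities \(y_c^t\).

```python
import numpy as np

def epistemic_uncertainty(mc_probs):
    """Predictive entropy (Eq. 1) and mutual information (Eq. 2) from T Monte
    Carlo dropout samples of shape (T, H, W, C)."""
    eps = 1e-12
    mean_probs = mc_probs.mean(axis=0)                                    # (H, W, C)
    pred_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)   # Eq. (1)
    expected_entropy = -(mc_probs * np.log(mc_probs + eps)).sum(axis=-1).mean(axis=0)
    mutual_info = pred_entropy - expected_entropy                          # Eq. (2)
    return pred_entropy, mutual_info
```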

Dirichlet DeepLab Prior networks (Malinin and Gales, 2018) extend the framework of Gal (2016) by considering the predicted logits \(\mathbf {z}\) as log concentration parameters \(\varvec{\alpha }\) of a Dirichlet distribution, which is a prior over the predictive categorical distribution \(\mathbf {y}\). Intuitively, the spread of the Dirichlet prior should model the distributional uncertainty and remain separate from the data uncertainty modelled by the spread of the categorical distribution. To this end, Malinin and Gales (2018) advocate training the network with the objective:

$$\begin{aligned} \mathcal {L} = \mathbb {E}_{p_\mathrm {in}(\mathbf {x})}\left[ \mathrm {KL}\left[ \mathrm {Dir}\left( \mathbf {y}|\varvec{\alpha }_\mathrm {in}\right) \,\Vert \, p\left( \mathbf {y}|\mathbf {x}\right) \right] \right] + \mathbb {E}_{p_\mathrm {out}(\mathbf {x})}\left[ \mathrm {KL}\left[ \mathrm {Dir}\left( \mathbf {y}|\varvec{\alpha }_\mathrm {out}\right) \,\Vert \, p\left( \mathbf {y}|\mathbf {x}\right) \right] \right] + \mathbb {E}_{p_\mathrm {in}(\mathbf {x},\mathbf {y})}\left[ \mathrm {CE}\left( \mathbf {y}, \hat{\mathbf {y}}\right) \right] . \end{aligned}$$
(3)

The first term forces ID samples to produce sharp priors with a high concentration \(\varvec{\alpha }_\mathrm {in}\), computed as the product of smoothed labels and a fixed scale \(\alpha _0\). The second term forces OoD samples to produce a flat prior with \(\varvec{\alpha }_\mathrm {out}=\varvec{1}\), effectively maximizing the Dirichlet entropy, while the last one helps the convergence of the predictive distribution to the ground truth. We model pixel-wise Dirichlet distributions, approximate OoD samples with void pixels, and measure the Dirichlet differential entropy.
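For illustration, the differential entropy of the pixel-wise Dirichlet can be evaluated in closed form from the logits. The sketch below uses scipy and treats \(\varvec{\alpha } = \exp (\mathbf {z})\) as described above; the function name is hypothetical.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_entropy(logits):
    """Pixel-wise differential entropy of the Dirichlet prior whose log
    concentration parameters are the predicted logits (H x W x C).
    A flat prior (alpha close to 1) yields high entropy."""
    alpha = np.exp(logits)
    alpha0 = alpha.sum(axis=-1)
    k = alpha.shape[-1]
    log_beta = gammaln(alpha).sum(axis=-1) - gammaln(alpha0)   # log multivariate Beta
    return (log_beta
            + (alpha0 - k) * digamma(alpha0)
            - ((alpha - 1.0) * digamma(alpha)).sum(axis=-1))
```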

kNN embedding Different works (Papernot & McDaniel, 2018; Mandelbaum & Weinshall, 2017) estimate uncertainty using kNN statistics between inferred embedding vectors and their neighbors in the training set. They then compare the classes of the neighbors to the prediction, where discrepancies indicate uncertainty. In more detail, a given trained encoder maps a test image \(\mathbf {x'}\) to an embedding \(\mathbf {z'}_l=\mathbf {f}_l(\mathbf {x'})\) at layer l, and the training set \(\mathbf {X}\) to a set of neighbors \(\mathbf {Z}_l := \mathbf {f}_l(\mathbf {X})\). Intuitively, if \(\mathbf {x'}\) is OoD, then \(\mathbf {z'}\) is also differently distributed and has e.g. neighbors with different classes. Adapting these methods to semantic segmentation raises two issues: (i) The embedding of an intermediate layer of DeepLab is actually a map of embeddings, resulting in more than 10,000 kNN queries for each layer, which is computationally infeasible. We follow Mandelbaum and Weinshall (2017) and pick only one layer, selected using the FS Lost and Found validation set. (ii) The embedding map has a lower resolution than the input, and a given training embedding \(\mathbf {z}_l^{(i)}\) is therefore associated not with one, but with multiple output labels. As a baseline approximation, we link \(\mathbf {z}_l^{(i)}\) to all classes in the associated image patch. The relative density (Mandelbaum & Weinshall, 2017) is then:

$$\begin{aligned} D(\mathbf {z'}) = \frac{ \sum \limits _{i \in K, c' = c_i} \exp \left( - \frac{\mathbf {z'}\mathbf {z}^{(i)}}{|\mathbf {z'}|\, |\mathbf {z}^{(i)}|}\right) }{ \sum \limits _{i \in K} \exp \left( - \frac{\mathbf {z'}\mathbf {z}^{(i)}}{|\mathbf {z'}|\, |\mathbf {z}^{(i)}|}\right) }. \end{aligned}$$
(4)

Here, \(c_i\) is the class of \(\mathbf {z}^{(i)}\) and \(c'\) is the class of \(\mathbf {z'}\) in the downsampled prediction. In contrast to Mandelbaum and Weinshall (2017), we found that the cosine similarity from Papernot and McDaniel (2018) works well without additional losses. Finally, we upsample the density of the feature map to the input size, assigning each pixel a density value.

As the class association is unclear for encoder-decoder architectures, we also evaluate the density estimation with k neighbors independent of the class:

$$\begin{aligned} D(\mathbf {z'}) = \sum \limits _{i \in K} \exp \left( - \frac{\mathbf {z'}\mathbf {z}^{(i)}}{|\mathbf {z'}|\, |\mathbf {z}^{(i)}|}\right) . \end{aligned}$$
(5)

This assumes that an OoD sample \(\mathbf {x'}\), with a low density w.r.t. \(\mathbf {X}\), should translate into \(\mathbf {z'}\) with a low density w.r.t. \(\mathbf {Z}_l\).
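A hypothetical numpy sketch of the class-independent variant in Eq. (5) is given below; the class-conditional relative density of Eq. (4) restricts the numerator to neighbours sharing the predicted class.

```python
import numpy as np

def knn_density(query_emb, train_embs, k=50):
    """Class-independent kNN density (Eq. 5) of one embedding vector z'
    w.r.t. a matrix of training embeddings (N x D), using the cosine-based
    kernel as written in Eq. (5). `k` is an illustrative choice."""
    cos = train_embs @ query_emb / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    nearest = np.sort(cos)[-k:]         # the k most similar training embeddings
    return np.exp(-nearest).sum()       # kernel exactly as in Eq. (5)
```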

4.2 Learned Embedding Density

We now introduce a novel approach that takes inspiration from density estimation methods while greatly improving their scalability and flexibility.

Table 1 Benchmark results

Density estimation using kNN has two weaknesses. First, the estimation is a very coarse isotropic approximation, while the distribution in feature space might be significantly more complex. Second, it requires storing the embeddings of the entire training set and running a large number of nearest-neighbor (NN) searches, both of which are costly, especially for large input images. On the other hand, recent works (Choi et al., 2018; Nalisnick et al., 2019) on OoD detection leverage more complex generative models, such as normalizing flows (Dinh et al., 2017; Kingma & Dhariwal, 2018; Dinh et al., 2014), to directly estimate the density of the input sample \(\mathbf {x}\). This is however not directly applicable to our problem, as (i) learning generative models of images that can capture the entire complexity of e.g. urban scenes is still an open problem; and (ii) the pixel-wise density required here would have to be conditioned on a very (ideally infinitely) large context, which is computationally intractable.

Our approach mitigates these issues by learning the density of \(\mathbf {z}\). We start with a training set \(\mathbf {X}\) drawn from the unknown true distribution \(\mathbf {x} \sim p^*(\mathbf {x})\), and corresponding embeddings \(\mathbf {Z}_l\). A normalizing flow with parameters \(\varvec{\theta }\) is trained to approximate \(p^*(\mathbf {z}_l)\) by minimizing the negative log-likelihood (NLL) over all training embeddings in \(\mathbf {Z}_l\):

$$\begin{aligned} \mathcal {L}(\mathbf {Z}_l) = -\frac{1}{|\mathbf {Z}_l|} \sum _i \log p_{\varvec{\theta }}(\mathbf {z}_l^{(i)}). \end{aligned}$$
(6)

The flow is composed of a bijective function \(\mathbf {g}_{\varvec{\theta }}\) that maps an embedding \(\mathbf {z}_l\) to a latent vector \(\varvec{\eta }\) of identical dimensionality and with Gaussian prior \(p(\varvec{\eta }) = \mathcal N(\varvec{\eta };0,\mathbf {I})\). Its log-likelihood is then expressed as

$$\begin{aligned} \log p_{\varvec{\theta }}(\mathbf {z}_l) = \log p(\varvec{\eta }) + \log \left| \det \left( \frac{d\mathbf {g}_{\varvec{\theta }}}{d\mathbf {z}}\right) \right| , \end{aligned}$$
(7)

and can be efficiently evaluated for some constrained \(\mathbf {g}_{\varvec{\theta }}\). At test time, we compute the embedding map of an input image, and estimate the NLL of each of its embeddings. In our experiments, we use the Real-NVP bijector (Dinh et al., 2017), composed of a succession of affine coupling layers, batch normalizations, and random permutations.
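To illustrate the mechanics of Eqs. (6) and (7), the sketch below evaluates the log-likelihood of embeddings under a single affine coupling layer with hypothetical, untrained coupling networks; the benchmark implementation instead trains a full Real-NVP flow.

```python
import numpy as np

def affine_coupling_forward(z, shift_net, log_scale_net):
    """One Real-NVP-style affine coupling: the first half of the dimensions
    passes through unchanged and parameterises an affine transform of the
    second half. Returns the latent vector and log |det Jacobian|."""
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    shift, log_scale = shift_net(z1), log_scale_net(z1)
    eta = np.concatenate([z1, z2 * np.exp(log_scale) + shift], axis=-1)
    return eta, log_scale.sum(axis=-1)

def log_likelihood(z, shift_net, log_scale_net):
    """Eq. (7): log p(z) = log N(eta; 0, I) + log |det dg/dz| for one layer."""
    eta, log_det = affine_coupling_forward(z, shift_net, log_scale_net)
    log_prior = -0.5 * (eta ** 2).sum(axis=-1) - 0.5 * eta.shape[-1] * np.log(2 * np.pi)
    return log_prior + log_det

# Toy usage with random (untrained) coupling networks on 16-dim embeddings:
rng = np.random.default_rng(0)
w_s, w_t = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
nll = -log_likelihood(rng.normal(size=(5, 16)),
                      shift_net=lambda x: x @ w_t,
                      log_scale_net=lambda x: np.tanh(x @ w_s))
```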

The benefits of this method are the following: (i) A normalizing flow can learn more complex distributions than the simple kNN kernel or mixture of Gaussians used by Lee et al. (2018), where each embedding requires a class label, which is not available here; (ii) Features follow a simpler distribution than the input images, and can thus be correctly fit with simpler flows and shorter training times; (iii) The only hyperparameters are related to the architecture and the training of the flow, and can be cross-validated with the NLL of ID data without any OoD data; (iv) The training embeddings are efficiently summarized in the weights of the generative model with a very low memory footprint.

Input preprocessing (Liang et al., 2018) can be trivially applied to our approach. Since the NLL estimator is an end-to-end network, we can compute the gradients of the average NLL w.r.t. the input image by backpropagating through the flow and the encoder.
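A minimal TensorFlow sketch of this preprocessing step is shown below, with hypothetical stand-ins for the encoder and the flow NLL; only the gradient-sign perturbation itself follows Liang et al. (2018).

```python
import tensorflow as tf

# Hypothetical stand-ins; in the benchmark these are the trained DeepLab
# encoder and the learned flow NLL, both differentiable end-to-end.
encoder = tf.keras.Sequential([tf.keras.layers.Conv2D(8, 3, padding="same")])
def embedding_nll(z):
    return 0.5 * tf.reduce_sum(z ** 2, axis=-1)     # placeholder NLL

def preprocess_input(image, epsilon=1e-3):
    """FGSM-style perturbation: nudge the input in the direction that lowers
    the average NLL before re-scoring it."""
    x = tf.convert_to_tensor(image[None], tf.float32)   # add batch dimension
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.reduce_mean(embedding_nll(encoder(x)))
    grad = tape.gradient(loss, x)
    return x - epsilon * tf.sign(grad)
```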

A flow ensemble can be built by training separate density estimators over different layers of the segmentation model, similar to Lee et al. (2018). However, the resulting NLL estimates cannot be directly aggregated as is, because the different embedding distributions have varying dispersions and dimensions, and thus densities with very different scales. We propose to normalize the NLL \(N(\mathbf {z}_l)\) of a given embedding by the average NLL of the training features for that layer:

$$\begin{aligned} \bar{N}(\mathbf {z}_l) = N(\mathbf {z}_l) - \mathcal {L}(\mathbf {Z}_l). \end{aligned}$$
(8)

This is in fact a Monte Carlo (MC) approximation of the differential entropy of the flow, which is intractable. In the ideal case of a multivariate Gaussian, \(\bar{N}\) corresponds to the Mahalanobis distance used by Lee et al. (2018). We can then aggregate the normalized, resized scores over different layers. We experiment with two strategies: (i) Using the minimum detects a pixel as OoD only if it has low likelihood across all layers, thus accounting for areas in the feature space that are in-distribution but contain only a few training points; (ii) Following Lee et al. (2018), taking a weighted average, with weights given by a logistic regression fit on the FS Lost and Found validation set, captures the interaction between the layers.
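A hypothetical sketch of the normalization of Eq. (8) and the two aggregation strategies, assuming per-layer NLL maps already resized to the input resolution and scikit-learn for the logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def normalize_nll(nll_maps, train_mean_nlls):
    """Eq. (8): subtract each layer's average training NLL so that scores
    from different layers become comparable. `nll_maps` is a list of per-layer
    maps already resized to the input resolution."""
    return [nll - mean for nll, mean in zip(nll_maps, train_mean_nlls)]

def aggregate_min(normalized):
    # A pixel is flagged only if it is unlikely under every layer's flow.
    return np.min(np.stack(normalized, axis=-1), axis=-1)

def fit_weighted_average(normalized, ood_labels):
    """Logistic regression over per-layer scores, fit on the (small)
    FS Lost and Found validation set as described above."""
    features = np.stack([n.ravel() for n in normalized], axis=1)
    clf = LogisticRegression().fit(features, ood_labels.ravel())
    return clf   # clf.decision_function(features) gives the fused score
```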

4.3 Submitted Methods

The following methods have been submitted to our benchmark since it went online in August 2019. They were not implemented or trained by us, but we include an overview since they are part of the benchmark results.

An outlier head can be added in a multi-task fashion to many semantic segmentation architectures. Bevandic et al. (2019) train the head in a supervised fashion on both ID and OoD data samples, jointly with the segmentation training. The outlier detection head then returns a pixel-wise anomaly score. Three variants of this method were submitted; their exact descriptions are under submission for publication.

Image resynthesis uses reconstruction to estimate the fit of an input to the training data distribution of a generative model. While auto-encoders as described in Sect. 2 scale poorly to the level of detail in urban driving, good results have been achieved with generative adversarial networks (Wang et al., 2018; Isola et al., 2017) that synthesize driving scenes from semantic segmentation. Lis et al. (2019) use such a method to find outliers by comparing the original and resynthesized image, where they train the comparison on flipped semantic labels in the ID data and therefore do not require outliers during training. While the original work (Lis et al., 2019) experimented with lower-resolution segmentation data, Di Biase et al. (2021) submitted an adapted, scaled-up model.

Synboost is a modular approach that combines introspective uncertainties and input reconstruction into a pixel-wise dissimilarity score. Further details are described in Di Biase et al. (2021).

Fig. 4

Performance evolution over the different iterations of the FS Web dataset. We only plot the best-performing variant of each method. Methods that train on OoD data are plotted with dashed lines. Notable changes are the better blending method in June 2019 and the inclusion of blended ID objects in January 2020, which changed the data balance.

5 Discussion of Results

We show in Table 1 the results of our benchmark as of December 2020 for the aforementioned datasets and methods. Qualitative examples of all methods are shown in Fig. 5.

Fig. 5

Successful and failed examples for all methods on the Fishyscapes Lost and Found dataset. Input images overlaid with the evaluation labels are on the left, predicted anomaly scores on the right of each example pair. For every method, we show the best variant. The red circles highlight anomalies that are missed by the method or indistinguishable from noise.

Softmax confidence Confirming findings on simpler tasks (Lee et al., 2018), the softmax confidence is not a reliable score for anomaly detection. While training with OoD data clearly improves softmax-based detection, it is not much better than Bayesian DeepLab, which does not require such data.

Difference between datasets For most methods, there is a clear performance gap between the data from Lost and Found and the other datasets. We attribute this to two factors. First, the dataset contains many images with only very small objects, as indicated by the AP of the random classifier, which equals the fraction of anomalous pixels. Second, the qualitative examples show the more challenging nature of the Lost and Found dataset, e.g. with false positives for the void classifier or the outlier head, and cases where small anomalous objects are not detected at all, e.g. by Bayesian DeepLab or the softmax entropy.

We further investigate the results on FS Web over time in Fig. 4. While most methods follow overall trends that can be attributed to the difficulty of the individual objects or differences in data balance, it becomes clear that (i) embedding-based methods were picking up blending artifacts in FS Web March 2019, and (ii) Dirichlet DeepLab performs very inconsistently. (i) appears to be fixed with the advanced blending from June 2019, since the introduction of blended ID objects did not have any effect on embedding-based methods. (ii) could indicate a degree of overfitting to specific object types, because Dirichlet DeepLab is trained on OoD data.

Semantic segmentation accuracy The data in Table 1 illustrates a trade-off between anomaly detection and segmentation performance. Methods like Bayesian DeepLab or the Outlier Head are consistently among the best methods on all datasets, but need to train with special losses that reduce the segmentation accuracy by up to 10%. If segmentation accuracy is important, methods that do not require any retraining are particularly interesting.

Supervision with OoD data appears to be important for good anomaly detection. On every dataset, the best method requires OoD data and is at least 38% better than any ‘unsupervised’ method. While training with OoD data can in principle lead to overfitting to specific objects, the results on FS Web, which was designed specifically to resemble open-world settings, show that the Outlier Head or Dissimilarity Ensemble are very robust to diverse anomalies.

We however want to emphasize that anomaly detection and uncertainty estimation are very different principles. Our benchmark therefore serves the dual purpose of finding either the best anomaly segmentation method or well-scalable uncertainty estimates that are simply tested on the proxy task of anomaly detection. Comparing Bayesian DeepLab and the void classifier shows that good uncertainty estimation methods can even compete with some supervised methods, but so far not with specifically designed anomaly segmentation methods.

Inference time differs significantly between methods. Methods can be broadly sorted into two categories: the first does a single pass through a (sometimes modified) DeepLabv3+ architecture, while the second applies additional processing on top of this forward pass. Our measurements show that methods in the second category have up to two orders of magnitude higher inference time. The only exception is the single-layer embedding density, whose inference time is comparable to single-pass methods. While nearly all methods were executed as optimised TensorFlow graphs, measurements still depend on implementation details, and possible parallelization is limited by GPU memory constraints. For example, the difference between softmax max-prob, softmax entropy, and Dirichlet entropy can only be explained by inefficiencies in the softmax entropy implementation that cause a difference of more than 0.2 s.

Challenges in method adaptation The results reveal that some methods cannot be easily adapted to semantic segmentation. For example, the retraining required by special losses can impair the segmentation performance, and we found that these losses (e.g. for Dirichlet DeepLab) were often unstable during training or did not converge. Other challenges arise from the complex network structures, which complicate the translation of class-based embedding methods such as deep k-nearest neighbors (Papernot & McDaniel, 2018) to segmentation. This is illustrated by the performance of our simple implementation.

6 Conclusion

In this work, we introduced Fishyscapes, a benchmark for anomaly detection in semantic segmentation for urban driving. Comparing state-of-the-art methods on this complex task for the first time, we draw multiple conclusions:

  • The softmax output from a standard classifier is a bad indicator for anomaly detection.

  • Most of the better performing methods required special losses that reduce the semantic segmentation accuracy.

  • Supervision of anomaly segmentation methods with OoD data consistently outperformed unsupervised methods even in open-world scenarios.

Overall, the methods compared in our benchmark so far leave a lot of room for improvement. To safely deploy semantic segmentation methods in autonomous cars, further research is required. As a public benchmark, Fishyscapes supports the evaluation of new methods on urban driving scenarios.