Wasserstein Dropout

Despite its importance for safe machine learning, uncertainty quantification for neural networks is far from being solved. State-of-the-art approaches to estimate neural uncertainties are often hybrid, combining parametric models with explicit or implicit (dropout-based) ensembling. We take another pathway and propose a novel approach to uncertainty quantification for regression tasks, Wasserstein dropout, that is purely non-parametric. Technically, it captures aleatoric uncertainty by means of dropout-based sub-network distributions. This is accomplished by a new objective which minimizes the Wasserstein distance between the label distribution and the model distribution. An extensive empirical analysis shows that Wasserstein dropout outperforms state-of-the-art methods, on vanilla test data as well as under distributional shift, in terms of producing more accurate and stable uncertainty estimates.


Introduction
Having attracted great attention in both academia and the digital economy, deep neural networks (DNNs, Goodfellow et al. (2016)) are about to become vital components of safety-critical applications. Examples are autonomous driving (Bojarski et al., 2016; Pomerleau, 1988) or medical diagnostics (Liu et al., 2014), where prediction errors potentially put humans at risk. These systems require methods that are robust not only under lab conditions (e.g., i.i.d. data sampling), but also under continuous domain shifts. Besides shifts in the data, the data distribution itself poses further challenges. Critical situations are (fortunately) rare and thus strongly under-represented in datasets. Despite their rareness, these critical situations have a significant impact on the safety of operations. This calls for comprehensive self-assessment capabilities of DNNs, and recent uncertainty mechanisms can be seen as a step in that direction.
While a variety of approaches to model uncertainty of DNN predictions in regression tasks has been established, stable uncertainty quantification is still an open problem. Widely used techniques like Kendall and Gal (2017) and Lakshminarayanan et al. (2017) combine parametric and non-parametric (ensembling-based) mechanisms to account for aleatoric uncertainty (data noise) and epistemic uncertainty (model weight uncertainty). The employed parametric mechanisms represent uncertainty estimates by dedicated network output variables, which are often interpreted as variance parameters of Gaussian distributions. These modeling techniques are sometimes also referred to as "direct modeling" (Feng et al., 2020).
Figure 1: Wasserstein dropout (left column) employs sub-networks to model aleatoric uncertainty, i.e. the heterogeneous noise of (in this case, toy) datasets is reflected by the sub-network distributions of the trained models. This is in contrast to other uncertainty methods like MC dropout (right column) that use sub-network distributions to model epistemic uncertainty. This type of uncertainty is small after training a model on the densely sampled toy datasets and consequently MC dropout's sub-network distributions are significantly more narrow compared to Wasserstein dropout. The ground truth data is shown in blue. Each gray line represents the outputs of one of 500 random sub-networks that are obtained by applying dropout-based sampling to the trained full network. For details on the data sets ('toy-hf', 'toy-noise'), the neural architecture and the uncertainty methods please refer to section 4 and references therein.
In this work, we take a different approach and propose to model (aleatoric) uncertainty in DNNs in a novel, fully non-parametric way. We introduce Wasserstein dropout (W-dropout) that is designed to capture heteroscedastic (i.e. input-dependent) data noise by means of its sub-network distribution (see Fig. 1). It builds on the idea of matching the network output distribution, resulting from randomly dropping neurons, to the (factual or implicit) data distribution by minimizing the Wasserstein distance.
In detail, we contribute
• by deriving a novel and surprisingly simple Wasserstein-based learning objective for sub-networks that simultaneously optimizes task performance and uncertainty quality,
• by conducting an extensive empirical evaluation where W-dropout outperforms state-of-the-art uncertainty techniques w.r.t. various benchmark metrics, not only in-data but also under data shifts,
• and by introducing two novel uncertainty measures: a non-saturating calibration score and a measure for distributional tails that allows analyzing worst-case scenarios w.r.t. uncertainty quality.
The remainder of the paper is organized as follows: first, we present related work on uncertainty estimation in neural networks in section 2. Next, Wasserstein dropout is introduced in section 3. We study the uncertainties induced by Wasserstein dropout on various datasets in section 4, paying special attention to safety-relevant evaluation schemes and metrics. An outlook in section 5 concludes the paper.

Related work
Approaches to estimate predictive uncertainties can be broadly categorized into three groups: Bayesian approximations, ensemble approaches and parametric models.
Monte Carlo dropout (Gal and Ghahramani, 2016) is a prominent representative of the first group. It offers a Bayesian motivation, conceptual simplicity and scalability to application-size neural networks (NNs). This combination distinguishes MC dropout from other Bayesian neural network (BNN) approximations like in Blundell et al. (2015) and Ritter et al. (2018). A computationally more efficient version of MC dropout is one-layer or last-layer dropout (see e.g. Kendall and Gal (2017)). Alternatively, analytical moment propagation allows sampling-free MC-dropout inference at the price of additional approximations (e.g. Postels et al. (2019)). Further extensions of MC dropout target tuned performance by learning layer-specific drop rates using Concrete distributions (Gal et al., 2017), the integration of aleatoric uncertainty using a parametric approach (Kendall and Gal, 2017), and input-dependent dropout distributions (Fan et al., 2021). Note that dropout training is also used, independently of any uncertainty context, for better model generalization (Srivastava et al., 2014). An alternative sampling-based approach is SWAG, which constructs a Gaussian model weight distribution from the (last segment of the) training trajectory (Maddox et al., 2019).
Ensembles of neural networks, so-called deep ensembles (Lakshminarayanan et al., 2017), are another popular approach to uncertainty modeling. Comparative studies of uncertainty mechanisms (Gustafsson et al., 2020; Snoek et al., 2019) highlight their advantageous uncertainty quality, making deep ensembles a state-of-the-art method. Fort et al. (2019) argue that ensembles capture the multi-modality of loss landscapes, thus yielding potentially more diverse sets of solutions. When used in practice, these ensembles additionally include parametric uncertainty prediction for each of their members.
The third group comprises the aforementioned parametric modeling approaches that extend point estimations by adding a model output that is interpreted as variance or covariance (Heskes, 1996; Nix and Weigend, 1994). Typically, these approaches optimize a (Gaussian) negative log-likelihood (NLL, Nix and Weigend (1994)) and can be easily integrated with other approaches; for a review see Khosravi et al. (2011). A more recent representative of this group is, e.g., deep evidential regression (Amini et al., 2020), which places a prior distribution on Gaussian parameters. A closely related model class is deep kernel learning. It approaches uncertainty modeling by combining NNs and Gaussian processes (GPs) in various ways, e.g., via an additional layer (Iwata and Ghahramani, 2017; Wilson et al., 2016), by using networks as GP kernels (Garnelo et al., 2018) or by matching NN residuals with a GP (Qiu et al., 2020).
In the context of object detection, the number of applicable uncertainty methods is limited by the complexity of the employed NNs. Nonetheless, several variants can be encountered. For instance, MC dropout, see e.g. Bhattacharyya et al. (2018) or Miller et al. (2018), or parametric approaches, see He et al. (2019), can scale to network sizes relevant for such applications. Hall et al. (2020) stress the importance of uncertainty estimation for bounding box detection.
The quality of uncertainties is typically evaluated using negative log-likelihood (Blei and Jordan, 2006; Gal and Ghahramani, 2016; Walker et al., 2016), expected calibration error (ECE, Naeini et al. (2015); Snoek et al. (2019)) and its variants, and by considering correlations between uncertainty estimates and model errors, e.g., area under the sparsification error curve (AUSE, Ilg et al. (2018)) for image tasks. Moreover, it is common to study how useful uncertainty estimates are for solving auxiliary tasks such as out-of-distribution classification (Lakshminarayanan et al., 2017) or robustness w.r.t. adversarial attacks. An alternative approach is the investigation of qualitative uncertainty behaviors: Kendall and Gal (2017) check if the epistemic uncertainty decreases when increasing the training set and Wirges et al. (2019) study how the level of uncertainty depends on the distance of the object to a car for some 3D environment regression task.

Wasserstein dropout
Before we lay out our dropout-based approach to modeling aleatoric uncertainty, we analyze some central properties of Monte Carlo dropout. The latter also employs sub-networks, however, for the purpose of modeling epistemic uncertainty (Gal and Ghahramani, 2016): Given a neural network $f_\theta: \mathbb{R}^d \to \mathbb{R}^m$ with parameters $\theta$, MC dropout samples sub-networks $f_{\tilde\theta}$ by randomly dropping nodes from the main model $f_\theta$, yielding for each input $x_i$ a distribution $D_{\tilde\theta}(x_i)$ over network predictions. During MC dropout inference the final prediction is given by the mean of a sample from $D_{\tilde\theta}(x_i)$, while the uncertainty associated with this prediction can be estimated as a sum of its variance and a constant uncertainty offset. The value of the latter term requires dataset-specific optimization. During MC dropout training, minimizing the objective function, e.g., the mean squared error (MSE), shifts all sub-network predictions towards the same training targets. For a more formal explanation of this behavior, and without loss of generality, let $f_\theta$ be a NN with one-dimensional output. The expected MSE for a training sample $(x_i, y_i)$ under the model's output distribution $D_{\tilde\theta}(x_i)$ is given by
$$\mathbb{E}_{\tilde\theta}\left[(f_{\tilde\theta}(x_i) - y_i)^2\right] = (\mu_{\tilde\theta}(x_i) - y_i)^2 + \sigma^2_{\tilde\theta}(x_i) \tag{1}$$
with sub-network mean $\mu_{\tilde\theta}(x_i) = \mathbb{E}_{\tilde\theta}[f_{\tilde\theta}(x_i)]$ and variance $\sigma^2_{\tilde\theta}(x_i) = \mathbb{E}_{\tilde\theta}[f_{\tilde\theta}(x_i)^2] - \mathbb{E}_{\tilde\theta}[f_{\tilde\theta}(x_i)]^2$. Therefore, training simultaneously minimizes the squared error between sub-network mean $\mu_{\tilde\theta}(x_i)$ and target $y_i$ as well as the variance $\sigma^2_{\tilde\theta}(x_i)$.
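The split of the expected MSE into a squared error of the sub-network mean and a sub-network variance follows from the standard bias-variance argument. The short derivation below spells out the intermediate step; for readability it abbreviates $f = f_{\tilde\theta}(x_i)$, $\mu = \mu_{\tilde\theta}(x_i)$ and $\sigma^2 = \sigma^2_{\tilde\theta}(x_i)$, which is shorthand introduced here only for illustration.

```latex
\begin{align*}
\mathbb{E}_{\tilde\theta}\big[(f - y_i)^2\big]
  &= \mathbb{E}_{\tilde\theta}\big[\big((f - \mu) + (\mu - y_i)\big)^2\big] \\
  &= \mathbb{E}_{\tilde\theta}\big[(f - \mu)^2\big]
     + 2\,(\mu - y_i)\,\underbrace{\mathbb{E}_{\tilde\theta}[f - \mu]}_{=\,0}
     + (\mu - y_i)^2 \\
  &= \sigma^2 + (\mu - y_i)^2 .
\end{align*}
```

The cross term vanishes because $\mu$ is by definition the expectation of $f$ over sub-networks, so both remaining terms are non-negative and are driven towards zero by MSE training.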
As we, in contrast, seek to employ sub-networks to model aleatoric uncertainty, minimizing the variance over the sub-networks is not desirable for our purpose. Instead, we aim at explicitly fitting the sub-network variance $\sigma^2_{\tilde\theta}(x_i)$ to the input-dependent, i.e. heteroscedastic, data variance. That is to say, we not only match the mean values as in (1) but seek to match the entire data distribution $D_y(x_i)$ by means of the model's output distribution $D_{\tilde\theta}(x_i)$. This output distribution is induced by applying Bernoulli dropout to all activations of the network. To measure the distance between the two distributions $D_{\tilde\theta}(x_i)$ and $D_y(x_i)$, a squared 2-Wasserstein metric (Villani, 2008) is employed. As it is 'transport'-based, it can provide a training signal also for non-overlapping distributions and reduces to the "original" MSE loss for point masses, i.e., in the absence of aleatoric uncertainty. Assuming that both distributions $D_{\tilde\theta}(x_i)$ and $D_y(x_i)$ are Gaussian yields a compact analytical expression
$$\mathrm{WS}_2^2(x_i) = \mathrm{WS}_2^2\left[D_{\tilde\theta}(x_i), D_y(x_i)\right] = (\mu_{\tilde\theta}(x_i) - \mu_y(x_i))^2 + (\sigma_{\tilde\theta}(x_i) - \sigma_y(x_i))^2 \tag{2}$$
with $\mu_{\tilde\theta}(x_i) = \mathbb{E}_{\tilde\theta}[f_{\tilde\theta}(x_i)]$, $\sigma^2_{\tilde\theta}(x_i) = \mathbb{E}_{\tilde\theta}[(f_{\tilde\theta}(x_i) - \mu_{\tilde\theta}(x_i))^2]$, and $\mu_y$, $\sigma_y$ defined analogously w.r.t. the data distribution.
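In practice, however, (2) cannot be readily used as the distribution of $y$ given $x_i$ is typically not accessible. Instead, for a given, fixed value of $x_i$ from the training set only a single value of $y_i$ is known. Therefore, we take $y_i$ as a (rough) one-sample approximation of the mean $\mu_y(x_i)$, resulting in $\mu_y(x_i) \approx y_i$ and $\sigma^2_y(x_i) = \mathbb{E}_y[(y - \mu_y(x_i))^2] \approx \mathbb{E}_y[(y - y_i)^2]$. However, $\sigma^2_y(x_i)$ cannot be inferred from a single sample. Inspired by parametric bootstrapping (Dekking et al., 2005; Hastie et al., 2009), we therefore approximate the empirical data variance (for a given mean value $y_i$ and input $x_i$) with samples from our model, i.e., we approximate $\mathbb{E}_y[(y - y_i)^2]$ by
$$\mathbb{E}_{\tilde\theta}\left[(f_{\tilde\theta}(x_i) - y_i)^2\right] = (\mu_{\tilde\theta}(x_i) - y_i)^2 + \sigma^2_{\tilde\theta}(x_i).$$
Inserting our approximations $\mu_y(x_i) \approx y_i$ and $\sigma^2_y(x_i) \approx (\mu_{\tilde\theta}(x_i) - y_i)^2 + \sigma^2_{\tilde\theta}(x_i)$ into (2) yields the Wasserstein dropout loss (W-dropout) for a data point $(x_i, y_i)$ from the training distribution:
$$\mathrm{WS}_2^2(x_i) \approx (\mu_{\tilde\theta}(x_i) - y_i)^2 + \left[\sqrt{\sigma^2_{\tilde\theta}(x_i)} - \sqrt{(\mu_{\tilde\theta}(x_i) - y_i)^2 + \sigma^2_{\tilde\theta}(x_i)}\,\right]^2.$$
Considering a mini-batch of size $M$ instead of a single data point, we arrive at the optimization objective
$$\frac{1}{M} \sum_{i=1}^{M} \mathrm{WS}_2^2(x_i),$$
where $\mu_{\tilde\theta}(x_i)$ and $\sigma^2_{\tilde\theta}(x_i)$ in $\mathrm{WS}_2^2$ are approximated by empirical estimators using a sample size $L$. In contrast to MC dropout, we thereby require $L$ stochastic forward passes per data point during training (instead of one), while at inference the procedures are exactly the same.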
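The W-dropout loss described above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes that the L sub-network outputs for each input are already available as plain Python lists, that the biased (population) variance is used as the empirical estimator, and that the mini-batch objective is the average of the per-point losses. The function names `w_dropout_loss` and `w_dropout_batch_loss` are hypothetical.

```python
import math
from statistics import fmean, pvariance


def w_dropout_loss(sub_outputs, y):
    """Per-point W-dropout loss (illustrative sketch).

    sub_outputs: scalar predictions of L sampled sub-networks for one input x_i.
    y: the scalar training target y_i.
    """
    mu = fmean(sub_outputs)           # empirical sub-network mean
    var = pvariance(sub_outputs, mu)  # empirical sub-network variance (assumption: biased estimator)
    sq_err = (mu - y) ** 2            # squared error of the sub-network mean
    # Squared 2-Wasserstein distance between two 1D Gaussians, using the
    # approximations mu_y ~ y_i and sigma_y^2 ~ sq_err + var.
    return sq_err + (math.sqrt(var) - math.sqrt(sq_err + var)) ** 2


def w_dropout_batch_loss(batch_sub_outputs, targets):
    """Average the per-point loss over a mini-batch of size M."""
    return fmean(w_dropout_loss(o, y) for o, y in zip(batch_sub_outputs, targets))
```

In an actual training loop the same expression would be written in an automatic-differentiation framework so that gradients flow through both the empirical mean and the empirical variance; the pure-Python version only makes the arithmetic explicit.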
Besides the regression tasks considered here, our approach could be useful for other objectives which use or benefit from an underlying distribution, e.g., Dirichlet distributions to quantify uncertainty in classification, as discussed in the conclusion.

Experiments
We first outline the scope of our empirical study in subsection 4.1 and begin with experiments on illustrative and visualizable toy datasets in subsection 4.2. Next, we benchmark W-dropout on various 1D datasets (mostly from the UCI machine learning repository (Dua and Graff, 2017)) in subsection 4.3, considering both in-data and distribution-shift scenarios. In subsection 4.4, W-dropout is applied to the complex task of object detection using the compact SqueezeDet architecture (Wu et al., 2017).

Benchmark approaches and evaluation measures
In this subsection, we present the considered benchmark approaches (first paragraph) and evaluation measures for uncertainty modeling. Aside from established measures (second paragraph), we propose two novel uncertainty scores: an unbounded calibration measure and an uncertainty tail measure for the analysis of worst-case scenarios w.r.t. uncertainty quality (third and fourth paragraph). A brief overview of the technical setup (last paragraph) concludes the subsection.

Benchmark approaches
We compare W-dropout networks to archetypes of uncertainty modeling, namely approximate Bayesian techniques, parametric uncertainty, and ensembling approaches. From the first group, we pick MC dropout (abbreviated as MC, Gal and Ghahramani (2016)) and Concrete dropout (CON-MC, Gal et al. (2017)). The variance of MC is given as the sample variance plus a dataset-specific regularization term. The networks employing these methods do not exhibit parametric uncertainty outputs (see below). We additionally consider SWA-Gaussian (SWAG, Maddox et al. (2019)), which samples from a Gaussian model weight distribution that is constructed based on model parameter configurations along the (final segment of the) training trajectory. While these sampling-based approaches integrate uncertainty estimation into the structure of the entire network, parametric approaches model the variance directly as the output of the neural network (Nix and Weigend, 1994). Such networks typically output mean and variance of a Gaussian distribution $(\mu, \sigma^2)$ and are trained by likelihood maximization. This approach is denoted as PU for parametric uncertainty. Ensembles of PU-networks (Lakshminarayanan et al., 2017), referred to as deep ensembles, are a widely used state-of-the-art method for uncertainty estimation (Snoek et al., 2019). Deep evidential regression (PU-EV, Amini et al. (2020)) extends this parametric approach and considers prior distributions over $\mu$ and $\sigma$. Kendall and Gal (2017) consider drawing multiple dropout samples from a parametric uncertainty model and aggregating multiple predictions for $\mu$ and $\sigma$. We denote this approach PU-MC. Moreover, we consider ensembles of non-parametric standard networks. We refer to the latter ones as DEs, while we call those additionally using PU-based uncertainty PU-DEs. All considered types of networks provide estimates $(\mu_i, \sigma_i)$ where $\sigma_i$ is obtained either as direct network output (PU, PU-EV), by sampling (MC, CON-MC, SWAG, W-dropout) or as an ensemble aggregate (DE, PU-DE). For PU-MC, a combination of parametric output and sampling is employed. Throughout this section, we subsume PU, PU-EV, PU-DE and PU-MC as "parametric methods".

Standard evaluation measures
In all experiments we evaluate both regression performance and uncertainty quality. Regression performance is quantified by the root-mean-square error $\sqrt{\frac{1}{N}\sum_i (\mu_i - y_i)^2}$ (RMSE, Bishop (2006)). Another established metric in the uncertainty community is the (Gaussian) negative log-likelihood (NLL), $\frac{1}{N}\sum_i \left[\log \sigma_i + (\mu_i - y_i)^2/(2\sigma_i^2) + c\right]$, a hybrid between performance and uncertainty measure (Gneiting and Raftery, 2007), see appendix C.2 for a discussion. Throughout the paper, we ignore the constant $c = \log\sqrt{2\pi}$ of the NLL. The expected calibration error (ECE, Kuleshov et al. (2018)) in contrast is not biased towards well-performing models and in that sense a pure uncertainty measure. It reads $\mathrm{ECE} = \sum_{j=1}^{B} |p_j - 1/B|$ for $B$ equally spaced bins in quantile space and $p_j = |\{r_i \mid q_j \le q(r_i) < q_{j+1}\}|/N$ the empirical frequency of data points falling into such a bin. The normalized prediction residuals $r_i$ are defined as $r_i = (\mu_i - y_i)/\sigma_i$. Further, $q$ is the cdf of the standard normal distribution $N(0, 1)$ and $[q_j, q_{j+1})$ are equally spaced intervals on $[0, 1]$, i.e., $q_j = (j - 1)/B$.
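The three standard measures can be written down compactly. The sketch below follows the definitions given in this paragraph (NLL without the constant c, ECE over B equally spaced bins in quantile space); the function names and the default of ten bins are assumptions made for illustration and are not taken from the paper's code.

```python
import math
from statistics import NormalDist


def rmse(mu, y):
    """Root-mean-square error of the mean predictions."""
    return math.sqrt(sum((m - t) ** 2 for m, t in zip(mu, y)) / len(y))


def gaussian_nll(mu, sigma, y):
    """Mean Gaussian negative log-likelihood, omitting the constant log(sqrt(2*pi))."""
    terms = (math.log(s) + (m - t) ** 2 / (2 * s * s) for m, s, t in zip(mu, sigma, y))
    return sum(terms) / len(y)


def ece(mu, sigma, y, n_bins=10):
    """Expected calibration error over n_bins equally spaced quantile bins."""
    cdf = NormalDist().cdf  # q: cdf of the standard normal distribution
    counts = [0] * n_bins
    for m, s, t in zip(mu, sigma, y):
        quantile = cdf((m - t) / s)                   # q(r_i) for the normalized residual r_i
        j = min(int(quantile * n_bins), n_bins - 1)   # bin index; quantile == 1.0 goes to the last bin
        counts[j] += 1
    n = len(y)
    return sum(abs(c / n - 1 / n_bins) for c in counts)
```

For perfectly calibrated Gaussian uncertainties every bin receives roughly a fraction 1/B of the residuals and the ECE approaches zero, whereas residuals concentrated in a single bin give the largest possible value.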

An unbounded uncertainty calibration measure
A desirable property for uncertainty measures is a signal that grows (preferentially linearly) with the misalignment between predicted and ideal uncertainty estimates, especially when handling strongly deviating uncertainty estimates. As the Wasserstein metric fulfils this property, we not only use it for model optimization but propose to consider the 1-Wasserstein distance of normalized prediction residuals (WS) as a complementary uncertainty evaluation measure. It is generally applicable and by no means restricted to W-dropout networks. In detail, the 1-Wasserstein distance (Villani, 2008), also known as earth mover's distance (Rubner et al., 1998), is a transport-based measure, denoted by $d_{\mathrm{WS}}$, between two probability densities, with Wasserstein GANs (Arjovsky et al., 2017) as its most prominent application in machine learning. In the context of uncertainty estimation, we use the Wasserstein distance to measure deviations of uncertainty estimates $\{r_i\}_i$ from ideal (Gaussian) calibration that is given if $y_i \sim N(\mu_i, \sigma_i)$ with accompanying normalized residuals of $r_i \sim N(0, 1)$, i.e. we calculate $d_{\mathrm{WS}}(\{r_i\}_i, N(0, 1))$. Like ECE, this is a pure uncertainty measure. However, it is not based on quantiles but directly on normalized residuals and can therefore resolve deviations on all scales. For example, two strongly ill-calibrated uncertainties would result in (almost) identical ECE values while WS would resolve this difference in magnitude. Let us compare ECE and WS more systematically: we consider normal distributions $N(\mu, 1)$ and $N(0, \sigma)$ (see Fig. 2) that are shifted (top left panel, dark blue) and squeezed/stretched (bottom left panel, dark blue), respectively. Their deviations from the ideal normalized residual distribution (the standard normal, red) are measured in terms of both ECE (r.h.s., blue) and WS (r.h.s., orange). For large values of $|\mu|$ and $\sigma$, ECE is bounded while WS increases linearly, showing the better sensitivity of the latter towards strong deviations. For small values, $\sigma \to 0$, ECE takes its maximum value, WS a value of 1. In Fig. 3, we visualize these value pairs $(\mathrm{WS}(\sigma), \mathrm{ECE}(\sigma))$ (gray lines), i.e. $\sigma$ serves as curve parameter. The upper 'branch' corresponds to $0 < \sigma < 1$, the lower 'branch' to $\sigma > 1$. For comparison, the pairs (WS, ECE) of various networks trained on standard regression datasets are visualized (see subsection 4.3 for experimental details and results). They approximately follow the theoretical $\sigma$-curve, emphasizing that both under- and overestimating variance is of practical relevance. Because WS does not saturate for underestimated variance, a given WS value makes it easier to distinguish these two cases than a given ECE value. While one might rightfully argue that the higher sensitivity of WS leads to a certain susceptibility to potential outliers, this can be addressed by regularizing the normalized residuals or by filtering extreme outliers.

A novel uncertainty tail measure
We furthermore introduce a measure for distributional tails that allows analyzing worst-case scenarios w.r.t. uncertainty quality, thus reflecting safety considerations. Such potentially critical worst-case scenarios are signified by the above-mentioned outliers, where the locally predicted uncertainty strongly underestimates the actual model error. A better understanding of uncertainty estimates in these scenarios might allow determining lower bounds on the operation quality of safety-critical systems. For this, we consider normalized residuals $r_i = (\mu_i - y_i)/\sigma_i$ based on the prediction estimates $(\mu_i, \sigma_i)$ for a given data point $(x_i, y_i)$. As stated, we restrict our analysis to uncertainty estimates that underestimate model errors, i.e., $|r_i| \gg 1$. These cases might be more harmful than overly large uncertainties, $|r_i| \ll 1$, that likely trigger a conservative system behavior. We quantify uncertainty quality for worst-case scenarios as follows: for a given (test) dataset, the absolute normalized residuals $\{|r_i|\}_i$ are calculated. We determine the 99% quantile $q_{0.99}$ of this set and calculate the mean value over all $|r_i| > q_{0.99}$, the so-called expected tail loss at quantile 99% ($\mathrm{ETL}_{0.99}$, Rockafellar and Uryasev (2002)). The $\mathrm{ETL}_{0.99}$ thus measures the average uncertainty quality of the worst 1%.
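Both proposed scores operate on the normalized residuals only, which the following sketch makes concrete. It is a simplified illustration under stated assumptions rather than the evaluation code of the paper: the 1-Wasserstein distance to N(0, 1) is approximated by comparing the sorted residuals with standard-normal quantiles at the midpoints (i + 0.5)/n, and the expected tail loss is computed as the mean of the largest (1 - level) fraction of absolute residuals instead of thresholding at an interpolated quantile. The function names are hypothetical.

```python
from statistics import NormalDist, fmean


def ws_to_standard_normal(residuals):
    """Approximate 1-Wasserstein distance between the residuals and N(0, 1).

    In 1D the distance between two equally sized samples is the mean absolute
    difference of their sorted values; N(0, 1) is represented here by its
    quantiles at the midpoints (i + 0.5) / n (a discretization assumption).
    """
    n = len(residuals)
    inv_cdf = NormalDist().inv_cdf
    reference = [inv_cdf((i + 0.5) / n) for i in range(n)]
    return fmean(abs(r - q) for r, q in zip(sorted(residuals), reference))


def expected_tail_loss(residuals, level=0.99):
    """Mean absolute normalized residual over the worst (1 - level) fraction."""
    abs_r = sorted(abs(r) for r in residuals)
    n_tail = max(1, round((1.0 - level) * len(abs_r)))  # at least one tail point
    return fmean(abs_r[-n_tail:])
```

Shifting an ideally calibrated set of residuals by a constant increases `ws_to_standard_normal` by the size of that shift, which reflects the linear growth under strong miscalibration discussed above.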

Technical setup
For the first two parts we use almost identical setups of 2 hidden layers with ReLU activations, using 50 neurons per layer for the toy datasets and 100 for the 1D standard datasets. All dropout-based networks (MC, CON-MC, W-dropout) apply Bernoulli dropout to all hidden activations. For W-dropout networks, we sample L = 5 sub-networks in each optimization step; other values of L are considered in appendix B. On the smaller toy datasets, we afford L = 10. For MC and W-dropout, the drop rate is set to p = 0.1 (see appendix B for other values of p). The drop rate of CON-MC in contrast is learned during training and (mostly) takes values between p = 0.2 and p = 0.5. For ensemble methods (DE, PU-DE) we employ 5 networks. All NNs are optimized using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. Additionally, we apply standard normalization to the input and output features of all datasets to enable better comparability. The number of training epochs and cross validation runs depends on the dataset size. Further technical details on the networks, the training procedure, and the implementation of the uncertainty methods can be found in appendix A.1. In using a least squares regression, we make the standard assumption that errors follow a Gaussian distribution. This is reflected in the (standard) definitions of the above-named measures, i.e., all uncertainty measures quantify the set of outputs $\{(\mu_i, \sigma_i)\}$ relative to a Gaussian distribution.

Toy datasets
To illustrate qualitative behaviors of the different uncertainty techniques, we consider two R → R toy datasets. This benchmark puts a special focus on the handling of aleatoric heteroscedastic uncertainty. The first dataset is Gaussian white noise with an x-dependent amplitude, see first row of Fig. 4. The second dataset is a polynomial overlaid with a high-frequency, amplitude-modulated sine, see fourth row of Fig. 4. The explicit equations for the toy datasets used here can be found in appendix A.2. While the uncertainty in the first dataset ('toy-noise') is clearly visible, it is less obvious for the fully deterministic second dataset ('toy-hf'). There is an effective uncertainty due to the insufficient expressivity of the model though, as the shallow networks employed are empirically not able to fit (all) fluctuations of 'toy-hf' (see fifth row of Fig. 4). One might (rightfully) argue that this is a sign of insufficient model capacity. But in more realistic, e.g., higher dimensional and sparser datasets the distinction between true noise and complex information becomes exceedingly difficult to make and regularization is actively used to suppress the modeling of (ideally) undesired fluctuations. As the Nyquist-Shannon sampling theorem states, with limited data deterministic fluctuations above a cut-off frequency can no longer be resolved (Landau, 1967). They therefore become virtually indistinguishable from random noise.
The mean estimates of all uncertainty methods (second and fifth row in Fig. 4) look alike on both datasets. They approximate the noise mean and the polynomial, respectively. In the latter case, all methods rudimentarily fit some individual fluctuations. The variance estimation (third and sixth row in Fig. 4) in contrast reveals significant differences between the methods: MC dropout variants and other non-parametric ensembles are not capable of capturing heteroscedastic aleatoric uncertainty. This behavior of MC is to be expected as it was primarily introduced to account for model uncertainty. The non-parametric DE is effectively optimized in a similar fashion. In contrast, NLL-optimized PU networks have a home-turf advantage on these datasets since the parametric variance is explicitly optimized to account for the present heteroscedastic aleatoric uncertainty. W-dropout is the only non-parametric approach that accounts for the presence of this kind of uncertainty. While the results look similar, the underlying mechanisms are fundamentally different: explicit prediction of the uncertainty on the one hand, implicit modeling via distribution matching on the other. Accompanying quantitative evaluations can be found in appendix A.2. To collect further evidence that W-dropout approximates the ground truth uncertainty $\sigma_{\mathrm{true}}$ appropriately, we fit it to 'noisy line' toy datasets in appendix A.2. Both large and small $\sigma_{\mathrm{true}}$ values are correctly matched, indicating that W-dropout is not just adding an uncertainty offset but flexibly spreads/contracts its sub-networks as intended. In the following, we substantiate the corroborative results of W-dropout on toy data by an empirical study on 1D standard datasets and an application to a modern object detection network.

Standard 1D regression datasets
Next, we study standard regression datasets, extending the dataset selection in Gal and Ghahramani (2016) by adding four additional datasets: 'diabetes', 'abalone', 'california', and 'superconduct'. Table 8 in appendix A.3 provides details on dataset sources, preprocessing and basic statistics. Apart from train- and test-data results, we study regression performance and uncertainty quality under data shift. Such distributional changes and uncertainty quantification are closely linked since uncertainty estimates are rudimentary "self-assessment" mechanisms that help to judge model reliability. These judgements gain importance for model inputs that are structurally different from the training data.

Data splits
Natural candidates for such non-i.i.d. splits are splits along the main directions of data in input and output space, respectively. Here, we consider 1D regression tasks. Therefore, output-based splits are simply done on a scalar label variable (see Fig. 5, right). We call such a split label-based (for a comparable split, see, e.g., Foong et al. (2019)). In input space, the first component of a principal component analysis (PCA) provides a natural direction (see Fig. 5, left). Projecting the data points onto this first PCA-axis yields the scalar values the PCA-split is based on. Note that these projections are only considered for data splitting, they are not used for model training. Splitting data along such a direction in input or output space into, e.g., 10 equally large chunks creates 2 outer data chunks and 8 inner data chunks. Training a model on 9 of these chunks such that the remaining chunk for evaluation is an inner chunk is called data interpolation. If the remaining test chunk is an outer chunk, it is data extrapolation. For example, for labels running from 0 to 1, (label-based) extrapolation testing would consider only data with a label larger than 0.9, while training would be performed on the smaller label values. We introduce this distinction as extrapolation is expected to be considerably more difficult than 'bridging' between feature combinations that were seen during training. More general information on training and dataset-dependent modifications to the experimental setup are relegated to the technical appendix A.1.
The presented results are obtained as follows: for each of the 14 standard datasets, we calculate (for each uncertainty method) the per-dataset scores: RMSE, mean NLL, ECE and WS. To improve statistical significance, these scores are 5- or 10-fold cross-validated, i.e., averaged across the respective number of folds. Given the (fold-averaged) per-dataset scores for all 14 standard datasets, we calculate and visualize their mean and median values as well as quantile intervals (see Fig. 6 and Fig. 7). For high-level summaries of the results on in-data and out-of-data test sets please refer to Tab. 1 and Tab. 2, respectively. While the mean values characterize the average behavior of the uncertainty methods, the displayed 75% quantiles indicate how well methods perform on the more challenging datasets. A small 75% quantile value thus hints at consistent stability of an uncertainty mechanism across a variety of tasks.
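The label-based and PCA-based splits described in the 'Data splits' paragraph share the same chunking logic once each data point has been reduced to a scalar. The sketch below illustrates that logic only; it assumes the scalar scores (labels, or projections onto the first PCA axis) are computed beforehand, and the function name `chunk_split` and its return format are hypothetical.

```python
def chunk_split(scores, n_chunks=10, test_chunk=0):
    """Split data into train/test indices along a scalar direction.

    scores: one scalar per data point (a label or a first-PCA-axis projection).
    test_chunk: index of the held-out chunk after sorting by score.
    Returns (train_indices, test_indices, kind), where kind is
    "extrapolation" for an outer chunk and "interpolation" for an inner one.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    # Integer chunk boundaries producing (almost) equally large chunks.
    bounds = [(k * n) // n_chunks for k in range(n_chunks + 1)]
    lo, hi = bounds[test_chunk], bounds[test_chunk + 1]
    test = order[lo:hi]
    train = order[:lo] + order[hi:]
    kind = "extrapolation" if test_chunk in (0, n_chunks - 1) else "interpolation"
    return train, test, kind
```

With labels running from 0 to 1 and `test_chunk=9`, the held-out points are those with the largest labels, matching the extrapolation example given in the text; any of the chunks 1 to 8 yields an interpolation split.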

Regression quality
First, we consider regression performance, see Tab. 1 and the first two panels in the top row of Fig. 6. Averaging the RMSE values across the 14 datasets yields almost identical test results for all uncertainty methods (see Tab. 1). On training data (Fig. 6, first panel in top row) in contrast, we find the parametric methods to exhibit larger train data RMSEs, which could be due to NLL optimization favoring to adapt variance rather than mean. However, this regularizing NLL training comes along with a smaller generalization gap, leading to competitive test RMSEs (see Tab. 1 and the second panel in the top row of Fig. 6). W-dropout is on a par with the benchmark approaches, i.e. our optimization objective does not lead to degraded regression quality. Next, we investigate model performance under data shift, visualized in the third to sixth panel in the top row of Fig. 6. For interpolation setups (fourth and sixth panel), regression quality is comparable between all methods. As expected, performances under these data shifts are (slightly) worse compared to those on i.i.d. test sets. The more challenging extrapolation setups (third and [...]

Interestingly, both SWAG and W-dropout show a relatively broad range of ECE values on the various training datasets. This could be interpreted as a form of over-estimation of the present uncertainty, and for W-dropout this effect occurs on mostly smaller datasets with lower data variability. However, looking at the i.i.d. test results (Tab. 1 and second panel in the bottom row of Fig. 6) we find W-dropout to provide the lowest averaged ECE (Tab. 1), followed by the PU-based (implicit) ensembles of PU-DE and PU-MC. The calibration quality of W-dropout is moreover the most consistent one across the datasets, as can be seen from its small 75% quantile value (Fig. 6, second panel in bottom row).
Looking at the stability w.r.t. data shift, i.e., extra- and interpolation based on label-split or PCA-split, again W-dropout reaches the smallest calibration errors (followed by PU-DE and PU-MC, see Tab. 2). Regarding the 75% quantiles, W-dropout consistently provides one of the best results on all out-of-data (OOD) test sets.
Negative log-likelihoods For the unbounded NLL (see Tab. 1 and the top row of Fig. 7), the results are more widely distributed compared to the (bounded) ECE values. W-dropout reaches the smallest mean value on i.i.d. test sets, followed by MC and PU-MC (Tab. 1). The mean NLL value of PU is above the upper plot limit in Fig. 7 (second panel in the upper row), indicating a rather weak stability of this method. On PCA-interpolate and PCA-extrapolate test sets (Fig. 7, last two panels in the upper row), MC, PU-MC and W-dropout networks perform best. On label-interpolate and label-extrapolate test sets, only MC and W-dropout networks are in first place when considering average values, followed by PU-EV. The mean NLLs of many other approaches are above the upper plot limit. Averaging all these OOD results in Tab. 2, we find W-dropout to provide the overall smallest NLL values, narrowly followed by MC. Note that median results are not as widely spread and PU-DE, MC, PU-MC and W-dropout perform comparably well. These qualitative differences between mean and median behavior indicate that most methods perform poorly 'once in a while'. This is a noteworthy observation, as stability across a variety of data shifts and datasets can be seen as a crucial requirement for an uncertainty method. W-dropout models yield high stability in that sense w.r.t. NLL.
Wasserstein distances Studying Wasserstein distances, we again observe the smallest scores on test data for W-dropout, followed by PU-MC and PU-DE (see Tab. 1 and the second panel in the bottom row of Fig. 7).While PU provides the best WS value on training data, its generalization behavior is less stable: on test data, its mean and 75% quantile take high values beyond the plot range.Under data shift (Tab. 2 and third to sixth panel in bottom row of Fig. 7), W-dropout and MC are in the lead, CON-MC and DE follow on ranks three and four.On label-based data shifts, MC and W-dropout outperform all other methods by a significant margin when considering average values.As for NLL, we find the mean values for PU-DE and PU-MC to be significantly above their respective median values indicating again weaknesses w.r.t. the stability of parametric methods.Here as well, not only good average results, but also consistency over the datasets and splits, is a hallmark of Wasserstein dropout.
Epistemic uncertainty Summarizing these evaluations on 1D regression datasets, we find W-dropout to yield better and more stable uncertainty estimates than the state-of-the-art methods PU-DE and PU-MC. We moreover observe advantages for W-dropout under PCA- and label-based data shifts. These results suggest that W-dropout induces uncertainties which increase under data shift, i.e., it approximately models epistemic uncertainty. This conjecture is supported by Fig. 8, which visualizes the uncertainties of MC dropout (blue) and W-dropout (orange) for transitions from in-data to out-of-data. As expected, these shifts lead to increased (epistemic) uncertainty for MC dropout. The same holds true for W-dropout, which behaves highly similarly under data shift, indicating that it "inherits" this ability from MC dropout: both approaches match sub-networks to training data and these sub-networks "spread" when leaving the training data distribution. Since W-dropout models heteroscedastic, i.e. input-dependent, aleatoric uncertainty, we notice a higher variability of its uncertainties in Fig. 8 compared to the ones of MC dropout.
For further (visual) inspections of uncertainty quality, see the residual-uncertainty scatter plots in appendix A.4. A reflection on NLL and comparisons of the different uncertainty measures on 1D regression datasets can be found in appendix A.3.
Expected tail loss For both toy and standard regression datasets, we calculate the expected tail loss at the 99% quantile ($\mathrm{ETL}_{0.99}$) on test data. Doing this for all trained networks yields a total of 110 $\mathrm{ETL}_{0.99}$ values per uncertainty method when including cross-validation. As a tail measure, the $\mathrm{ETL}_{0.99}$ evaluates a specific aspect of the distribution of uncertainty estimates. Studying such a property is useful if the uncertainty estimate distribution as a whole is appropriate, as measured e.g. by the ECE. We thus restrict the $\mathrm{ETL}_{0.99}$ analysis to the three methods that provide the best ECE values, namely PU-MC, PU-DE and W-dropout. The mean and maximum values of their $\mathrm{ETL}_{0.99}$ are reported in Table 3. While none of these methods gets close to the ideal $\mathrm{ETL}_{0.99}$ of the desired $\mathcal{N}(0, 1)$ Gaussian, W-dropout networks exhibit significantly less pronounced tails and therefore higher stability compared to PU-MC and PU-DE. This holds true over all considered test sets. Deviations from standard normal increase from the i.i.d. train-test split over the PCA-based train-test split to the label-based one. We attribute the lower stability of PU-DE to the nature of the PU networks that compose the ensemble, although their inherent instability (see Table 9 in appendix A.3) is largely suppressed by ensembling. Considering the tail of the distribution of the prediction residuals $|r_i|$, however, reveals that regularization of PU by ensembling might not work in every single case. It is then unlikely that larger ensembles are able to fully cure this instability issue. Regularizing PU by applying dropout (PU-MC) leads to only mild improvement. W-dropout networks, in contrast, encode uncertainty into the structure of the entire network, thus yielding improved stability compared to parametric approaches. Further analysis shows that the large normalized residuals $r_i = (\mu_i - y_i)/\sigma_i$, which cause the large $\mathrm{ETL}_{0.99}$ values, correspond (on average) to large absolute errors $(\mu_i - y_i)$ (see footnote 4). This underpins the practical relevance of the ETL analysis, as large absolute errors are more harmful than small ones in many contexts, e.g. when detecting traffic participants.
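As a rough illustration of the tail measure, the following sketch computes an expected tail loss from normalized residuals. The definition used here (the mean of all $|r_i|$ at or above their empirical 99% quantile), the function name `expected_tail_loss` and the simple order-statistic quantile rule are assumptions made for illustration; the estimator behind Table 3 may differ in its details.

```python
import math
import random


def expected_tail_loss(mu, y, sigma, q=0.99):
    """Mean of the |r_i| in the upper tail, with r_i = (mu_i - y_i) / sigma_i.

    Assumed definition for illustration; not taken from the paper.
    """
    abs_r = sorted(abs((m - t) / s) for m, t, s in zip(mu, y, sigma))
    # Index of the empirical q-quantile (simple order-statistic rule).
    start = min(int(math.floor(q * len(abs_r))), len(abs_r) - 1)
    tail = abs_r[start:]
    return sum(tail) / len(tail)


# Sanity check on ideally calibrated synthetic data: y_i - mu_i ~ N(0, sigma_i).
rng = random.Random(0)
sigma = [rng.uniform(0.5, 2.0) for _ in range(200_000)]
y = [rng.gauss(0.0, s) for s in sigma]
mu = [0.0] * len(sigma)
etl_ideal = expected_tail_loss(mu, y, sigma)
print(f"ETL_0.99 for calibrated Gaussian residuals: {etl_ideal:.2f}")
```

Under this definition, perfectly calibrated Gaussian residuals give a value close to 2.9, and uncertainty estimates that are too small inflate the $|r_i|$ and hence the tail value.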
Dependencies between uncertainty measures All uncertainty-related measures (NLL, ECE, WS, ETL) relate predicted uncertainties to actually occurring model residuals. Each

4. They are (on average) not due to small absolute residuals $\ll 1$ that go along with even smaller uncertainty estimates.

While all these scores are expectably correlated, noteworthy deviations from ideal correlation occur. Therefore, we advocate for uncertainty evaluations based on various measures to avoid overfitting to a specific formalization of uncertainty. The top panel of Fig. 19 reflects the higher sensitivity of the Wasserstein distance compared to ECE: we observe two "slopes". The first one corresponds to models that overestimate uncertainties, i.e., $\sigma_\theta > |\mu_\theta - y_i|$ on average. In these scenarios, WS is typically below 1, as 1 would be the WS distance between a delta distribution at zero (corresponding to $\sigma_\theta \to \infty$) and the expected $\mathcal{N}(0, 1)$ Gaussian. The second "slope" contains models that underestimate uncertainties, i.e., $\sigma_\theta < |\mu_\theta - y_i|$. WS is not bounded in these scenarios and is thus, unlike ECE, able to resolve differences between any two uncertainty estimators.
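The value of 1 quoted above can be checked directly, assuming that WS denotes the 2-Wasserstein distance between the distribution of normalized residuals and $\mathcal{N}(0, 1)$ (the order of the distance is an assumption here, suggested by the 2-Wasserstein metric used for the training objective). For two one-dimensional Gaussians the distance has a closed form, and the delta distribution at zero arises as the limit of vanishing width:

```latex
W_2^2\big(\mathcal{N}(m_1, s_1^2),\, \mathcal{N}(m_2, s_2^2)\big) = (m_1 - m_2)^2 + (s_1 - s_2)^2
\quad\Longrightarrow\quad
W_2\big(\delta_0,\, \mathcal{N}(0, 1)\big) = \lim_{s_1 \to 0} \sqrt{(0 - 0)^2 + (s_1 - 1)^2} = 1 .
```

Conversely, if the normalized residuals are centered but too wide ($s_1 > 1$, underestimated uncertainties), the same formula gives $W_2 = s_1 - 1$, which grows without bound, in line with the second "slope".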

Application to object regression
After studying toy and standard regression datasets, we turn towards the challenging task of object detection (OD), namely the SqueezeDet model (Wu et al., 2017), a fully convolutional neural network. First, we adapt the W-dropout objective to SqueezeDet (see the following paragraph). Next, we introduce the six considered OD datasets and sketch central technical aspects of training and inference. Since OD networks are often employed in open-world applications (like autonomous vehicles or drones), they likely encounter various types of concept shifts during operations. In such novel scenarios, well-calibrated "self-assessment" capabilities help to foster safe functioning. We therefore evaluate Wasserstein-SqueezeDet

Architecture SqueezeDet takes an RGB input image and predicts three quantities: (i) 2D bounding boxes for detected objects (formalized as a 4D regression task), (ii) a confidence score for each predicted bounding box and (iii) the class of each detection. Its architecture is as follows: first, a sequence of convolutional layers extracts features from the input image. Next, dropout with a drop rate of $p = 0.5$ is applied to the final feature representations. Another convolutional layer, the ConvDet layer, finally estimates prediction candidates. In more detail, SqueezeDet predictions are based on so-called anchors, initial bounding boxes with prototypical shapes. For each such anchor, the ConvDet layer computes a confidence score, class scores and offsets to the initial position and shape. The final prediction outputs are obtained by applying a non-maximum-suppression (NMS) procedure to the prediction candidates. The original loss of SqueezeDet is the sum of three terms. It reads $L_{\text{SqueezeDet}} = L_{\text{regres}} + L_{\text{conf}} + L_{\text{class}}$ with the bounding box regression loss $L_{\text{regres}}$, a confidence-score loss $L_{\text{conf}}$ and the object-classification loss $L_{\text{class}}$. Our modification of the learning objective is restricted to the L2 regression loss: with $\delta\xi_{ijk}$ and $\delta\xi^{G}_{ijk}$ being estimates and ground truth expressed in coordinates relative to the $k$-th anchor at grid point $(i, j)$, where $\xi \in \{x, y, w, h\}$. See Wu et al. (2017) for descriptions of all other loss parameters. Applying W-dropout component-wise to this 4D regression problem yields where $\mu_{\delta\xi_{ijk}}$ is the sample mean and $\sigma^2_{\delta\xi_{ijk}}$ is the sample variance, built from the squared deviations $(\delta\xi^{(l)}_{ijk} - \mu_{\delta\xi_{ijk}})^2$, over the $L$ dropout predictions $\delta\xi^{(l)}_{ijk}$ for $\xi \in \{x, y, w, h\}$.
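The sample statistics that enter the component-wise objective can be sketched as follows. The data layout (a list of L tuples holding the four offsets of a single anchor), the helper name `dropout_statistics` and the 1/L normalization of the variance are assumptions for illustration; only the statistics are shown, not the loss itself.

```python
def dropout_statistics(predictions):
    """Per-coordinate sample mean and variance over L dropout predictions.

    predictions: list of L tuples (dx, dy, dw, dh) for one anchor.
    Uses the biased (1/L) variance; this normalization is an assumption.
    """
    L = len(predictions)
    coords = ("x", "y", "w", "h")
    stats = {}
    for c, name in enumerate(coords):
        values = [p[c] for p in predictions]
        mean = sum(values) / L
        var = sum((v - mean) ** 2 for v in values) / L
        stats[name] = (mean, var)
    return stats


# L = 5 hypothetical sub-network outputs for one anchor.
preds = [
    (0.10, -0.20, 0.05, 0.00),
    (0.12, -0.18, 0.04, 0.02),
    (0.08, -0.22, 0.06, -0.02),
    (0.11, -0.19, 0.05, 0.01),
    (0.09, -0.21, 0.05, -0.01),
]
stats = dropout_statistics(preds)
for name, (mean, var) in stats.items():
    print(f"delta_{name}: mean={mean:+.3f}, var={var:.5f}")
```

In a training framework, the same statistics would be computed on tensors over the L stochastic forward passes so that gradients flow through both mean and variance.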
Datasets We train SqueezeDet networks on six traffic scene datasets: KITTI (Geiger et al., 2012), SynScapes (Wrenninge and Unger, 2018), A2D2 (Geyer et al., 2020), Nightowls (Neumann et al., 2018), NuImages (NuScenes) (Caesar et al., 2020) and BDD100k (Yu et al., 2020). They differ from each other in dataset size (the large BDD100k dataset contains almost 20 times more images than the small KITTI dataset, see Table 4), time of day (Nightowls comprises only nighttime images) and data acquisition (SynScapes is simulation-based). For further information on the datasets, see Table 10 in appendix A.5. We employ image sizes of 672 × 384 and rescale all datasets (except for KITTI, see footnote 5) accordingly. To facilitate cross-dataset model evaluations (see paragraphs on OOD analyses in this section), we group the various object classes of the six datasets into three main categories: 'pedestrian', 'cyclist' and 'vehicle' (see Table 11 in appendix A.5 for the object class mapping). Some static or rare object classes are discarded.
Technical aspects We compare MC-SqueezeDet, i.e., standard SqueezeDet with activated dropout at inference, with W-SqueezeDet, which uses W-dropout instead of the original MSE regression loss. All models are trained for 300,000 mini-batches of size 20. After training, we keep dropout active and compute 50 forward passes for each test image. The detections from all forward passes are clustered using k-means (Bishop, 2006) (see footnote 6). The number of clusters is chosen for each image to match the average number of detections across the 50 forward passes. Each cluster is summarized by its mean detection and standard deviation. To ensure meaningful statistics, we discard clusters with 4 or fewer detections. The cluster means are matched with ground truth. We exclude predictions from the evaluation if their IoU with ground truth is ≤ 0.1. For each dataset, SqueezeDet's maximum number of detections is chosen proportionally to the average number of ground truth objects per image.
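The post-processing rules of this paragraph can be sketched in plain Python. The sketch assumes that cluster labels have already been assigned (e.g. by k-means), that boxes are given as `(x_min, y_min, x_max, y_max)`, and that the average number of detections is rounded to the nearest integer; the function names and these conventions are hypothetical and not taken from the SqueezeDet code base.

```python
from collections import defaultdict
from statistics import mean, pstdev


def number_of_clusters(detections_per_pass):
    """Cluster count matching the average number of detections per forward pass."""
    return max(1, round(mean(len(d) for d in detections_per_pass)))


def summarize_clusters(boxes, labels, min_size=5):
    """Mean box and per-coordinate std for each cluster with >= min_size members.

    min_size=5 discards clusters with 4 or fewer detections.
    """
    clusters = defaultdict(list)
    for box, label in zip(boxes, labels):
        clusters[label].append(box)
    summary = {}
    for label, members in clusters.items():
        if len(members) < min_size:
            continue
        columns = list(zip(*members))
        summary[label] = (
            tuple(mean(col) for col in columns),
            tuple(pstdev(col) for col in columns),
        )
    return summary


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def keep_for_evaluation(cluster_mean, ground_truth, threshold=0.1):
    """Predictions with IoU <= threshold are excluded from the evaluation."""
    return iou(cluster_mean, ground_truth) > threshold
```

The per-coordinate standard deviation of each retained cluster then serves as the uncertainty estimate for the corresponding bounding box coordinate.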
5. For KITTI, we crop images in x-direction to avoid strong distortions due to its high aspect ratio. In y-direction, only a minor upscaling is applied.
6. Using the density-based clustering technique HDBSCAN (Campello et al., 2013)

Regarding ECE (bottom row of Fig. 9), W-SqueezeDet performs consistently stronger, see the 'violet' W-SqueezeDet diagonal (smaller values) and the 'red' MC-SqueezeDet diagonal (higher values). These findings qualitatively resemble those on the standard regression datasets and indicate that W-dropout works well on a modern application-scale network.
To analyze how well these OD uncertainty mechanisms function on test data that is structurally different from training data, we consider two types of out-of-data analyses in the following. First, we study SqueezeDet models that are trained on one OD dataset and evaluated on the test sets of the remaining five OD datasets. This is a rather 'semantic' OOD study, as features like object statistics and scene composition vary between training and OOD test sets. Second, we consider networks that are trained on one OD dataset and evaluated on corrupted versions (defocus blur, Gaussian noise) of the respective test set, thus facing changed 'low-level' features, i.e. less sharp edges due to blur and textures overlaid with pixel noise, respectively.
Out-of-data evaluation on other OD datasets We train one SqueezeDet on each of the six OD datasets and evaluate each of these models on the test sets of the remaining five datasets. The resulting OOD regression scores and OOD ECE values are visualized as off-diagonal elements in Fig. 9 for MC-SqueezeDet (left column) and W-SqueezeDet (right column). Since datasets are ordered by size (a rough proxy for dataset complexity), the upper triangular matrix corresponds to cases in which the evaluation dataset is especially challenging ("easy to hard"), while the lower triangular matrix subsumes easier test sets compared to the respective i.i.d. test set ("hard to easy"). Accordingly, we observe (on average) lower RMSE values in the lower triangular matrix for both SqueezeDet variants. The ECE values of W-SqueezeDet are once more smaller ('violet') compared to MC-SqueezeDet ('red'). The ECE diagonal of W-SqueezeDet is visually more pronounced compared to the one of MC-SqueezeDet since uncertainty calibration is effectively optimized during the training of W-SqueezeDet. The Nightowls dataset causes a cross-shaped pattern, indicating that neither transfers of Nightowls models to other datasets nor transfers of other models to Nightowls work well. This behavior is plausible, as the feature distributions of Nightowls' nighttime images diverge from the (mostly) daytime images of the other datasets. The high uncertainty quality of W-SqueezeDet is underpinned by the evaluations of NLL and WS (see Fig. 15 and text in appendix A.5).

6). We observe a less substantial deterioration of uncertainty quality for blurring compared to adding pixel noise, possibly because the latter more strongly affects short-range pixel correlations that the networks rely on.

Conclusion
The prevailing approaches to uncertainty quantification rely on parametric uncertainty estimates by means of a dedicated network output. In this work, we propose a novel type of uncertainty mechanism, Wasserstein dropout, that quantifies (aleatoric) uncertainty in a purely non-parametric manner: by revisiting and newly assembling core concepts from existing dropout-based uncertainty methods, we construct distributions of randomly drawn sub-networks that closely approximate the actual data distributions. This is achieved by a natural extension of the Euclidean metric ($L_2$ loss) for points to the 2-Wasserstein metric for distributions. In the limit of vanishing distribution width, i.e. vanishing uncertainty, both metrics coincide. Assuming Gaussianity and making a bootstrap approximation, the metric can be replaced by a compact loss objective affording stable training. To the best of our knowledge, W-dropout is the first non-parametric method to model aleatoric uncertainty in neural networks. It outperforms the ubiquitous parametric approaches, as shown, e.g., by our comparison to deep ensembles (PU-DE).
An extensive additional study of uncertainties under data shift further reveals advantages of W-dropout models compared to deep ensembles (PU-DE) and parametric models combined with dropout (PU-MC): the Wasserstein-based technique still provides (on average) better calibrated uncertainty estimates while exhibiting higher stability across a variety of datasets and data shifts. In contrast, we find parametric uncertainty estimation (PU) to be prone to instabilities that are only partially cured by the regularizing effects of explicit or implicit (dropout-based) ensembling (PU-DE, PU-MC). With respect to worst-case scenarios, W-dropout networks are better than either PU-DE or PU-MC by a large margin. This makes W-dropout especially suitable for safety-critical applications like automated driving or medical diagnosis, where (even rarely occurring) inadequate uncertainty estimates might lead to injuries and damage. Furthermore, while our theoretical derivation focuses on aleatoric uncertainty, the presented distribution-shift experiments suggest that W-dropout is also able to capture epistemic uncertainty. Finding a theoretical explanation for this is subject of future research.
With respect to computational demands, W-dropout is roughly equivalent to MC dropout (MC) and, in fact, could be used as a drop-in replacement for the latter. While L-fold sampling of sub-networks increases the training complexity, we observe an increase of training time that is significantly below a factor of L in our implementation. Inference is performed in the same way for both methods, and thus their run-time complexities are equivalent as well. In comparison to deep ensembles, W-dropout's use of a single network reduces requirements on training and storage at the expense of multiple forward passes during inference. This property is shared with MC, and approaches exist to reduce the prediction cost; for instance, last-layer MC allows sampling-free inference (see also Postels et al. (2019)).
In addition to the toy and 1D regression experiments, SqueezeDet is selected as a representative of large-scale object detection networks. We find the above-mentioned properties of Wasserstein dropout to carry over to Wasserstein-SqueezeDet, namely the enhanced uncertainty quality and its increased stability under different types of data shifts. At the same time, the observed performance losses are minimal. Overall, our experiments on SqueezeDet show that W-dropout scales to larger networks relevant for practical applications.
Taking a step back, the idea to "migrate" from single-point modeling to full distributions is a very general one and can be applied to a variety of tasks. Replacing, e.g., Gaussians with Dirichlet distributions makes an application to classification conceivable, where Malinin and Gales (2018) employ parametric (Dirichlet) distributions to quantify uncertainty. Conceptually, our findings suggest that distribution modeling based on sampling generalizes better compared to parameterized counterparts. This observation might find applications far outside the scope of uncertainty quantification.
All experiments are conducted on Core Intel(R) Xeon(R) Gold 6126 CPUs and NVIDIA Tesla V100 GPUs. Conducting the described experiments with cross-validation on one CPU takes 20 h for toy data, 130 h for 1D regression datasets and approximately 100 h for object regression on the GPU.

A.2. Toy datasets: systematic evaluation and further experiments
The toy-noise and toy-hf datasets are sampled from $f_{\text{noise}}(x) \sim \mathcal{N}(0, \exp(-0.02\,x^2))$ for $x \in [-15, 15]$ and $f_{\text{hf}}(x) = 0.25\,x^2 - 0.01\,x^3 + 40 \exp(-(x+1)^2/200) \sin(3x)$ for $x \in [-15, 20]$, respectively. Standard normalization is applied to input and output values. Detailed evaluations of the considered uncertainty methods on these datasets are given in Table 7. To illustrate the capabilities and limitations of MC dropout regarding the modeling of aleatoric uncertainty, we consider the toy-noise dataset again and systematically vary MC's regularization parameter $\lambda$ (see Fig. 10, $\lambda$ decreases from left to right). As MC dropout's uncertainty estimates contain an additive constant term proportional to $\lambda$, tuning this parameter allows modeling the average aleatoric uncertainty (the ideal $\lambda$ in Fig. 10 is between $\lambda = 10^{-6}$ and $\lambda = 10^{-5}$). Input dependencies of the noise (heteroscedasticity) can, however, not be incorporated, i.e. even an optimized $\lambda$ causes systematic over- and underestimations of the data uncertainty in many cases. This is in contrast to W-dropout.
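A minimal sketch of how such toy data could be generated is given below. It assumes that inputs are drawn uniformly from the stated intervals and that the second argument of the normal distribution denotes a standard deviation (following the usage for the 'noisy line' dataset in this appendix); both are assumptions, as are the function names and the sample size. The standard normalization step is omitted.

```python
import math
import random


def f_hf(x):
    """Deterministic high-frequency toy function."""
    return (0.25 * x**2 - 0.01 * x**3
            + 40.0 * math.exp(-((x + 1.0) ** 2) / 200.0) * math.sin(3.0 * x))


def noise_scale(x):
    """Input-dependent noise scale exp(-0.02 x^2), read here as a standard deviation."""
    return math.exp(-0.02 * x**2)


def sample_toy_noise(n, rng):
    """Pairs (x, y) with x uniform on [-15, 15] and heteroscedastic zero-mean noise y."""
    xs = [rng.uniform(-15.0, 15.0) for _ in range(n)]
    return [(x, rng.gauss(0.0, noise_scale(x))) for x in xs]


def sample_toy_hf(n, rng):
    """Pairs (x, f_hf(x)) with x uniform on [-15, 20]."""
    xs = [rng.uniform(-15.0, 20.0) for _ in range(n)]
    return [(x, f_hf(x)) for x in xs]


rng = random.Random(0)  # arbitrary seed
toy_noise = sample_toy_noise(1000, rng)
toy_hf = sample_toy_hf(1000, rng)
```

The noise scale equals 1 at the origin and decays to roughly 0.01 at the interval boundaries, which is the heteroscedasticity that a single global regularization parameter cannot reproduce.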
Having shown that W-dropout can approximate input-dependent data uncertainty appropriately (see Fig. 1), we now analyze its ability to match ground truth uncertainties $\sigma_{\text{true}}$ more systematically. To this end, we fit a 'noisy line' toy dataset that is given by $(x_i, y_i)$ with $x_i \sim \mathcal{U}(-1, 1)$ and $y_i \sim \mathcal{N}(0, \sigma_{\text{true}})$. The ground truth standard deviations take the values $\sigma_{\text{true}} = 0, 0.1, 0.2, 0.5, 1, 2, 5, 10$. Fig. 11 emphasizes that W-dropout provides accurate uncertainty estimates for both small and large noise levels. Minor x-dependent fluctuations (see 'whiskers' in Fig. 11) decrease monotonically with $\sigma_{\text{true}}$.

A.3. Standard regression datasets: systematic evaluation
An overview of the 1D regression datasets, providing basic statistics and information on preprocessing, is given in Table 8. Evaluations of RMSE, NLL, ECE and WS on dataset level can be found in Table 9.

A.4. Residual-uncertainty scatter plots
Visual inspection of uncertainties can be helpful to understand their qualitative behavior. We scatter model residuals $\mu_i - y_i$ (respective x-axis in Fig. 13) against model uncertainties $\sigma_i$ (resp. y-axis in Fig. 13). For a hypothetical ideal uncertainty mechanism, we expect $(y_i - \mu_i) \sim \mathcal{N}(0, \sigma_i)$, i.e., model residuals following the predictive uncertainty distribution. More concretely, 68.3% of all $(y_i - \mu_i)$ would lie within the respective interval $[-\sigma_i, \sigma_i]$ and 99.7% of all $(y_i - \mu_i)$ within $[-3\sigma_i, 3\sigma_i]$. Fig. 12 visualizes this hypothetical ideal. It is generated as follows: we draw 3,000 standard deviations $\sigma_i \sim \mathcal{U}(0, 2)$ and sample residuals $r_i$ from the respective normal distributions, $r_i \sim \mathcal{N}(0, \sigma_i)$. The pairs $(r_i, \sigma_i)$ are visualized. By construction, uncertainty estimates now ideally match residuals in a distributional sense. Geometrically, the described Gaussian properties imply that 99.7% of all scatter points, e.g., in Fig. 13, should lie above the blue 3σ lines and 68.3% of them above the yellow 1σ lines. For toy-noise, abalone and superconduct (first, third and fourth row in Fig. 13), PU, PU-DE and W-dropout qualitatively fulfill this requirement, while MC, MC-LL and DE tend to underestimate uncertainties. This finding is in accordance with our systematic evaluation. The naval dataset (second row in Fig. 13) poses an exception in this regard, as all uncertainty methods lead to comparably convincing uncertainty estimates. The small test RMSEs of all methods on naval indicate relatively small aleatoric uncertainties and model residuals. Epistemic uncertainty might thus be a key driving factor and, coherently, MC, MC-LL and DE perform well.
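The construction of the hypothetical ideal can be reproduced with a few lines of standard-library Python. The sketch below follows the recipe stated above (3,000 standard deviations drawn uniformly from [0, 2], one Gaussian residual per standard deviation) and checks the two coverage fractions; the random seed and variable names are arbitrary choices, and the second argument of the normal distribution is read as a standard deviation.

```python
import random

rng = random.Random(0)  # arbitrary seed
n = 3000

# Ideal uncertainty mechanism: residuals are drawn from N(0, sigma_i).
sigmas = [rng.uniform(0.0, 2.0) for _ in range(n)]
residuals = [rng.gauss(0.0, s) for s in sigmas]

# Fractions of residuals inside the 1-sigma and 3-sigma intervals.
within_1 = sum(abs(r) <= s for r, s in zip(residuals, sigmas)) / n
within_3 = sum(abs(r) <= 3.0 * s for r, s in zip(residuals, sigmas)) / n
print(f"within 1 sigma: {within_1:.3f}, within 3 sigma: {within_3:.3f}")
```

Up to sampling fluctuations, the two fractions reproduce the 68.3% and 99.7% quoted above; an uncertainty method that underestimates its standard deviations would show visibly lower coverage in the same check.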

A.5. Object detection: systematic evaluation
We report basic information on the object detection (OD) datasets and their harmonization in the first paragraph of this subsection. Supplementary evaluations of SqueezeDet can be found subsequently in the second paragraph.

Details on OD datasets
The six OD datasets we consider are diverse in multiple dimensions as they capture traffic scenes from three continents (Asia, Europe and North America) and cover a broad set of scenarios ranging from cities and metropolitan areas over country roads to highways (see Table 10). They moreover differ in the average number of objects per image (see Table 4) that reaches its highest values for the simulation-based

Further results on SqueezeDet
Coordinate-wise regression results and uncertainty scores for MC-SqueezeDet and W-SqueezeDet on KITTI are shown in Table 12. While we observe noteworthy differences between coordinates, the relative ordering of MC-SqueezeDet and W-SqueezeDet for a given measure remains the same. Analyzing in-data and out-of-data NLL and WS values for all six datasets (see Fig. 15), we find results that qualitatively resemble those on ECE in Fig. 9. W-SqueezeDet outperforms MC-SqueezeDet on the respective i.i.d. test set and also under data shift. For both uncertainty approaches, some NLL values are affected by outliers.

explained by the fact that the $L$ sub-networks in a given optimization step overlap less for higher values of $p$, thus allowing them to approximate the actual data distribution more closely. We choose $p = 0.1$ as the complexity of the resulting sub-networks is only mildly reduced compared to the deterministic full network. Studying the impact of sample size $L = 4, 5, 8, 10, 20$, we find RMSE (see top panel of Fig. 18) to be largely stable w.r.t. this parameter. For ECE (see bottom panel of Fig. 18), train scores grow with $L$, indicating a certain over-estimation of the present aleatoric uncertainties. This artefact does not carry over to test data, though, where we observe broadly similar mean values and 75% quantiles. Under data shift, certain fluctuations of ECE occur as the sample size $L$ changes; however, there is no clear trend. We thus choose the rather small $L = 5$ to keep the computational overhead down.

The data splits in Fig. 3 and Fig. 19 are color-coded as follows: train is green, test is blue, PCA-interpolate is green-yellow, PCA-extrapolate is orange-yellow, label-interpolate is red and label-extrapolate is light red. The mapping between uncertainty methods and plot markers reads: SWAG is 'triangle', MC is 'diamond', MC-LL is 'thin diamond', DE is 'cross', PU is 'point', PU-DE is 'star', PU-MC is 'circle', PU-EV is 'pentagon' and W-dropout is 'plus'. This visualization is based on the 14 standard regression datasets. Some Wasserstein distances lie above the x-axis cut-off and are thus not visualized.

C.2. Discussion of NLL as a measure of uncertainty
DNNs that provide uncertainty estimates are typically evaluated in terms of their negative log-likelihood (NLL). This quantity is affected not only by the uncertainty, but also by the DNN's performance. Additionally, it is difficult to interpret, sometimes leading to counterintuitive results, which we want to elaborate on here. As a first example, take the likelihood of two datasets $x_1 = \{0\}$ and $x_2 = \{0.5\}$, each consisting of a single point, with respect to a normal distribution $\mathcal{N}(0, 1)$. Naturally, we find $x_1$ to be located at the maximum of the considered normal distribution and deem it the more likely candidate. But, if we extend these datasets to more than single points, i.e., $\tilde{x}_1 = \{0, 0.1, 0, -0.1, 0\}$ and $\tilde{x}_2 = \{0.5, -0.4, 0, -1.9, -0.7\}$, it becomes obvious that $\tilde{x}_2$ is much more likely to follow the "better", it is highly data- (and prediction-) dependent which value is good in the sense of a reasonable correlation between performance and uncertainty.
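The example with the two five-point datasets can be made concrete numerically. The following sketch evaluates the average NLL of each dataset under a standard normal and, for comparison, its sample standard deviation; the helper names and the labels "narrow" and "spread" are illustrative choices, not notation from the text.

```python
import math


def mean_nll_standard_normal(points):
    """Average negative log-likelihood of the points under N(0, 1)."""
    const = 0.5 * math.log(2.0 * math.pi)
    return sum(const + 0.5 * x * x for x in points) / len(points)


def sample_std(points):
    """Unbiased sample standard deviation."""
    m = sum(points) / len(points)
    return math.sqrt(sum((x - m) ** 2 for x in points) / (len(points) - 1))


narrow = [0.0, 0.1, 0.0, -0.1, 0.0]
spread = [0.5, -0.4, 0.0, -1.9, -0.7]

for name, data in (("narrow", narrow), ("spread", spread)):
    print(f"{name}: mean NLL = {mean_nll_standard_normal(data):.3f}, "
          f"sample std = {sample_std(data):.3f}")
```

The narrow dataset attains the lower average NLL (about 0.92 versus 1.37) although its sample standard deviation (about 0.07) is far from 1, whereas the spread dataset (about 0.90) looks much more like a draw from a standard normal. A lower NLL alone therefore does not certify that the assumed distribution describes the data well.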

Figure 2: Comparison of the proposed Wasserstein-based measure (WS) and the expected calibration error (ECE). We measure the deviation between a standard normal distribution $\mathcal{N}(0, 1)$ (lhs, red) and shifted normal distributions $\mathcal{N}(\mu, 1)$ (top left, dark blue) and squeezed/stretched normal distributions $\mathcal{N}(0, \sigma)$ (bottom left, dark blue), respectively. The resulting ECE values (orange) and WS values (blue) on the rhs emphasize the higher sensitivity of WS in case of large distributional differences. For details on ECE and WS, see text.

Figure 3: Dependency between the Wasserstein-based measure and the expected calibration error for Gaussian toy data (gray curves) and for 1D standard datasets (point cloud, see subsection 4.3 for details). The toy curves are obtained by plotting (WS(σ), ECE(σ)) from Fig. 2 (bottom right). For 1D standard datasets, uncertainty methods are encoded via plot markers, data splits via color. Datasets are not encoded and cannot be distinguished (see appendix C for more details). Each plot point corresponds to a cross-validated trained network.

Figure 4: Comparison of uncertainty approaches (columns) on two 1D toy datasets: a noisy one (top) and a high-frequency one (bottom). Test data ground truth (respective first row) is shown with mean estimates (resp. second row) and standard deviations (resp. third row). The light green dashed curve (third row) indicates the ground truth uncertainty. Similar uncertainty approaches (columns) are grouped together; W-dropout is highlighted by a yellow frame.

Figure 5: Scheme of two non-i.i.d. splits: a PCA-based split in input space (left) and a label-based split in output space (right). While datasets appear to be convex here, they are (most likely) not in reality.

Figure 6: Root-mean-square errors (RMSEs (↓), top row) and expected calibration errors (ECEs (↓), bottom row) of different uncertainty methods under i.i.d. conditions (first and second panel in each row) and under various kinds of data shift (third to sixth panel in each row, see text for details). W-dropout (light blue background) is compared to 8 benchmark approaches. Each blue cross is the mean over 14 1D regression datasets. Orange line markers indicate median values. The gray vertical bars reach from the 25% quantile (bottom horizontal line) to the 75% quantile (top horizontal line).

Figure 7: Negative log-likelihoods (NLLs (↓), top row) and Wasserstein distances (↓, bottom row) of different uncertainty methods under i.i.d. conditions (first and second panel in each row) and under various kinds of data shift (third to sixth panel in each row, see text for details). W-dropout (light blue background) is compared to 8 benchmark approaches. Each blue cross is the mean over 14 standard regression datasets. Orange line markers indicate median values. The gray vertical bars reach from the 25% quantile (bottom horizontal line) to the 75% quantile (top horizontal line).

Figure 8: Extrapolation behavior of W-dropout (orange) and MC dropout (blue). Two extrapolation "directions" (rows) and two datasets (columns) are considered. The vertical bar in each panel separates training data (left) from out-of-data (OOD, right). Scatter points show the predicted standard deviation for individual data points. The colored solid lines show averages over points in equally-sized bins and reflect the expected growth of epistemic uncertainty in the OOD region. For details on the data splits and extrapolations, please refer to subsection 4.3 and appendix A.3.

Figure 9: In-data and out-of-data evaluation of MC-SqueezeDet (lhs) and W-SqueezeDet (rhs) on six OD datasets. We consider regression quality (RMSE, top row) and uncertainty quality (ECE, bottom row). For each heatmap entry, the row label refers to the training dataset, the column label to the test dataset. Thus, diagonal matrix elements are in-data evaluations, non-diagonal elements are OOD analyses. W-SqueezeDet provides substantially smaller ECE values both in-data and out-of-data.

Figure 10: MC dropout and aleatoric uncertainty. The regularization parameter λ of MC dropout allows modeling the average (homoscedastic) noise level of a dataset. As the regularizer is not input-dependent, it does not capture the x-dependency of the noise level, i.e. the heteroscedasticity of the dataset, see third row.

Figure 11: Standard deviation $\sigma_{\text{w-drop}}$ of W-dropout (y-axis) when fitted to a toy dataset with ground truth standard deviation $\sigma_{\text{gt}}$ (x-axis, see text for details). The bisecting line is shown in gray. While $\sigma_{\text{w-drop}}$ exhibits fluctuations (black 'whiskers' at 10% and 90% quantile), it provides on average accurate estimates of the ground truth uncertainty. Both mean value (blue cross) and median value (orange bar) of $\sigma_{\text{w-drop}}$ are close to the bisector.

8. For A2D2, 2D bounding boxes are inferred from semantic segmentation ground truth.

Figure 13: Prediction residuals (respective x-axis) and predictive uncertainty (respective y-axis) for different uncertainty mechanisms (columns) and datasets (rows). Each light blue dot in each plot corresponds to one test data point. Realistic uncertainty estimates should lie mostly above the blue 3σ lines. The datasets toy-noise, naval, abalone and superconduct are shown, from top to bottom.

Figure 14: Two exemplary object detection images from BDD100k (top row, real-world image) and SynScapes (bottom row, synthetic image), respectively. For each original image (left column), two corrupted versions are generated: a blurred one (middle column) and a noisy one (right column), see text for details.

Figure 15: In-data and out-of-data evaluation of MC-SqueezeDet (lhs) and W-SqueezeDet (rhs) on six OD datasets. We consider the negative log-likelihood (NLL, top row) and the Wasserstein measure (WS, bottom row). For each heatmap entry, the row label refers to the training dataset and the column label to the test dataset. Thus, diagonal matrix elements are in-data evaluations, non-diagonal elements are OOD analyses.

Figure 17: Dependence of Wasserstein dropout on drop rate p. Root-mean-square errors (RMSEs (↓), top row) and expected calibration errors (ECEs (↓), bottom row) are shown for neuron drop rates of p = 0.05, 0.1, 0.2, 0.3, 0.4, 0.5 under i.i.d. conditions (first and second panel in each row) and under various kinds of data shift (third to sixth panel in each row, see text for details). W-dropout with p = 0.1 (used for evaluations on toy and 1D regression data) is highlighted by a light blue background. Each blue cross is the mean over 10 standard regression datasets. Orange line markers indicate median values. The gray vertical bars reach from the 25% quantile (bottom horizontal line) to the 75% quantile (top horizontal line).

Figure 18: Dependence of Wasserstein dropout on sample size L. Root-mean-square errors (RMSEs (↓), top row) and expected calibration errors (ECEs (↓), bottom row) are shown for sample sizes of L = 4, 5, 8, 10, 20 under i.i.d. conditions (first and second panel in each row) and under various kinds of data shift (third to sixth panel in each row, see text for details). W-dropout with L = 5 (used throughout the rest of the paper) is highlighted by a light blue background. Each blue cross is the mean over 10 standard regression datasets. Orange line markers indicate median values. The gray vertical bars reach from the 25% quantile (bottom horizontal line) to the 75% quantile (top horizontal line).

Figure 19: Dependencies between the three uncertainty measures ECE, Wasserstein distance and Kolmogorov-Smirnov distance. Uncertainty methods are encoded via plot markers, data splits via color. Datasets are not encoded and cannot be distinguished (see text for more details). Each plot point corresponds to a cross-validated trained network. The clearly visible deviations from ideal correlations point to the potential of these uncertainty measures to complement one another.

Table 1: Regression performance (RMSE) and uncertainty quality (NLL, ECE, WS) of W-dropout and various uncertainty benchmarks. W-dropout yields the best uncertainty scores while providing a competitive RMSE value. Each number is the average across 14 standard 1D (test) datasets. The figures in this table correspond to the blue crosses in the second columns of Fig. 6 and Fig. 7, respectively. See text for further details.

Table 2: Out-of-data analysis of W-dropout and various uncertainty benchmarks. Regression performance (RMSE) and uncertainty quality (NLL, ECE, WS) are displayed. As for in-domain test data, W-dropout outperforms the other uncertainty methods without sacrificing regression quality. Each number is obtained by two-fold averaging: firstly, across two types of out-of-data test sets (label-based and PCA-based splits) and secondly, across 14 standard 1D datasets. The figures in this table are based on the blue crosses in the last four columns of Fig. 6 and Fig. 7, respectively. See text for further details.

fifth panel) amplify the deterioration in performance across all methods. Again, W-dropout yields competitive RMSE values (see also Tab. 2).

Table 3: Study of worst-case scenarios for different uncertainty methods: W-dropout (W-Drop), PU-DE and PU-MC are compared to the ideal Gaussian case for i.i.d. and non-i.i.d. data splits. Uncertainty quality in these scenarios is quantified by the expected tail loss at the 99% quantile (ETL_0.99). Each mean and max value is taken over the ETLs of 110 models trained on 15 different datasets.
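An expected tail loss at a quantile q can be sketched as the mean of the worst (1 - q) fraction of per-sample losses. The snippet below is an illustration under assumptions, not the paper's exact definition: the per-sample loss is assumed to be the Gaussian negative log-likelihood, and the function names `gaussian_nll` and `expected_tail_loss` are hypothetical.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2) (assumed loss)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)


def expected_tail_loss(losses, q=0.99):
    """Mean of the largest (1 - q) fraction of the given losses.

    At least one loss is always included, so small samples still
    yield a value (then simply the maximum loss).
    """
    ordered = sorted(losses)
    k = max(1, round((1.0 - q) * len(ordered)))  # number of tail samples
    tail = ordered[-k:]
    return sum(tail) / len(tail)
```

With 100 losses and q = 0.99 the tail contains a single sample, so the result equals the maximum loss; larger test sets give a smoother estimate.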
(Stephens, 1974) emphasize different aspects of the considered samples: NLL is biased towards well-performing models, ECE measures deviations within quantile ranges, Wasserstein distance resolves distances between normalized residuals and ETL focuses on distribution tails. The empirically observed dependencies between WS and ECE are visualized in Fig. 3. In addition to WS and ECE, we consider Kolmogorov-Smirnov (KS) distances (Stephens, 1974) on normalized residuals in Fig. 19 in appendix C.
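To make the differences between these measures concrete, the following sketch evaluates three of them on normalized residuals z = (y - μ) / σ, which should follow a standard normal distribution if the uncertainty estimates are ideal. The definitions are assumptions chosen for illustration (a 1-Wasserstein distance, a decile-based ECE, and the usual one-sample KS statistic) and may differ in detail from those used in the paper; all function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # reference distribution N(0, 1)


def ks_distance(z):
    """Largest gap between the empirical CDF of z and the N(0, 1) CDF."""
    zs = sorted(z)
    n = len(zs)
    d = 0.0
    for i, v in enumerate(zs):
        c = STD_NORMAL.cdf(v)
        d = max(d, (i + 1) / n - c, c - i / n)
    return d


def wasserstein_1(z):
    """Approximate 1-Wasserstein distance to N(0, 1).

    Sorted residuals are matched to normal quantiles at bin midpoints.
    """
    zs = sorted(z)
    n = len(zs)
    return sum(abs(v - STD_NORMAL.inv_cdf((i + 0.5) / n)) for i, v in enumerate(zs)) / n


def ece(z, n_bins=10):
    """Summed |observed - expected| frequency over N(0, 1) quantile bins."""
    counts = [0] * n_bins
    for v in z:
        k = min(int(STD_NORMAL.cdf(v) * n_bins), n_bins - 1)
        counts[k] += 1
    return sum(abs(c / len(z) - 1.0 / n_bins) for c in counts)


# Toy comparison: well-calibrated residuals vs. residuals of a model
# whose predicted standard deviation is three times too small.
random.seed(0)
calibrated = [random.gauss(0.0, 1.0) for _ in range(5000)]
overconfident = [3.0 * v for v in calibrated]
```

All three scores are close to zero for `calibrated` and clearly larger for `overconfident`, but they weight the deviations differently: KS reports only the single largest CDF gap, ECE aggregates frequency mismatches per quantile range, and the Wasserstein distance grows with how far the residuals are displaced.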

Table 4: Basic statistics of the harmonized object detection datasets. Dataset size and number of annotated objects are reported for train data (first two columns) and test data (last two columns). For details on dataset harmonization, see text and references therein.
not only in-domain but on corrupted and augmented test data as well as on other object detection datasets (see last paragraphs of this subsection).

Table 5: Regression performance and uncertainty quality of SqueezeDet-type networks on KITTI data. W-SqueezeDet (W-SqzDet) is compared with the default MC-SqueezeDet (MC-SqzDet). The values of NLL, ECE and WS are aggregated across their respective four dimensions, for details see appendix A.5 and Table 12 therein.
In-data evaluation. To assess model performance, we report the mean intersection over union (mIoU) and RMSE (in pixel space) between predicted bounding boxes and matched ground truths. The quality of the uncertainty estimates is measured by (coordinate-wise) NLL, ECE, WS and ETL. Table 5 shows a summary of our results on train and test data for the KITTI dataset. The results for NLL, ECE, WS and ETL have been averaged across the 4 regression coordinates. MC-SqueezeDet (abbreviated as MC-SqzDet) and W-SqueezeDet (W-SqzDet) show comparable regression results in terms of RMSE and mIoU, with slight advantages for MC-SqueezeDet. Considering uncertainty quality, we find substantial advantages for W-SqueezeDet across all evaluation measures. These advantages are due to the estimation of heteroscedastic aleatoric uncertainty during training (see also the test statistics 'trajectories' during training for BDD100k in Fig. 16 in appendix A.5). The test RMSE and ECE values of all six OD datasets are visualized as diagonal elements in Fig. 9. The (mostly) 'violet' RMSE diagonals for MC-SqueezeDet and W-SqueezeDet (top row of Fig. 9) again indicate comparable regression performances. Datasets are ordered by size from small (top) to large (bottom). The large NuImages test set appears to be the most challenging one.
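The two regression scores can be illustrated with a short sketch. It assumes that boxes are given as (x_min, y_min, x_max, y_max) tuples in pixel coordinates and that predictions are already matched one-to-one with ground truths; the matching step itself and the function names are not taken from the paper.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average IoU over matched (prediction, ground truth) pairs."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def rmse(preds, gts):
    """Root-mean-square error over all box coordinates of matched pairs."""
    sq = [(pc - gc) ** 2 for p, g in zip(preds, gts) for pc, gc in zip(p, g)]
    return math.sqrt(sum(sq) / len(sq))
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates an overlap of area 1 against a union of area 7.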

Table 6: Out-of-data evaluation of MC-SqueezeDet (MC-SqzDet) and W-SqueezeDet (W-SqzDet) on distorted OD datasets. Each model is trained on the original dataset and evaluated on two modified versions of the respective test set: a blurred one (first two columns) and a noisy one (last two columns), see text for details. We report the expected calibration error (ECE) and find W-SqueezeDet to perform better than MC-SqueezeDet on most datasets.

In contrast to the analysis above, we now focus on 'non-semantic' data shifts due to technical distortions. For each test set, we generate a blurred and a noisy version.7 Two examples of these transformations can be found in Fig. 14 in appendix A.5. In accordance with previous results, W-SqueezeDet provides smaller ECE values compared to MC-SqueezeDet on most blurred and noisy test sets (see Table

Table 7: Regression performance and uncertainty quality of networks with different uncertainty mechanisms. All scores are calculated on the test set of toy-hf and toy-noise, respectively.

Table 8: Details on 1D regression datasets. Ground truth (gt) is partially preprocessed to match the 1D regression setup.

Finally, both random and sequence-based train-test splits are considered. This variety is moreover reflected in the numerous object classes the different datasets provide. Their mappings to three main categories ('pedestrian', 'cyclist', 'vehicle') can be found in Table 11. Rare or irregular classes are removed. For KITTI, we moreover discard 'van', 'truck' and 'person-sitting', following the original SqueezeDet paper. To analyze uncertainty quality on distorted images, blurred and noisy versions of the test datasets are created. Fig. 14 shows these transformations for two exemplary images from BDD100k (top row) and SynScapes (bottom row), respectively.

Table 9: Regression performance and uncertainty quality of networks with different uncertainty mechanisms. The scores are calculated on the test sets of 14 standard regression datasets.

Table 10: General information on the object detection datasets.

Table 11: Harmonization of the object detection datasets. The various object classes of the six object detection datasets (rows) are grouped into the three main categories "vehicle", "pedestrian" and "cyclist" (columns). Some classes are too rare or irregular and are thus discarded.