1 Introduction

A growing number of software systems are Deep Learning based Software Systems (DLS), i.e., they contain at least one Deep Neural Network (DNN), as a consequence of the impressive performance that DNNs achieve in complex tasks, such as image, speech or natural language processing, and of the availability of affordable but highly performant hardware (e.g., GPUs) on which DNNs can be executed. DNN algorithms identify, extract and interpret relevant features in a training data set, learning to make predictions about an unknown function of the inputs at system runtime. Given the complexity of the tasks for which DNNs are used, predictions are typically made under uncertainty. We distinguish between epistemic uncertainty, i.e., model uncertainty, which may be removed by better training of the model, possibly on better training data, and aleatoric uncertainty, i.e., model-independent uncertainty inherent in the prediction task (e.g., the prediction of a non-deterministic event). The former is typically due to out-of-distribution (OOD) inputs, i.e., inputs that are inadequately represented in the training set.

The latter may be due to ambiguity, i.e., an input for which multiple labels are all possibly correct (which could be understood as identical inputs having different, but correct, labels or, more generally, inputs having probabilistic labels).

This is a major issue often ignored during DNN testing, as recently recognized by Google AI scientists: “many evaluation datasets contain items that (...) miss the natural ambiguity of real-world context” Aroyo and Paritosh (2021).

The existence of uncertainty led to the development of DNN Supervisors (in short, supervisors), which aim to recognize inputs for which the DL component is likely to make incorrect predictions, allowing the DLS to take appropriate countermeasures to prevent harmful system misbehavior (Stocco et al. 2020; Henriksson et al. 2019, 2019a; Weiss and Tonella 2021, 2022; Catak et al. 2021; Hell et al. 2021; Hussain et al. 2022). For instance, the supervisor of a self-driving car might safely disengage the auto-pilot when detecting a high-uncertainty driving scene (Stocco et al. 2020; Wintersberger et al. 2021). Other examples of application domains where supervision is crucial include medical diagnosis (Davidson et al. 2021; Brown and Leontidis 2021) and natural hazard risk assessment (Bjarnadottir et al. 2019).

While most recent literature on uncertainty-driven DNN testing focuses on out-of-distribution detection (Henriksson et al. 2019, Henriksson et al. 2019a; Berend et al. 2020; Stocco et al. 2020; Zhang et al. 2018; Weiss and Tonella 2021, 2022; Kim et al. 2018; Kim and Yoo 2021; Dola et al. 2021), studies considering true ambiguity are lacking, which poses a big practical risk: supervisors that perform well at detecting epistemic uncertainty are not guaranteed to perform well at detecting aleatoric uncertainty. In fact, recent literature suggests the opposite (Mukhoti et al. 2021). The lack of studies considering true ambiguity is related to, if not caused by, the unavailability of ambiguous test data for common case studies: while a variety of precompiled datasets and generation techniques are publicly available to create OOD data, such as corrupted and adversarial inputs (Mu and Gilmer 2019; Hendrycks and Dietterich 2018; Rauber et al. 2017), and invalid or mislabelled data is trivial to create in most cases, we are not aware of any approach targeting the generation of true ambiguity in a way that is sufficient for reliable and fair supervisor assessment. In this paper we aim to close this gap by making the following contributions:

Approach

We propose AmbiGuess, a novel approach to generate diverse, labelled, ambiguous images for image classification tasks. Our approach is classifier independent, i.e., it aims to create data which is ambiguous to a hypothetical, perfectly well trained oracle (e.g., a human domain expert), and which does not just appear ambiguous to a specific, suboptimally trained DNN.

Datasets

Using AmbiGuess, we generated and released two ready-to-use ambiguous datasets for common benchmarks in deep learning testing: MNIST (LeCun et al. 1998), a collection of grayscale handwritten digits, and Fashion-MNIST (Xiao et al. 2017), a more challenging classification task, consisting of grayscale fashion images.

Supervisor Testing

Equipped with our datasets, we measured the capability of 16 supervisors at detecting different types of high-uncertainty inputs, including ambiguous ones. Our results indicate that there is complementarity in the supervisors’ capability to detect either ambiguity or corrupted inputs.

2 Background

Ambiguous Inputs

In many real-world applications, the data observed at prediction time might not be sufficient to make a certain prediction, even assuming a hypothetical optimal oracle such as a domain expert with exhaustive knowledge: If some information required to make a correct prediction is missing, such missing information can be seen as a random influence, thus introducing aleatoric uncertainty in the prediction process.

Formally, consider a classification problem, i.e., a machine learning (ML) problem where the output is the class c to which the input x is predicted to belong. Let \(P(c \mid x)\) denote the probabilistic label, indicating the probability that x belongs to c in the ground truth’s underlying distribution, where the observation \(x \in \mathbb{O}\) and \(\mathbb{O}\) denotes the observable space, i.e., the set of all possibly observable inputs. We define true ambiguity as follows:

Definition 1

(True Ambiguity in Classification) A data point \(x \in \mathbb{O}\) is truly ambiguous if and only if \(P(c \mid x) > 0\) for more than one class c.

Thus, an input to a classification problem is considered truly ambiguous if and only if it is part of an overlap between two or more classes. We emphasize true ambiguity to indicate ambiguity intrinsic to the data and independent from any model and its classification confidence/accuracy. In this way, we distinguish our notion from that of other papers that use the term ambiguous with a different meaning, such as low-confidence inputs, mislabelled inputs, where a label in the training/test set is clearly wrong, i.e., the corresponding probability in \(P(c \mid x)\) is 0 (Seca 2021), or invalid inputs, where no true label exists for a given input. In simple domains, where humans may have no epistemic uncertainty (i.e., they know the matter perfectly), true ambiguity is equivalent to human ambiguity. In the remainder of this paper we focus only on true ambiguity and, if not otherwise mentioned, we use the term ambiguity as a synonym for true ambiguity.
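To make the definition concrete, a probabilistic label can be represented as a vector of class probabilities; Definition 1 then amounts to checking whether more than one entry is strictly positive. The following minimal sketch uses hypothetical example labels over the 10 MNIST classes (not taken from our datasets):

```python
import numpy as np

def is_truly_ambiguous(prob_label: np.ndarray) -> bool:
    """Definition 1: x is truly ambiguous iff P(c|x) > 0 for more than one class c."""
    return int(np.sum(prob_label > 0)) > 1

# Hypothetical probabilistic labels over the 10 MNIST classes.
unambiguous_seven = np.eye(10)[7]            # P(7|x) = 1, all other classes 0
ambiguous_one_or_seven = np.zeros(10)
ambiguous_one_or_seven[[1, 7]] = [0.6, 0.4]  # P(1|x) = 0.6, P(7|x) = 0.4

print(is_truly_ambiguous(unambiguous_seven))       # False
print(is_truly_ambiguous(ambiguous_one_or_seven))  # True
```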

Out-of-Distribution (OOD) Inputs

A prediction-time input is denoted OOD if it was insufficiently represented at training time, which caused the DNN not to generalize well on such types of inputs. This is the primary cause of epistemic uncertainty. OOD test data is used extensively to measure supervisor performance in academic studies, e.g., by modifying nominal data in a model-independent, realistic and label-preserving way (corrupted data) (Zhang et al. 2018; Mu and Gilmer 2019; Hendrycks and Dietterich 2018; Stocco et al. 2020) or by minimally modifying nominal data to fool a specific, given model (adversarial data). In practice, both OOD inputs and true ambiguity are important problems when building DLS supervisors (Humbatova et al. 2020).

Decision Frontier

Much recent literature addresses the characterization of the decision frontier of a given model, i.e., its boundary of predictions between two classes in the input space (Karimi et al. 2019; Kang et al. 2020; Byun and Rayadurgam 2020; Riccio and Tonella 2020). It is important to note that the decision frontier is not equivalent to the set of ambiguous inputs: the decision frontier is model-specific, while ambiguity depends only on the problem definition and is thus independent of the model, i.e., the fact that an input is at a specific model’s frontier does not guarantee that it is indeed ambiguous (it may also be unambiguous, i.e., belong to a specific class with probability 1, or invalid, i.e., have 0 probability to belong to any class). The decision frontier may thus be considered the “model’s ambiguity”, while true ambiguity implies that an input is perceived as ambiguous by a hypothetical, perfectly well trained domain expert (hence matching “human ambiguity” in many classification tasks).

3 Related Work

The research works that are most related to our approach deal with automated test generation for DNNs (Mu and Gilmer 2019; Hendrycks and Dietterich 2018; Tian et al. 2018; Zhang et al. 2018; Stocco et al. 2020; Rauber et al. 2017). In these works, some causes of uncertainty, such as ambiguity, are not considered. Hence, automatically generated tests do not allow meaningful evaluations under ambiguity of DNN supervisors, as well as of the DNN behavior in the absence of supervisors. We illustrate this in Fig. 1: using an off-the-shelf MNIST (LeCun et al. 1998) classifier, we calculated the predictive entropy to identify the 3% of samples (300 out of 10,000) with the presumably highest aleatoric uncertainty in the MNIST test set. Predictive entropy (i.e., the entropy of the Softmax values interpreted as probabilities) is a standard metric used in the related literature (Mukhoti et al. 2021) to detect aleatoric uncertainty, which is caused, amongst other reasons, by truly ambiguous images. Out of these 300 images, we manually selected the ones we considered potentially ambiguous, and show them in Fig. 1. Clearly, some of them are ambiguous, showing that ambiguity exists and is present in the MNIST test set, but the scarcity of truly ambiguous inputs indicates that supervisors cannot be confidently tested for their capability of handling ambiguity using this test set. The manual selection of the 18 (subjectively) most ambiguous images was required to exclude the 282 images that also had high entropy but did not appear truly ambiguous: for some of them, the high entropy was clearly caused by image invalidity. For others, the high entropy was caused by the model’s inability to assign a high likelihood to a single class for an unambiguous, nominal image. The latter serves as an example showing that using just the Softmax value to detect ambiguity might not be ideal and highlights the need for an empirical comparison of the different supervisors’ capability to detect ambiguity (see Section 7).
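For reference, the entropy-based selection described above can be reproduced along the following lines. This is a minimal sketch, not our actual experimental code, and the small CNN is merely a stand-in for any well-trained MNIST classifier:

```python
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., np.newaxis] / 255.0
x_test = x_test.astype("float32")[..., np.newaxis] / 255.0

# Any well-trained MNIST classifier works here; this small CNN is only a stand-in.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=3, batch_size=128, verbose=0)

probs = model.predict(x_test, batch_size=256)              # shape (10000, 10)
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # predictive entropy

# Indices of the 3% (300 out of 10,000) test inputs with the highest entropy.
top_uncertain = np.argsort(entropy)[-300:][::-1]
```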

Fig. 1

The 18 most ambiguous images, manually selected from the 300 (3%) samples with the highest predictive entropy in the MNIST test set (LeCun et al. 1998). Only a few of them are clearly ambiguous, showing that ambiguous data are scarce in existing datasets. Underlined numbers show the actual label, non-underlined numbers show classes we consider possibly having a non-zero probability as well (making the image ambiguous)

In the DNN test input generator (TIG) literature (Mu and Gilmer 2019; Hendrycks and Dietterich 2018; Tian et al. 2018; Zhang et al. 2018; Stocco et al. 2020; Dunn et al. 2021), with just one notable preprint as an exception (Mukhoti et al. 2021), we are not aware of any paper aiming to generate true ambiguity directly; most TIGs aim for other objectives. Some works (Mu and Gilmer 2019; Hendrycks and Dietterich 2018; Weiss and Tonella 2022) propose to corrupt nominal inputs in predefined, natural and label-preserving ways to generate OOD test data. DeepTest (Tian et al. 2018) applies corruptions to road images, e.g., by adding rain, while aiming to generate data that maximizes neuron coverage. Also targeting road images, DeepRoad (Zhang et al. 2018) is a framework using Generative Adversarial Networks (GAN) to change conditions (such as the presence of snow) on nominal images. The Udacity Simulator, used by Stocco et al. (2020), allows corruptions, such as rain or snow, to be added dynamically when testing self-driving cars. Similar to DeepTest, TensorFuzz (Odena et al. 2019) and DeepHunter (Xie et al. 2019) generate data with the objective of increasing test coverage. Again, aiming to generate diverse and unseen inputs, these approaches will mostly generate OOD inputs and only occasionally, if at all, truly ambiguous data.

A fundamentally different objective is pursued in adversarial input generation (Goodfellow et al. 2014), where nominal data is changed not in a natural, but in a malicious way. Based on the tested model, nominal input data is slightly changed to cause misclassifications. The literature and open source tools provide access to a wide range of specific adversarial attacks (Rauber et al. 2017). While very popular, neither input corruptions nor adversarial attacks intentionally generate ambiguous data from nominal, typically non-ambiguous inputs. As they rely on the ground truth label of the modified input remaining unchanged, they do not aim at creating true ambiguity, as affecting the ground truth label would imply unsuccessful test data generation.

Another popular type of test data generators aims to create inputs along the decision boundary: DeepJanus (Riccio and Tonella 2020) uses a model-based approach, while SINVAD (Kang et al. 2020) and MANIFOLD (Byun and Rayadurgam 2020) use the generative power of variational autoencoders (VAE) (Kingma and Welling 2013). Note that we cannot expect inputs along the decision boundary to always be truly ambiguous: they may just as well be OOD, invalid or, in rare cases, even low-uncertainty inputs. In addition, these approaches are by design model-specific, making them unsuitable to generate a generally applicable, model-independent, ambiguous dataset.

Thus, none of the approaches discussed above aims to generate a truly ambiguous dataset. A notable exception is a recent, yet unpublished, preprint by Mukhoti et al. (2021). In their work, to evaluate the uncertainty quantification approach they propose, they needed an ambiguous MNIST dataset. To that end, they used a VAE to generate a vast amount of data (which also contains invalid, OOD and un-ambiguous data), which they then filter and stratify based on two misclassification prediction (MP) techniques, aiming to end up with a dataset consisting of ambiguous images. We argue that, while certainly valuable in the scope of their paper, the so-created dataset is not sufficient as a standard benchmark for DNN supervisors, as the approach itself relies on a supervision technique, hence being circular if used for DNN supervisor assessment. In fact, the created ambiguity may be particularly hard (or easy) to detect for supervisors using different (resp. similar) MP techniques. We nonetheless compared their approach to ours empirically and found it less successful at generating truly ambiguous test data.

4 Uses of Ambiguous Test Sets

In this paper we focus on the usage of ambiguous test data for the assessment of DNN supervisors, but ambiguous data also have other uses, including the assessment of test input prioritizers.

4.1 Assessment of DNN Supervisors

We cannot assume that results on DNN supervisors’ capabilities obtained on nominal and OOD data generalize to ambiguous data. Recent studies (Zhang et al. 2020; Weiss and Tonella 2021, 2022) have shown that there is no clear performance dominance amongst uncertainty quantifiers used as DNN supervisors, but such studies overlook the threats possibly associated with the presence of ambiguity. Warnings about such threats in medical machine learning based systems were raised already in 2000 (Trappenberg and Back 2000), with ambiguity in a cancer detection dataset mentioned as a specific example. The authors proposed to equip the system with an ambiguity-specific supervisor, to “detect and re-classify as ambiguous” (Trappenberg and Back 2000) such threatening data. To test such supervisors, such as the one proposed by Mukhoti et al. (2021), diverse ambiguous data that is independent of both the model and the MP technique is needed.

4.2 Assessment of DNN Input Prioritizers

Test input prioritizers, possibly based on MP, aim to prioritize test cases (inputs) in order to allow developers to detect mis-behaviours (e.g., mis-classifications) as early as possible. Hence, they should be able to recognize ambiguous inputs. Correspondingly, test input prioritizers should also be assessed on ambiguous inputs. On the contrary, when the goal is active learning, an ambiguous input should be given the least priority or excluded altogether, as the aleatoric uncertainty causing its mis-classification cannot, by definition, be avoided using more training data. Thus, recognition of ambiguous test data is clearly of high importance when developing a test input prioritizer, be that to make sure that the ambiguous samples are given a high priority (during testing) or a low priority (during active learning).

Fig. 2

Schematic segmentation of a valid input space: If two classes are separated by ambiguous inputs, a decision boundary of classifier C outside of these ambiguous inputs implies (unambiguous) misclassifications

4.3 Decision-Boundary Oracle

Much recent literature addresses the characterization of the decision boundary of a given model, i.e., its frontier of predictions between two classes in the input space (Karimi et al. 2019; Kang et al. 2020; Byun and Rayadurgam 2020; Riccio and Tonella 2020). Given that, for an ambiguous sample, two or more classes can be considered as true labels, we would expect all ambiguous samples to lie close to the decision boundary of a well trained classifier. Similarly, considering only the valid input space, i.e., the subset of the input space which contains the valid inputs for the given classification problem, the presence of an ambiguous space (AS), i.e., of truly ambiguous samples, implies that the decision boundary of such a classifier must go through the AS. This is illustrated in Fig. 2, which shows an ambiguous space and the decision boundary of a suboptimally trained classifier C. The fact that the decision boundary is not always within the AS implies that the inputs lying between the decision boundary and the AS are unambiguous samples misclassified by C. Moreover, we know that adding data from these enclosed (clear misclassification) areas to the training set will increase the performance of C, which is not necessarily true for samples within the AS. Hence, knowledge about the ambiguity of data near the decision boundary is important to assess the quality of a model and possibly to improve it, when unambiguous data is found at the frontier.

It should be noted that, in this paper, we make no assumptions about the decision boundary and its connection to truly ambiguous inputs; Fig. 2 serves only as an illustration of a situation that may occur in practice. Indeed, in Section 7, we compare supervisors directly relying on the predicted probabilities, and thus also on the decision boundary (such as Vanilla Softmax), with some that do not (such as autoencoders (Stocco et al. 2020)). The good results of Vanilla Softmax and other techniques relying on the decision boundary do suggest that ambiguous samples are very likely to lie close to such a boundary.

4.4 Disentanglement and Reasoning

The identification of ambiguity can be seen as a special case of uncertainty disentanglement and reasoning (Lines 2019; Clements et al. 2019), the former being the quantification of epistemic vs aleatoric uncertainty, and the latter being the separation of uncertainty into specific root causes, such as data invalidity, OOD, or true ambiguity. Recent work has used uncertainty disentanglement to guide training in reinforcement learning (Lines 2019; Clements et al. 2019), building on the idea that only data leading to epistemic uncertainty is useful to drive model performance improvement during continuous learning tasks. Let us consider the following example:

Example 1

(Use of Uncertainty Reasoning) A medical DLS determines if a patient has a specific type of cancer, provided some ultrasonic images.

(1)

    Assume the ultrasonic image reveals an implant of the patient, something that is underrepresented in the training set, making the input OOD and potentially leading the DNN to mistake the implant for something relevant for cancer detection: being able to reliably detect that the input is OOD, the system could ask a (human) expert to label the image. Said label would then be a more reliable prediction, as the human is not confused by the implant. In addition, the now labelled OOD sample can be used in further training loops of the DNN.

(2)

    Assume the input is in-distribution, but there is not enough information in the image to decide whether the patient has cancer: the image is truly ambiguous. By recognizing this true ambiguity, the DLS may make a reliable probabilistic prediction, which would allow the patient to make an informed decision on whether to conduct further diagnosis or treatment.

(3)

    Assume the DNN is given an image which is not an ultrasonic image. Detecting that this input is invalid allows the system to refuse to make any (even probabilistic) prediction and raise an alert.

Case (2) is a particularly realistic case: In AI-guided healthcare, decisions about future treatment and diagnosis are typically made based on probabilistic predictions (de Hond et al. 2022), which can only be trusted if the input is in-distribution.

Another reason for fine-grained uncertainty reasoning is DLS debugging: Informing the developers of a DLS about the root causes of uncertainties and mispredictions would greatly facilitate further improvement of the DLS, especially because DNNs are known for their low explainability (Samek et al. 2017), which makes debugging particularly challenging when dealing with them. Clearly, to develop and test any technique working with uncertainty disentanglement or uncertainty reasoning, the availability of ambiguous data in the test set is a strict prerequisite, and the lack of such datasets is likely the main reason why such research is so scarce.

5 Generating Ambiguous Test Data

We designed AmbiGuess, a TIG targeting ambiguous data for image classification, based on the following design goals (DG):

DG\(_1\) (labelled ambiguity)

The generated data should be truly ambiguous and have correspondingly probabilistic labels, i.e., each generated input is associated with a probability distribution over the set of labels. Probabilistic labels are the most expressive description of true ambiguity, and a single or multi-class label can be trivially derived from probabilistic labels.

DG\(_2\) (model independence)

To allow universal applicability of the generated dataset, our TIG should not depend on any specific DNN under test.

DG\(_3\) (MP independence)

The created dataset should allow fair comparison between different supervisors. Since supervisors are often based on MPs (e.g., uncertainty or confidence quantifiers), our TIG should not use any MP as part of the data generation process, to avoid circularity, which might give some supervisor an unfair advantage or disadvantage over another one.

DG\(_4\) (diversity)

The approach should be able to generate a high number of diverse images.

5.1 Interpolation in Autoencoders

Autoencoders (AEs) are a powerful tool, used in a range of TIGs (Kang et al. 2020; Byun and Rayadurgam 2020; Mukhoti et al. 2021; Dunn et al. 2021). AEs follow an encoder-decoder architecture, as shown in the blue part of Fig. 3: an encoder E compresses an input x into a smaller latent space (LS), and the decoder D then attempts to reconstruct x from the LS. The reconstruction loss, i.e., the difference between the input x and the reconstruction \(\hat{x}\), is minimized during training of the AE.
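As a minimal, illustrative sketch (not the exact architecture used by AmbiGuess), such an encoder-decoder pair could be set up in Keras as follows; the layer sizes and the two-dimensional latent space are arbitrary choices:

```python
import tensorflow as tf

LATENT_DIM = 2  # a small latent space, chosen here only for illustration

# Encoder E: compresses a 28x28 grayscale image x into the latent space (LS).
encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(LATENT_DIM),
])

# Decoder D: reconstructs x_hat from a latent point z.
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(28 * 28, activation="sigmoid"),
    tf.keras.layers.Reshape((28, 28, 1)),
])

# The AE is trained to minimize the reconstruction loss between x and x_hat.
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=..., batch_size=...)
```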

Fig. 3

Autoencoder (blue) and its extension to a Regularized Adversarial Autoencoder (green)

Fig. 4

Image Sampling in the Latent Space

Fig. 5

Interpolation between two classes in the latent space of a 2-class Regularized Adversarial Autoencoder

On a trained AE, sampling arbitrary points in the latent space and using the decoder to construct a corresponding image allows for cheap image generation. This is shown in Fig. 4, where the shown images are not part of the training data, being reconstructions of randomly sampled points in the latent space. In the following section, we leverage the generative capability of AEs by proposing an architecture that can target ambiguous samples specifically and can label the generated data probabilistically (\(DG_1\)).
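Reusing the `decoder` and `LATENT_DIM` from the sketch above, such sampling-based generation boils down to decoding randomly drawn latent points, for example:

```python
import numpy as np

# Decode a few randomly sampled latent points into images.
z = np.random.normal(size=(16, LATENT_DIM)).astype("float32")
generated_images = decoder.predict(z)  # shape (16, 28, 28, 1)
```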

5.2 AmbiGuess

Our TIG AmbiGuess consists of three components: (1) The Regularized LS Generation component, which trains a specifically designed AE to have a LS that facilitates the generation of truly ambiguous samples. (2) The Automatic Labelling component, which leverages the AE architecture to support probabilistic labelling of any images produced by the AE’s decoder. (3) The Heterogeneous Sampling component, which chooses samples in the LS in a way that leads to high diversity of the generated images.

An overview of AmbiGuess, which leverages these three components, is outlined in Fig. 6.

5.2.1 Regularized Latent Space Generation

Interpolation from one class to another in the latent space, i.e., the gradual perturbation of the reconstruction by moving from one cluster of latent space points to another one, may produce ambiguous samples between those two classes (satisfying both \(DG_2\) and \(DG_3\)). An example of such an interpolation is shown in Fig. 5. Clearly, we want the two clusters to be far from each other, providing a wide range for sampling in between them, and no other cluster should be in proximity, as it would otherwise influence the interpolation. However, these two conditions are usually not met by the traditional autoencoders used in other TIG approaches. For example, Fig. 4 shows the LS of a standard variational autoencoder (a popular architecture in TIGs). Here, interpolating between classes 0 and 7 would, among others, cross the cluster representing class 4, and thus samples taken from the interpolation line would clearly not be ambiguous between 0 and 7, but would be reconstructed as a 4 (or as another of the classes whose clusters lie between them). We address these requirements by using 2-class Regularized Adversarial AEs:

2-class AE

Instead of training one AE on all classes, we train multiple AEs, each one with the training data of just two classes. This has a range of advantages: first and foremost, it prevents interference from third classes. Then, as the corresponding reduced (2-class) datasets naturally have a lower variability (feature density), 2-class autoencoders are expected to require fewer parameters and show faster convergence during training. Further, the fact that the number of class combinations \(\binom{c}{2}\) grows quadratically with the number of classes c is of only limited practical relevance: in very large, real-world datasets, ambiguity is much more prevalent between some combinations of classes than between others, so not all pairwise combinations are equally interesting for the test generation task. For example, let us consider a self-driving car component which classifies vehicles on the road. While an image of a vehicle where one cannot say for sure whether it is a pick-up or an SUV (hence having true ambiguity) is clearly a realistic case, an image which is truly ambiguous between an SUV and a bicycle is hard to imagine. This phenomenon is well known in the literature, as it leads to heteroscedastic aleatoric uncertainty (Ayhan and Berens 2018), i.e., aleatoric uncertainty which is more prevalent amongst some classes than amongst others. In such a case, using AmbiGuess, one would only construct the 2-class AEs for selected combinations where ambiguity is realistic.

Regularized Adversarial AE (rAAE)

To guide the training process towards creating two disjoint clusters representing the two classes, with an adequate amount of space between them, we use a Regularized Adversarial Autoencoder (rAAE) (Makhzani et al. 2015). The architecture of an rAAE is shown in Fig. 3: Encoder E, Decoder D and the LS are those of a standard AE. In addition, similar to other adversarial models (Goodfellow et al. 2014), a discriminator Disc is trained to distinguish labelled, encoded images z from samples drawn from a predefined distribution p(z|y). Specifically, we define p(z|y) as a multi-modal (2 classes) multi-variate (number of dimensions in the latent space) Gaussian distribution, consisting of \(p(z|c_1)\) and \(p(z|c_2)\) for classes \(c_1\) and \(c_2\), respectively. Then, training an rAAE consists of three training steps, which are executed in every training epoch: first, similar to a plain AE, E and D are trained to reduce the reconstruction loss; second, Disc is trained to discriminate encoded images from samples drawn from p(z|y); and third, E is trained to fool Disc, i.e., E is trained with the objective that the training set projected onto the latent space matches the distribution p(z|y). This last property can be leveraged for ambiguous test generation: given two classes \(c_1\) and \(c_2\), to clear up space between them in the latent space, we can choose a p(z|y) such that \(p(z|c_1) > \epsilon\) on LS points disjoint from the LS points where \(p(z|c_2) > \epsilon\), for some small \(\epsilon\). For example, assume a two-dimensional latent space: choosing \(p(z|c_1) = \mathcal{N}([-3,0], [1,1])\) and \(p(z|c_2) = \mathcal{N}([3,0], [1,1])\) will, after successful training, lead to a latent space where points representing \(c_1\) are clustered around \((-3,0)\) and points representing \(c_2\) around (3, 0), with few if any points between them, i.e., around (0, 0). This makes reconstructions around (0, 0) potentially highly ambiguous.
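The three training steps could be sketched as follows. This is a simplified illustration of the general adversarial autoencoder recipe, not our implementation: it reuses the `encoder` and `decoder` from the earlier sketch, assumes two-class one-hot labels provided as float32 arrays, and uses an arbitrary small discriminator and arbitrary hyperparameters.

```python
import numpy as np
import tensorflow as tf

LATENT_DIM = 2
PRIOR_MEANS = {0: [-3.0, 0.0], 1: [3.0, 0.0]}   # means of p(z|c1) and p(z|c2)

def sample_prior(labels):
    """Draw z ~ p(z|y): a unit-variance Gaussian centred per class."""
    means = np.array([PRIOR_MEANS[int(y)] for y in labels])
    return (means + np.random.normal(size=means.shape)).astype("float32")

# Disc takes a latent point concatenated with a one-hot label and outputs one logit.
disc = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(LATENT_DIM + 2,)),
    tf.keras.layers.Dense(1),
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
mse = tf.keras.losses.MeanSquaredError()
opt_rec, opt_disc, opt_gen = (tf.keras.optimizers.Adam(1e-4) for _ in range(3))

def train_step(x, y_onehot):
    """One rAAE training step; x: (batch, 28, 28, 1), y_onehot: (batch, 2) float32."""
    z_prior = sample_prior(np.argmax(y_onehot, axis=1))

    # Step 1: train E and D to reduce the reconstruction loss.
    with tf.GradientTape() as tape:
        rec_loss = mse(x, decoder(encoder(x, training=True), training=True))
    ae_vars = encoder.trainable_variables + decoder.trainable_variables
    opt_rec.apply_gradients(zip(tape.gradient(rec_loss, ae_vars), ae_vars))

    # Step 2: train Disc to tell prior samples (label 1) from encoded images (label 0).
    with tf.GradientTape() as tape:
        d_prior = disc(tf.concat([z_prior, y_onehot], axis=1), training=True)
        d_enc = disc(tf.concat([encoder(x), y_onehot], axis=1), training=True)
        disc_loss = bce(tf.ones_like(d_prior), d_prior) + bce(tf.zeros_like(d_enc), d_enc)
    opt_disc.apply_gradients(zip(tape.gradient(disc_loss, disc.trainable_variables),
                                 disc.trainable_variables))

    # Step 3: train E to fool Disc, pushing encodings towards p(z|y).
    with tf.GradientTape() as tape:
        d_enc = disc(tf.concat([encoder(x, training=True), y_onehot], axis=1), training=True)
        gen_loss = bce(tf.ones_like(d_enc), d_enc)
    opt_gen.apply_gradients(zip(tape.gradient(gen_loss, encoder.trainable_variables),
                                encoder.trainable_variables))
    return rec_loss, disc_loss, gen_loss
```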

Fig. 6

High-Level Illustration of AmbiGuess

5.2.2 Probabilistic Labelling of Images

The Disc of a 2-class rAAE can be used to automatically label the images generated by its decoder: given a latent space sample \(z^*\) on a 2-class rAAE for classes \(c_1\) and \(c_2\), \(Disc(z^*, c_1)\) approximates \(p(z^*|c_1)\). Assuming \(p(c_1) = p(c_2) = 0.5\), Bayes’ rule gives \(p(c_1 \mid z^*) \propto p(z^* \mid c_1)\). Hence, up to normalization, \(Disc(z^*, c_1)\) approximates the probability that \(z^*\) belongs to class \(c_1\). The same holds for \(Disc(z^*, c_2)\). Normalizing these two values s.t. they add up to 1 thus provides a probability distribution over the classes (thus realizing \(DG_1\)). This is used in Steps 4 and 7 of Fig. 6.

Clearly, this probabilistic labelling depends on the discriminator being well trained, i.e., on its ability to discriminate between images of classes \(c_1\) and \(c_2\). Thus, we propose to assess the discriminator’s training success by measuring its accuracy at classifying nominal (non-ambiguous) inputs as one of the two classes. Then, rAAEs for which the discriminator’s accuracy does not meet a (tunable) threshold can be discarded (Steps 3 and 4 in Fig. 6).
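A possible realization of this labelling and filtering step, reusing the `encoder` and `disc` from the sketches above, could look as follows; the function names and the sigmoid reading of the discriminator logit are assumptions of this sketch:

```python
import numpy as np
import tensorflow as tf

def probabilistic_label(disc, z, n_classes=2):
    """Normalize Disc(z, c_i) over the two classes to obtain a probabilistic label."""
    scores = []
    for c in range(n_classes):
        y_onehot = tf.one_hot([c], depth=n_classes)
        logit = disc(tf.concat([z, y_onehot], axis=1))
        scores.append(tf.sigmoid(logit).numpy().item())  # Disc's estimate related to p(z | c_i)
    scores = np.array(scores)
    return scores / scores.sum()                         # [P(c1 | z), P(c2 | z)]

def discriminator_accuracy(disc, encoder, x_nominal, y_nominal):
    """Fraction of nominal inputs whose true class receives the higher probability."""
    correct = 0
    for x, y in zip(x_nominal, y_nominal):
        label = probabilistic_label(disc, encoder(x[np.newaxis, ...]))
        correct += int(np.argmax(label) == y)
    return correct / len(x_nominal)

# rAAEs whose discriminator accuracy is below a tunable threshold are discarded, e.g.:
# if discriminator_accuracy(disc, encoder, x_val, y_val) < 0.9: discard this rAAE
```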

Fig. 7

Illustration of the Weight Calculation in the Latent Space

5.2.3 Selecting Diverse Samples in the LS

Diversity in a generated dataset (see \(DG_4\)) is in general hard to achieve when sampling the LS, as the distance between two points in the LS does not directly translate to a corresponding difference between the generated images. While in some parts of the LS, which we denote as high density parts, moving a point slightly can lead to clearly visible changes in the decoder’s output, in low density parts, large chunks of the LS lead to very similar reconstructed images.

To that end, we do not sample the latent space uniformly, but in a weighted way that aims to select diverse images (Step 4 in Fig. 6). Specifically, we set up the sampling of points in the latent space in four steps, as outlined below and illustrated in Fig. 7:

A.

    Confined Latent Space The latent space is practically unbounded, being limited only by its numerical representation. However, we are only interested in a small part of this latent space, namely the area in between the two ‘nominal’ clusters in our 2-class rAAE. This is represented in Step A of Fig. 7. This area, which we denote as confined latent space (CLS), can be defined analytically from the distributions imposed on the 2-class rAAE during training. In the subsequent steps, we consider only the CLS.

B.

    Grid Cells and Anchors We divide the (rectangular) CLS into a grid of rectangular grid cells, where the number of grid cells is a tunable hyperparameter. Then, for every grid cell, we identify the point at its center (the anchor). In the next two steps, we will use this anchor point as a representative of the grid cell when estimating the density in the grid cell, as well as the ambiguity of the images reconstructed from points within the grid cell. With this, we build a weight for each cell, based on which cells are selected during sampling. Within a grid cell, the actually drawn point is then chosen uniformly at random.

C.

    Anchor Ambiguity We calculate the probabilistic label for each anchor, as explained in Section 5.2.2. For labels which are not sufficiently ambiguous, i.e., where the difference between the two class probabilities is higher than some threshold \(\delta _{max}\), the corresponding grid cells are ignored (their sampling weight is set to zero). Thus, \(\delta _{max}\) is a hyperparameter allowing us to steer the minimum level of ambiguity in the anchors of the cells used for sampling. Note that, as points in the grid cells exhibit lower ambiguity than the corresponding anchor, \(\delta _{max}\) does not aim to ensure this level of ambiguity in the resulting test set; this is ensured with a final filtering (Steps 7 and 8 in Fig. 6). However, \(\delta _{max}\) enables the sampling algorithm to consider only regions (i.e., grid cells) of the CLS with a high likelihood to generate ambiguous images, making it overall much more efficient.

D.

    Anchor Gradient We want to focus our sampling on high density regions of the latent space, where small changes in the latent space representation lead to more notable changes in the reconstructed images than in low density regions. We thus estimate the density of each non-ignored grid cell by calculating the norm of the decoder’s gradient at the corresponding anchor point. We use the Euclidean distance to measure differences between decoder outputs (i.e., images), which is required to calculate the gradient. We then use these density estimates (i.e., these norms) as weights when choosing grid cells during sampling (a minimal sketch of this weighted sampling follows below).
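The following sketch shows how the four steps could be assembled for a two-dimensional latent space, reusing `probabilistic_label` and the `decoder` from the earlier sketches. The finite-difference estimate of the decoder's gradient norm, the grid size and all thresholds are illustrative assumptions, not our actual implementation:

```python
import numpy as np

def anchor_weight(decoder, disc, anchor, delta_max=0.25, eps=1e-2):
    """Sampling weight of one grid cell, estimated at its anchor point."""
    z = anchor[np.newaxis, :].astype("float32")
    p = probabilistic_label(disc, z)                  # Step C: anchor ambiguity
    if abs(p[0] - p[1]) > delta_max:
        return 0.0                                    # anchor not ambiguous enough: ignore cell
    # Step D: finite-difference proxy for the norm of the decoder's gradient.
    base = decoder.predict(z, verbose=0).ravel()
    grad_sq = 0.0
    for d in range(z.shape[1]):
        shifted = z.copy()
        shifted[0, d] += eps
        diff = decoder.predict(shifted, verbose=0).ravel() - base
        grad_sq += (np.linalg.norm(diff) / eps) ** 2  # Euclidean distance between images
    return float(np.sqrt(grad_sq))

def sample_latent_points(decoder, disc, cls_bounds, grid=(20, 20), n_samples=100):
    """Weighted sampling from a rectangular confined latent space (Step A)."""
    (x_lo, x_hi), (y_lo, y_hi) = cls_bounds
    xs, ys = np.linspace(x_lo, x_hi, grid[0] + 1), np.linspace(y_lo, y_hi, grid[1] + 1)
    cells, weights = [], []
    for i in range(grid[0]):                          # Step B: grid cells and anchors
        for j in range(grid[1]):
            anchor = np.array([(xs[i] + xs[i + 1]) / 2, (ys[j] + ys[j + 1]) / 2])
            cells.append(((xs[i], xs[i + 1]), (ys[j], ys[j + 1])))
            weights.append(anchor_weight(decoder, disc, anchor))
    probs = np.array(weights) / np.sum(weights)
    chosen = np.random.choice(len(cells), size=n_samples, p=probs)
    # Within a chosen cell, the point is drawn uniformly at random.
    return np.array([[np.random.uniform(*cells[c][0]),
                      np.random.uniform(*cells[c][1])] for c in chosen])
```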

5.3 Pre-Generated Ambiguous Datasets

We built and released two ready-to-use ambiguous datasets for MNIST (LeCun et al. 1998), the most common dataset used in software testing literature (Riccio et al. 2020), where images of handwritten numbers between 0 and 9 are to be classified, and its more challenging drop-in replacement Fashion MNIST (FMNIST) (Xiao et al. 2017), consisting of images of 10 different types of fashion items.

AmbiGuess configuration

For each pair of classes, we trained 20 rAAEs to exploit the non-determinism of the training process and generate even more diversified outputs. To make sure we only use rAAEs where the distribution in the LS is as expected, we check that the discriminator cannot distinguish LS samples obtained from input images from LS samples drawn from p(z|y): the accuracy on this task should be between 0.4 and 0.6. At the same time, we check that the discriminator’s accuracy in assigning a higher probability to the correct label of nominal samples is above 0.9. Otherwise, the rAAE is discarded. Combined, we used the resulting rAAEs to draw 20,000 training and 10,000 test samples for both MNIST and FMNIST, using \(\delta_{max}=0.25\) for test data and \(\delta_{max}=0.4\) for training data; generated samples where the difference between the two class probabilities was above \(\delta_{max}\) were discarded. We chose different values of \(\delta_{max}\) (the upper threshold on the difference between the two class probabilities) for the training and test sets, as our test set should be clearly and highly ambiguous, e.g., to allow studies that specifically target ambiguity (hence a low \(\delta_{max}\)). In turn, the training set should integrate more continuously with the nominal data, hence we also allow for less ambiguous data compared to our ambiguous test set.

6 Evaluation of Generated Data

The goal of this experimental evaluation is to assess, both quantitatively and qualitatively, whether AmbiGuess can indeed generate truly ambiguous data. We evaluate the ambiguity in our generated datasets first through a quantitative analysis of the outputs of a standard, well-trained classifier, and second by visually inspecting and critically discussing samples created using AmbiGuess. Our evaluation is limited to simple grayscale image classification datasets, where the rAAEs are easily trained. See Section 8 for a detailed discussion of the applicability of AmbiGuess.

6.1 Quantitative Evaluation of AmbiGuess

We performed our experiments using four different DNN architectures as supervised models: a simple convolutional DNN (Chollet 2020), a similar but fully connected DNN, a model consisting of ResNet-50 (He et al. 2016) feature extraction and three fully connected layers for classification, and lastly a DenseNet architecture (Huang et al. 2017). Results are averaged over the four architectures; individual results are reported in the reproduction package.

We compare the predictions made for our ambiguous dataset to the predictions made on nominal, non-ambiguous data, using the following metrics:

Top-1 / Regular Accuracy

Percentage of correctly classified inputs. We expect this to be considerably lower for ambiguous than for nominal samples, as choosing the correct (i.e., higher probability) class, even using an optimal model, is affected by chance.

Top-2 Accuracy

Percentage of inputs for which the true label is among the two classes with the highest predicted probability. For samples which are truly ambiguous between two classes, we expect a well-trained model to achieve much better performance than on Top-1 accuracy (ideally, 100%).

Top-Pair Accuracy

A novel metric for data known to be ambiguous between two classes, measured as the percentage of inputs for which the two most likely predicted classes equal the two true classes between which the input is ambiguous. By definition, Top-Pair accuracy is lower than or equal to Top-2 accuracy. It is an even stronger measure, showing that the model is uncertain between exactly the two classes for which the true probabilistic label of the input shows non-zero probability. A specific example of how Top-Pair accuracy is computed is provided in Appendix A.

Entropy

Average entropy in the Softmax prediction arrays. Used as a metric to measure aleatoric uncertainty (and thus ambiguity) in related work (Mukhoti et al. 2021).
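For illustration, all four metrics can be computed from the predicted softmax arrays and the probabilistic ground-truth labels as sketched below; this is a plain NumPy reading of the definitions above, not our evaluation code:

```python
import numpy as np

def evaluate_ambiguous(probs, prob_labels):
    """Top-1, Top-2 and Top-Pair accuracy plus mean softmax entropy.

    probs:       (n, c) predicted softmax arrays
    prob_labels: (n, c) probabilistic ground-truth labels (two non-zero entries each)
    """
    pred_ranked = np.argsort(probs, axis=1)[:, ::-1]        # classes sorted by predicted prob
    true_top = np.argmax(prob_labels, axis=1)               # most likely true class
    true_pair = np.argsort(prob_labels, axis=1)[:, -2:]     # the two true (ambiguous) classes

    top1 = np.mean(pred_ranked[:, 0] == true_top)
    top2 = np.mean([true_top[i] in pred_ranked[i, :2] for i in range(len(probs))])
    top_pair = np.mean([set(pred_ranked[i, :2]) == set(true_pair[i]) for i in range(len(probs))])
    entropy = np.mean(-np.sum(probs * np.log(probs + 1e-12), axis=1))
    return top1, top2, top_pair, entropy
```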

In line with related literature (Mukhoti et al. 2021), we focus our evaluations on models trained using a mixed-ambiguous dataset consisting of both nominal and ambiguous data. This aims to make sure our ambiguous test sets are not OOD, and thus that the observed uncertainty primarily comes from the ambiguity in the data: by adding a lot of data similar to the (ambiguous) test set to the training set, the vast majority of our ambiguous inputs is expected to be in-distribution, eradicating most of the epistemic uncertainty. The aleatoric uncertainty caused by the ambiguity of the data is, however, still present. For completeness, we also run the evaluation on a model trained using only nominal data. With this model we expect even lower values of regular (top-1) accuracy on ambiguous data, as these are out-of-distribution, not just ambiguous.

Table 1 Evaluation of Ambiguity

6.2 Quantitative Results

The results of our experiments, averaged over all tested model architectures, are shown in Table 1. Results individually reported for each architecture are shown in Appendix B. We noticed that the use of the mixed-ambiguous training sets reduces the model accuracy on nominal data only by a negligible amount: On MNIST, the corresponding accuracy is 96.98% (97.42% using a clean training set) and 88.43% on FMNIST (88.37% using a clean training set). Thus, our ambiguous training datasets can be added to the nominal ones without hesitation.

Results indicate that our datasets are indeed suitable to induce ambiguity into the prediction process, as the generated data is perceived as ambiguous by the DNN: Top-1 accuracies for both case studies are around 50%, but they increase almost to the levels of the nominal test set when considering Top-2 accuracies. Even the Top-Pair accuracies, with values of 95.37% and 86.71% (on MNIST and FMNIST, respectively), are very high, showing that, for the vast majority of test inputs, the two classes considered most likely by the well-trained DNN are exactly the classes between which we aimed to create ambiguity. Consistently, entropy is substantially higher for ambiguous data than for nominal data.

Finally, we compared our ambiguous MNIST dataset against AmbiguousMNIST by Mukhoti et al. (2021), the only publicly available dataset aiming to provide ambiguous data. Results are clearly in favour of our dataset. Considering the models with mixed-ambiguous training sets, our test dataset has a lower Top-1 accuracy (53.31% vs. 72.50%), indicating that our dataset is harder (more ambiguous), and a higher Top-2 accuracy (97.99% vs. 90.93%), showing that our dataset contains more samples whose true label is amongst the two classes predicted as most likely. Top-Pair accuracy cannot be computed for AmbiguousMNIST, as 37% of its claimed “ambiguous” inputs have non-ambiguous labels. Most strikingly, the average softmax entropy for AmbiguousMNIST is 0.88 (ours: 1.22), even though AmbiguousMNIST is created by actively selecting inputs with a high softmax entropy.

Fig. 8

Selected good and bad outputs of AmbiGuess, chosen to demonstrate strengths and weaknesses

6.3 Qualitative Discussion of AmbiGuess

Some test samples generated using AmbiGuess, for both MNIST and FMNIST, are shown in Fig. 8. They have been chosen to highlight different strengths and weaknesses that emerged during our qualitative manual review of 300 randomly selected images per case study from our generated test sets.

MNIST

AmbiGuess (see Fig. 8a-e) is in general capable of combining features of different classes, where possible: Fig. 8a and c can both be seen as an 8, but the 8-shape was combined with a 3-shape or a 2-shape, respectively. For the combination between 0 and 7, shown in Fig. 8b, only the upper (horizontal) part of the 7 was combined with the 0-shape, such that both a 7 and a 0 are clearly visible, making the class of the image ambiguous. Figure 8d shows an edge case of an almost invalid image: knowing that the image is supposed to be ambiguous between 1 and 4, one can identify both numbers. However, neither of them is clearly visible, and the image may appear invalid to some humans. Overall, we considered only a few samples generated by AmbiGuess for MNIST as bad, i.e., as clearly unambiguous or invalid. An example is shown in Fig. 8e. Most humans would recognize this image as an unambiguous 0. In fact, there is a barely visible, tilted line within the 0, which apparently was sufficient to trick the rAAE’s discriminator into also assigning a high probability to digit 1.

FMNIST

Realistic true ambiguity is not possible between most classes of FMNIST. Hence, we assessed how well AmbiGuess performs at creating data that would trigger an ambiguous classification by humans, even though such data might be impossible to encounter in the real world. Examples are given in Fig. 8f-j. In most cases (e.g., Fig. 8f-h), the interpolations created by AmbiGuess show an overlay of two items of the two considered classes, with features combined only where possible. We can also observe that some non-common features are removed, giving more weight to common features. For instance, in Fig. 8i, the tip of the shoe and the lower angles of the bag are barely noticeable, such that the image indeed has high similarity with both shoes and bags. As a negative example, we observe that, in some cases, the overlay between the two considered classes is dominated by one of them (such as Fig. 8j, which would be seen as a non-ambiguous ankle boot by most humans).


7 Testing of Supervisors

We assess the capability of 16 supervisors to discriminate nominal from high-uncertainty inputs for MNIST and FMNIST, each on 4 distinct test sets representing different root causes of mis-classifications, among which our ambiguous test set.

7.1 Experimental Setup

We performed our experiments using four different DNN architectures (explained in Section 6.1) as supervised models. Our training sets consist of both nominal and ambiguous data, to ensure that the ambiguous test data used later for testing is in-distribution. We then measure the capability of different supervisors to discriminate different types of high-uncertainty inputs from nominal data. We measure this using the area under the receiver operating characteristic curve (AUC-ROC), a standard, threshold-independent metric.
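As an illustration of how this threshold-independent comparison can be computed from per-input uncertainty scores, the following sketch uses scikit-learn's roc_auc_score; the function and variable names are ours, not part of any supervisor's API:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def supervisor_auc_roc(scores_nominal, scores_uncertain):
    """AUC-ROC of a supervisor's uncertainty scores at separating nominal (0)
    from high-uncertainty (1) inputs; higher scores should mean more uncertain."""
    y_true = np.concatenate([np.zeros(len(scores_nominal)), np.ones(len(scores_uncertain))])
    y_score = np.concatenate([scores_nominal, scores_uncertain])
    return roc_auc_score(y_true, y_score)
```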

We assess the supervisors using the following test sets: invalid test sets, where we use MNIST images as inputs to models trained for FMNIST and vice-versa; corrupted test sets available from related work (MNIST-C (Mu and Gilmer 2019) and FMNIST-C (Weiss and Tonella 2022)); adversarial data, created using 4 different attacks (Madry et al. 2017; Kurakin et al. 2018; Moosavi-Dezfooli et al. 2016; Goodfellow et al. 2014); and, lastly, the ambiguous test sets generated by AmbiGuess. Adversarial test sets were not used with ensembles, as an ensemble does not rely on the (single) model targeted by the considered adversarial test generation techniques.

To account for random influences during training, such as initial model weights, we ran the experiments for each DNN architecture 5 times. Results reported are the means of the observed results.

7.2 Tested Supervisors

To avoid unnecessary redundancy, our description of the tested supervisors is brief and we refer to the corresponding papers for a detailed presentation. Our terminology, implementation and configuration of the first three supervisors described below, i.e., Softmax, MC-Dropout and Ensembles, are based on the material released in our recent empirical studies (Weiss and Tonella 2021, 2022).

Plain Softmax

Based solely on the softmax output array of a DNN prediction, these approaches provide very fast and easy-to-compute supervision: Max. Softmax, which uses the highest softmax value as confidence (Hendrycks and Gimpel 2016); the Prediction-Confidence Score (PCS), i.e., the difference between the two highest softmax values (Zhang et al. 2020); DeepGini, the complement of the squared norm of the softmax vector (Feng et al. 2020); and, finally, the entropy of the predicted softmax probabilities (Weiss and Tonella 2021).
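These four scores can be derived from a single softmax array along the following lines; taking complements so that higher values consistently indicate higher uncertainty is a convention of this sketch, not necessarily of the original tools:

```python
import numpy as np

def softmax_supervisor_scores(probs):
    """Uncertainty scores from a (n, c) array of softmax outputs (higher = more uncertain)."""
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]
    max_softmax = 1.0 - sorted_probs[:, 0]                    # complement of Max. Softmax
    pcs = 1.0 - (sorted_probs[:, 0] - sorted_probs[:, 1])     # complement of PCS
    deep_gini = 1.0 - np.sum(probs ** 2, axis=1)              # DeepGini
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # Softmax Entropy
    return max_softmax, pcs, deep_gini, entropy
```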

Monte-Carlo Dropout (MC-Dropout)

Gal and Ghahramani (2016); Gal (2016a) Enabling the randomness of dropout layers at prediction time and collecting multiple randomized predictions allows the inference of an output distribution, and hence of an uncertainty quantification. We use the quantifiers Variation Ratio (VR), Mutual Information (MI), Predictive Entropy (PE), and the highest value of the mean of the predicted softmax likelihoods (Mean-Softmax, MS).
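A sketch of these four quantifiers, assuming a Keras model with dropout layers kept active by calling it with training=True; the number of forward passes and the vectorized formulas are illustrative choices:

```python
import numpy as np

def mc_dropout_scores(model, x, n_samples=20):
    """VR, MI, PE and MS from n_samples stochastic forward passes with active dropout."""
    # training=True keeps dropout enabled at prediction time.
    samples = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])  # (T, n, c)
    mean_probs = samples.mean(axis=0)

    ms = 1.0 - mean_probs.max(axis=1)                                  # Mean-Softmax (as uncertainty)
    pe = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)      # Predictive Entropy
    expected_entropy = -np.mean(np.sum(samples * np.log(samples + 1e-12), axis=2), axis=0)
    mi = pe - expected_entropy                                         # Mutual Information
    votes = samples.argmax(axis=2)                                     # (T, n) class votes
    mode_count = np.array([np.bincount(votes[:, i]).max() for i in range(votes.shape[1])])
    vr = 1.0 - mode_count / n_samples                                  # Variation Ratio
    return vr, mi, pe, ms
```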

Ensembles

Lakshminarayanan et al. (2017) Similar to MC-Dropout, uncertainty is inferred from samples, but randomness is induced by training multiple models (under random influences such as initial weights) and collecting predictions from all of them. Here, we use the quantifiers MI, PE and MS, on an ensemble consisting of 20 models.

Dissector

Wang et al. (2020) On a trained model, for each layer, a submodel (more specifically, a perceptron) is trained, predicting the label directly from the activations of the given layer. From these outputs, the support value of each submodel for the prediction made by the final layer is derived, and the overall prediction validity is computed as a weighted average of the per-layer support values.

Table 2 Supervisors performance at discriminating nominal from high-uncertainty inputs (AUC-ROC), averaged over all architectures

Autoencoders

AEs can be used as OOD detectors: if the reconstruction error of a well-trained AE for a given input is high, the input is likely not sufficiently represented in the training data. Stocco et al. (2020) proposed to use such an OOD detection technique as a DNN supervisor. Based on their findings, we use a variational autoencoder (Kingma and Welling 2013).
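A minimal sketch of this supervisor, assuming any trained (variational) autoencoder with a matching input shape; the mean-squared reconstruction error is one possible choice of error measure:

```python
import numpy as np

def reconstruction_error_scores(autoencoder, x):
    """Per-input reconstruction error, used as an uncertainty score:
    a high error suggests the input is out-of-distribution."""
    x_hat = autoencoder.predict(x, batch_size=256)
    return np.mean((x.reshape(len(x), -1) - x_hat.reshape(len(x), -1)) ** 2, axis=1)
```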

Surprise Adequacy

This approach detects inputs that are surprising, i.e., for which the observed DNN activation pattern is OOD w.r.t. the ones observed on the training data.

We consider three techniques to quantify surprise adequacy: LSA (Kim et al. 2018), where surprise is calculated based on a kernel-density estimator fitted on the training activations of the predicted class; MDSA (Kim et al. 2020), where surprise is calculated based on the Mahalanobis distance between the tested input’s activations and the training activations of the predicted class; and DSA (Kim et al. 2018), which is calculated as the ratio between two Euclidean distances: the distance between the tested input and the closest training set activation in the predicted class, and the distance between the latter activation and the closest training set activation from another class. As DSA is computationally intensive, growing linearly in the number of training samples, we follow a recent proposal to consider only 30% of the training data (Weiss et al. 2021).
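Following the textual description of DSA above, a direct (unoptimized) NumPy sketch could look like this; `activation` is the activation trace of the tested input at a chosen layer and `train_acts` the traces of the (possibly subsampled) training set, both assumptions of this sketch:

```python
import numpy as np

def dsa(activation, pred_class, train_acts, train_labels):
    """Distance-based Surprise Adequacy of a single input's activation trace."""
    same = train_acts[train_labels == pred_class]
    other = train_acts[train_labels != pred_class]
    # Closest training activation within the predicted class ...
    dists_same = np.linalg.norm(same - activation, axis=1)
    reference = same[np.argmin(dists_same)]
    dist_a = dists_same.min()
    # ... and the closest training activation of any other class to that reference.
    dist_b = np.linalg.norm(other - reference, axis=1).min()
    return dist_a / dist_b
```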

Our comparison includes most of the popular supervisors used in recent software engineering literature. Some of the excluded techniques do not provide a single, continuous uncertainty score and no AUC-ROC can thus be calculated for them (Catak et al. 2021; Mukhoti et al. 2021; Postels et al. 2020), or they are not applicable to the image classification domain (Hussain et al. 2022). With its 16 tested supervisors, two case studies and four different data-centric root causes of DNN faults, our study is, to the best of our knowledge, by far the most extensive of its kind.

7.3 Results

The overall observed results (averaged over all models) are presented in Table 2. Per-architecture results with the corresponding standard deviations are shown in Appendix C.

Ambiguous Data

We can observe that the predicted softmax likelihoods capture aleatoric uncertainty rather well. Thus, not only do Max. Softmax, DeepGini, PCS and Softmax Entropy perform well at discriminating ambiguous from nominal data, but so do supervisors that rely on the softmax predictions indirectly, such as Dissector, or the MS, MI and PE quantifiers on samples collected using MC-Dropout or Deep Ensembles. DSA, LSA, MDSA and Autoencoders are not capable of detecting ambiguity, and barely any of their AUC-ROCs exceeds the 0.5 value expected from a random classifier on a balanced dataset. MDSA, LSA and DSA show particularly low values, which confirms that they do only one job (detecting OOD, not ambiguous, data), but they do it well (in our experimental design, ambiguous data is in-distribution by construction, while adversarial, corrupted and invalid data is OOD).

Adversarial Data

The surprise adequacy based supervisors and the autoencoder reliably detected the unknown patterns in the input, discriminating adversarial from nominal data. Softmax-based supervisors showed good results on MNIST, but less so on FMNIST. Clearly, the adversarial sample detection capabilities of Softmax-based supervisors depend critically on the choice of adversarial data: minimal perturbations, just strong enough to trigger a misclassification, are easy for softmax-based metrics to detect, as the maximum of the predicted softmax likelihoods is artificially reduced by the adversarial technique being used. However, one could apply stronger attacks, increasing the predicted likelihood of the wrong class close to 100%, which would make Softmax-based supervisors ineffective. Specific attacks against the other supervisors, i.e., the OOD detection based approaches (surprise adequacies and autoencoders) and Dissector, might also be possible in theory, but they are clearly much harder.

Corrupted Data

Most approaches perform comparably well, with the exception of DSA on FMNIST, which shows superior performance, with an average AUC-ROC value more than 0.1 higher than most other supervisors. DNNs are known to sometimes map OOD data points close to feature representations of in-distribution points (known as feature collapse) (van Amersfoort et al. 2021), thus leading to softmax output distributions similar to the ones of in-distribution images. This negatively impacts the OOD detection capability of Softmax-based supervisors (such as Max. Softmax, MC-Dropout, Ensembles or Dissector), especially in cases with a feature-rich training set, such as FMNIST.

Invalid Data

The best result in invalidity detection was achieved using the autoencoder’s reconstruction error, which identified FMNIST inputs given to an MNIST classifier with an AUC-ROC of \(\approx 1.00\) (thus with almost perfect accuracy). Clearly, reconstructions of images with higher feature complexity than the ones represented in the training set consistently lead to high reconstruction errors and thus provided a very reliable outlier detection. The autoencoder was, however, incapable of detecting MNIST images given to an FMNIST classifier (AUC-ROC of \(\approx 0.49\), similar to a random classifier). Here, it seems that an autoencoder trained on a high feature-complexity training set also learns to reconstruct low feature-complexity inputs accurately. DSA and MDSA showed a similar effect, also providing clearly inferior results in the FMNIST case study compared to MNIST, although the drop in performance was less dramatic. Also, similar to corrupted data and most likely due to feature collapse, the performance of supervisors relying on softmax likelihoods suffers dramatically.

Discussion

Related literature suggests that no single supervisor performs well under all conditions (Zhang et al. 2018; Weiss and Tonella 2021, 2022; Catak et al. 2021), and some works even suggest that certain supervisors are not capable of detecting anything but aleatoric uncertainty (e.g., Softmax Entropy (Mukhoti et al. 2021) or MC-Dropout (Osband 2016)). Our evaluation, to the best of our knowledge, is the first one to compare supervisors on four different uncertainty-inducing test sets. We found that softmax-based approaches (including MC-Dropout, Ensembles and Dissector) are effective on all four types of test sets, i.e., their detection capabilities reliably exceed the performance expected from a random classifier. They have their primary strength in the detection of ambiguous data, where the other, OOD-focused techniques are naturally ineffective, but they are actually an inferior choice when targeting epistemic uncertainty. To detect corrupted inputs, DSA exhibited the best performance, but due to its high computational complexity it may not be suitable for all domains. The much faster MDSA may offer a good trade-off between detection capability and runtime complexity. Regarding invalid inputs, on low-feature problems, where invalid samples are expected to be more complex than nominal inputs, AEs provide a fast approach with the additional advantage that they do not rely on the supervised model directly, but only on its training set, which may facilitate maintenance and continuous development. For problems where the nominal inputs are rich in diverse features, an AE is not a valid option; however, our results again suggest MDSA as a reliable and fast alternative. As for inputs created by adversarial attacks, softmax-based approaches are easily deceived and hence of limited practical utility. On the other hand, OOD detectors, such as surprise adequacy metrics and AEs, or Dissector, can provide more reliable detection performance against standard adversarial attacks. Of course, these supervisors are not immune to particularly malicious attackers that target them specifically; here, the reader can refer to the wide range of research discussing defenses against adversarial attacks (see the survey by Akhtar et al. (2021)).

Stability of results

We found that our results are barely sensitive to random influences due to training: out of 488 reported mean AUC-ROCs (4 architectures, 8 test sets, 16 MPs, averaged over 5 runs), most showed a negligible standard deviation. The average observed standard deviation was 0.015 and the highest was 0.124; only 114 were larger than 0.02 and only 29 were larger than 0.05, all of which correspond to results with a low mean AUC-ROC (< 0.9). The latter differences do not influence the overall observed tendencies.


8 Threats to Validity

External validity

We conducted our study on misclassification prediction considering two standard case studies, MNIST and Fashion-MNIST. While our observations may not generalize to more challenging, high-uncertainty datasets, the choice of two simple datasets with easily understandable features allowed us to achieve a clear and sharp separation of the reasons for failures, which may not be possible when dealing with more complex datasets. On the other hand, we recognize the importance of replicating and extending this study considering additional datasets. To support such replications, we provide all our experimental material as open source/data.

Internal validity

The supervisors being compared include hyper-parameters that require some tuning. Whenever possible, we reused the original values and followed the guidelines proposed by the authors of the considered approaches. We also conducted a few preliminary experiments to validate and fine-tune such hyper-parameters. However, the configurations used in our experiment could be suboptimal for some supervisors.

Conclusion validity

We repeated our experiments 5 times to mitigate the non-determinism associated with the DNN training process. While this might look like a low number of repetitions, we checked the standard deviation across such repetitions and found that it was negligible or small in all cases. To account for the influence of the DNN architecture, we performed our experiments on 4 completely different DNN architectures, obtaining overall consistent findings.

9 Conclusion

This paper brings two major advances to the field of DNN supervision testing: First, we proposed AmbiGuess, a novel technique to create labeled ambiguous images in a way that is independent of the tested model and of its supervisor, and we generated pre-compiled ambiguous datasets for two of the most popular case studies in DNN testing research, MNIST and Fashion-MNIST.

Using four different metrics, we were able to verify the validity and ambiguity of our datasets, and we further investigated how AmbiGuess achieves ambiguity based on a qualitative analysis. On the four considered quantitative indicators, AmbiGuess clearly outperformed AmbiguousMNIST, the only similar-purposed dataset in the literature.

We assessed the capabilities of 16 DNN supervisors at discriminating nominal from ambiguous, adversarial, corrupted and invalid inputs. To the best of our knowledge, this is not only the largest empirical case study comparing DNN supervisors in the literature, it is also the first one to do so by specifically targeting four distinct and clearly separable data-centric root causes of DNN faults. Our results show that softmax-based approaches (including MC-Dropout and Ensembles) work very well at detecting ambiguity, but have clear disadvantages when it comes to adversarial, corrupted, and invalid inputs. OOD detection techniques, such as surprise adequacy or autoencoder-based supervisors, often provide a better detection performance with the targeted types of high-uncertainty inputs. However, these approaches are incapable of detecting in-distribution ambiguous inputs.

DNN developers can use the ambiguous datasets created by AmbiGuess to assess novel DNN supervisors on their capability to detect aleatoric uncertainty. They can also use our tool to evaluate test prioritization approaches on their capability to prioritize ambiguous inputs (depending on the developers’ objectives, high priority is desired to identify inputs that are likely to be misclassified during testing; low priority is desired to exclude inputs with probabilistic labels from the training set).

As future work, we plan to investigate the concept of true ambiguity for regression problems. This is relevant in domains, such as self-driving cars and robotics, where the DNN output is a continuous signal for an actuator. This problem is particularly appealing as all the approaches in our study that worked well at detecting ambiguity are based on softmax and thus are not applicable to regression problems.

Additionally, a comprehensive human experiment evaluating and comparing the ambiguity of data in nominal datasets, data created using AmbiGuess, and data generated by other approaches would help to better understand the nature of these datasets.