Assessing Systematic Weaknesses of DNNs using Counterfactuals

With the advancement of DNNs into safety-critical applications, testing approaches for such models have gained more attention. A current direction is the search for and identification of systematic weaknesses that put safety assumptions based on average performance values at risk. Such weaknesses can take on the form of (semantically coherent) subsets or areas in the input space where a DNN performs systematically worse than its expected average. However, it is non-trivial to attribute the reason for such observed low performances to the specific semantic features that describe the subset. For instance, inhomogeneities within the data w.r.t. other (non-considered) attributes might distort results. However, taking into account all (available) attributes and their interaction is often computationally highly expensive. Inspired by counterfactual explanations, we propose an effective and computationally cheap algorithm to validate the semantic attribution of existing subsets, i.e., to check whether the identified attribute is likely to have caused the degraded performance. We demonstrate this approach on an example from the autonomous driving domain using highly annotated simulated data, where we show for a semantic segmentation model that (i) performance differences among the different pedestrian assets exist, but (ii) only in some cases is the asset type itself the reason for this reduction in the performance.


Introduction
Recently, there has been great interest in deploying deep neural networks (DNNs) for computer vision tasks in safety-critical applications like autonomous driving (Siam et al. 2018) or medical diagnostics (Hesamian et al. 2019). However, rigorous testing for verification and validation (V&V) of DNNs is still an open problem. Without sufficient V&V, using DNNs for such safety-critical applications can lead to dangerous situations. While a large body of work has focused on improving robustness (Hendrycks et al. 2019), defending against adversarial attacks (Akhtar and Mian 2018; Chakraborty et al. 2018), and improving domain generalization (Oza et al. 2021; Wang and Deng 2018), only recently, a few works have focused on identifying systematic weaknesses of DNNs (d'Eon et al. 2022; Eyuboglu et al. 2022; Gannamaneni, Houben, and Akila 2021; Lyssenko et al. 2021; Syed Sha, Grau, and Hagn 2020). Such systematic weaknesses, however, pose a significant challenge from a safety perspective, as seen by several real-world examples (Buolamwini and Gebru 2018; De Vries et al. 2019), where DNNs have been shown to perform worse for certain subsets of data. Such effects have been extensively studied in the context of Fairness, see, e.g., (Wang et al. 2020), where often bias due to skin color or gender is addressed. However, this is not sufficient to investigate the safety of a DNN. Instead, all dimensions that could potentially influence performance have to be taken into account.
This can be seen in the broader context of trustworthiness assessments for ML, specifically regarding reliability. Multiple high-level requirements or upcoming standards point out the issues of data completeness or coverage to varying levels of detail. For instance, (High-Level Expert Group on AI (AI HLEG) 2019) discusses data completeness in the context of Fairness (as presented above) but also states that the robustness of the application is bound to a specific application domain. Other approaches, such as (Loh et al. 2022), see a rigorous definition of the application's input space and aligning the used data to it as an important aspect of trustworthiness. Likely, such approaches will rely on semantic, human-understandable definitions of data dimensions to define the domain and, potentially, also to demonstrate coverage.
An informal example to illustrate the concept of data specification and its relation to potential systematic weaknesses could be a classifier that distinguishes between cats and dogs. While it would be a common procedure to investigate the performance of such a classifier by, e.g., a confusion matrix between the two classes, one could (and for critical algorithms should) extend both the task description and the investigation to include sub-classes. For instance, one could evaluate the classifier's performance on the various dog breeds the classifier was built for (definition of the input space) and ascertain whether each of these breeds is recognized with sufficient performance. To achieve the latter, data, in most cases, must come with sufficiently detailed attribution. Obtaining such meta-information for unstructured data, e.g., images as used for object detection, is a challenging objective in its own right. Recent works, see, for instance, (d'Eon et al. 2022; Eyuboglu et al. 2022), use dedicated neural networks or parts of the investigated networks to obtain information on the data, while other approaches, such as ours (Gannamaneni, Houben, and Akila 2021) or, e.g., (Fingscheidt, Gottschalk, and Houben 2022), focus on a controlled data generation process, for example, by using synthetic data, which yields a detailed data description as a "by-product". The latter, more labeling-oriented approaches have the advantage that the obtained information is aligned with the semantic dimensions of the data. For the former approaches, outcomes are often raw data subsets whose descriptive attributes have to be uncovered in an additional step.
The analysis of identified (potential) weaknesses is often more complex as multiple data dimensions can influence the outcome simultaneously. For instance, in the above example, is the dog breed a performance-limiting factor, or can the weakness be attributed to another factor, either in a causal (e.g., size of the breed) or only correlated fashion (e.g., if many images of one dog breed were taken by a different camera model)? Answering such questions requires a detailed (meta-)description of the data, which is challenging outside of synthetic data and requires consideration of the impact of multiple dimensions and their interactions. The latter point often leads to a "combinatorial explosion" as soon as second- and higher-order interactions are taken into account. At the same time, correct identification of data subsets with weak performance is important as it is otherwise hard to mitigate such systematic failure modes. For instance, the common approach for mitigation is to generate or obtain more data samples and add them to the training data.
As data gathering or generation is costly, it is important to understand better which data is needed to improve coverage and training. Towards this goal, we propose a computationally cheap method that helps to determine whether identified performance differences between two semantically distinguished subsets $\mathcal{X}$ and $\mathcal{Y}$ can be attributed to either the semantically distinguishing property between the sets or other described but not considered factors. For this, we adapt the concept of counterfactual explanations (Verma, Dickerson, and Hines 2020), where we pair the elements of $\mathcal{X}$ only to those elements in $\mathcal{Y}$ which are most similar w.r.t. all other known attributes. If our method determines that the distinguishing semantic property between the sets can be considered the true cause of the performance difference, generating more data for the weak set should be a useful measure to improve the DNN performance. If not, the method allows for identifying semantic subsets of $\mathcal{X}$ or $\mathcal{Y}$, respectively, that are likely to contain the true cause of the performance gap.

Related work
In this section, we discuss literature related to finding systematic weaknesses of machine learning models by classifying them based on the input data type. We first present approaches working with structured, e.g., tabular data, and then present approaches for unstructured data such as images. Drawbacks of the existing approaches are presented subsequently. Lastly, we introduce works on counterfactuals.
For structured data, SliceFinder (Chung et al. 2019) identifies weak slices (subsets) of data by ordering slices based on certain criteria like performance, effect, and slice size. These weak slices correspond to the systematic weakness of the models. Similarly, SliceLine (Sagadeeva and Boehm 2021) provides an enumeration of all slice combinations using a scoring function and different pruning methods. In addition, subgroup search (Atzmueller 2015; Herrera et al. 2011) is an extensively researched field in data mining for identifying interesting subgroups or subsets of data based on certain quality criteria. However, at higher dimensions, subgroup search methods are computationally very intensive. Complexity grows even further when going from structured to unstructured data, such as images. In these cases, spaces are intrinsically (very) high dimensional. Accordingly, there has been no widespread use of such methods for computer vision.
Due to this complexity and the difficulty of obtaining metadata for unstructured data like images, some approaches have been proposed to use intrinsic information from the DNNs. Eyuboglu et al. (2022) developed DOMINO for finding systematic weaknesses by using cross-modal representations generated from a pre-trained CLIP (Radford et al. 2021) model. However, using CLIP, a black-box DNN model in the embedding generation process, adds further complexity to validating the DNN-under-test. Spotlight (d'Eon et al. 2022) uses representations from the final layers of DNNs to identify contiguous regions of high loss. These contiguous regions (slices) with the highest loss are then considered weak slices that could lead to potential systematic weaknesses of the DNN. Both these approaches are restricted to classification tasks. Furthermore, due to the nature of the approach, resulting weaknesses need to be interpreted, either by other ML models or humans, to derive actionable insight into its nature, i.e., to identify the common cause of all found weakly performing elements. This approach hinders a systematic evaluation and might also be prone to errors.
With a particular focus on autonomous driving, several works (Gannamaneni, Houben, and Akila 2021; Lyssenko et al. 2021; Syed Sha, Grau, and Hagn 2020) have used computer simulators to generate metadata to identify systematic weaknesses in object detection and semantic segmentation models. In our earlier work (Gannamaneni, Houben, and Akila 2021), we made modifications to the Carla simulator (Dosovitskiy et al. 2017) to generate pedestrian-level metadata. With the generated data, we trained a DeepLabv3+ (Chen et al. 2017) model and identified several (potential) systematic weaknesses w.r.t. digital asset type and skin color. Similarly, Lyssenko et al. (2021) evaluated DNN performance along a semantic feature, the distance of the pedestrian to the ego-vehicle, using data generated from their modified Carla simulator and showed a linear decrease in performance with increasing distance. Syed Sha, Grau, and Hagn (2020) proposed a validation engine, 'VALERIE', and evaluated the performance of two different DNNs w.r.t. metadata attributes like pixel-area, occlusion-rate, and distance of pedestrians. All those works circumvent complexity by investigating features in isolation. In this work, we, however, take all (available) features into account.
With regards to counterfactuals, Verma, Dickerson, and Hines (2020) show that a large body of work in explainable AI has proposed counterfactual methods to provide explanations of DNN behavior by investigating "what if" scenarios. Several methods (Dandl et al. 2020; Goyal et al. 2019; Kanamori et al. 2020; Mothilal, Sharma, and Tan 2020; Wachter, Mittelstadt, and Russell 2017) structure finding counterfactuals as an optimization problem using gradients of the model-under-test similar to finding adversarial attacks. However, unlike in adversarial attacks, counterfactual methods can possess several additional constraints (Verma, Dickerson, and Hines 2020) such as validity (Wachter, Mittelstadt, and Russell 2017), actionability (Karimi, Schölkopf, and Valera 2021; Ustun, Spangher, and Liu 2019), closeness to data manifold (Dhurandhar et al. 2018; Joshi et al. 2019; Mahajan, Tan, and Sharma 2019) and/or sparsity in feature changes (Guidotti et al. 2018; Karimi et al. 2020). Counterfactuals also differ from feature attribution methods like LIME (Ribeiro, Singh, and Guestrin 2016) or Shapley Values (Lundberg and Lee 2017) as counterfactuals identify new inputs which lead to change in predictions rather than attributing predictions to a set of features.
While these methods have mostly restricted themselves to tabular data and simple image datasets like MNIST (LeCun, Cortes, and Burges 2010), our method performs counterfactual evaluation on a semantic segmentation dataset for autonomous driving using metadata descriptions of pedestrians in the images. In addition, the goal of most counterfactual methods, as they concern fairness, is to (actively) change a decision or output, while our focus is on investigating performance differences. Furthermore, the gradient-based approaches require additional inference steps to investigate their (newly created) counterfactual sample. Our method instead uses a statistical formulation of counterfactuals (Pearl 2019) as, for unstructured data, creating new "what if" scenarios is (computationally) too expensive, even when using simulators, to afford a meaningful analysis over many data points.

Method
In this section, we provide a general description of our counterfactual algorithm and how it may be used to investigate performance differences within a given dataset $\mathcal{D}$. As our approach is general, we relegate instantiating $\mathcal{D}$ (and its properties) to the next section. Here, it is sufficient that all elements of $\mathcal{D}$ can be seen as individual inputs or points of interest (later: those will be separate pedestrians, and the task will be their recognition), for each of which we can obtain a respective performance value (later: intersection-over-union (IoU)) and a semantic description based on multiple dimensions $S = \{s_1, \ldots, s_n\}$ (later: for instance, distance from the vehicle or asset membership).
As motivated, we are interested in identifying DNN weaknesses in terms of data subsets that perform weakly due to properties inherent to the data, e.g., occlusion or underrepresented input types. Furthermore, we want to validate whether the decreased performance can be attributed to the identified property.
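For this, let us consider slices $\mathcal{X}, \mathcal{Y} \subseteq \mathcal{D}$ of the data where variances in performance exist between them. Such slices can be obtained, e.g., by selecting all elements which have one (or multiple) fixed semantic properties (or ranges thereof) in common, e.g., $\mathcal{X} = \{x \mid x \in \mathcal{D} \wedge s_{\mathrm{asset}}(x) = \mathrm{asset}_1\}$. For this notation, we assume that $P(x)$ and $P(\mathcal{X})$ are the performance value of the element $x \in \mathcal{X}$ or the set of performance values over all of $\mathcal{X}$, respectively. This allows us to write the "conventional" performance difference between the two slices as

$$\Delta_{\mathrm{con}}(\mathcal{X}, \mathcal{Y}) = \mu[P(\mathcal{X})] - \mu[P(\mathcal{Y})], \tag{1}$$

where $\mu$ denotes the mean value of the respective set. We may also formulate a related quantity based on the set of local performance differences

$$A_{\mathrm{rnd}}(\mathcal{X}, \mathcal{Y}) = \{P(x_i) - P(y_i)\}, \tag{2}$$

where $i$ provides a fixed but arbitrary index of the sets $\mathcal{X}, \mathcal{Y}$, respectively. If we were to average over all possible such indices, we would find that

$$\big\langle \mu[A_{\mathrm{rnd}}(\mathcal{X}, \mathcal{Y})] \big\rangle = \Delta_{\mathrm{con}}(\mathcal{X}, \mathcal{Y}). \tag{3}$$

Looking at eq. (2), we observe that the differences disregard the semantic contexts of the pairings $x_i, y_i$ and could compare datapoints with entirely different properties (e.g., pedestrians of high occlusion with clearly visible ones). We, therefore, build a dedicated paired dataset $C_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y})$, where for each element of the reference dataset $\mathcal{X}$ we find the most similar (by semantic description) datapoint in the other dataset $\mathcal{Y}$, as shown in equation (4),

$$C_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y}) = \{(x_i, y^{\mathrm{cf}}_i) \mid x_i \in \mathcal{X}\}, \tag{4}$$

where the counterfactual datapoints in $\mathcal{Y}$ are selected by

$$y^{\mathrm{cf}}_i = \operatorname*{arg\,min}_{y_j} \mathrm{dist}_d(x_i, y_j), \quad \text{where } y_j \in \mathcal{Y} \setminus \{x_i\}, \tag{5}$$

which minimizes the distance $\mathrm{dist}_d$ using a subset $d \subseteq S$ of the semantic features $S$. Refer to the next section for a concrete example of our metric. Based on this paired set we can calculate the set of counterfactual differences

$$A^{\tau}_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y}) = \{P(x_i) - P(y^{\mathrm{cf}}_i) \mid (x_i, y^{\mathrm{cf}}_i) \in C_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y}) \wedge \mathrm{dist}_d(x_i, y^{\mathrm{cf}}_i) < \tau\}, \tag{6}$$

taking into account only those pairings that are closer than a threshold $\tau$ to ensure sufficiently close proximity of points. The corresponding average performance difference is given by

$$\Delta_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y}) = \mu[A^{\tau}_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y})]. \tag{7}$$

Please note that, importantly, the counterfactual difference $\Delta_{\mathrm{cf}}$ does not have to coincide with $\Delta_{\mathrm{con}}$. If, for instance, $|\Delta_{\mathrm{cf}}| \ll |\Delta_{\mathrm{con}}|$, the performance differences between $\mathcal{X}$ and $\mathcal{Y}$ can likely not be attributed to the semantics used to separate both sets from $\mathcal{D}$, but instead are a property of some other latent discrepancy between the two sets. In these cases, one might want to investigate the non-matched elements, i.e., those elements of $\mathcal{X}, \mathcal{Y}$ that are not part of $C_{\mathrm{cf}}$, as they might carry another distinguishing attribute. Please note that the matching procedure of eq. (5) is, in most cases, not commutative between the sets. However, when $\mathcal{X}, \mathcal{Y}$ do not intersect, our fixed order of operations ensures $\Delta_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y}) = -\Delta_{\mathrm{cf}}(\mathcal{Y}, \mathcal{X})$. Conversely, if $\mathcal{X} = \mathcal{Y}$, our definition avoids collapse as the points cannot be matched onto themselves. A more procedural definition to determine the counterfactual difference $\Delta_{\mathrm{cf}}$ can be found in algorithm 1 below.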
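Algorithm 1: Determining the counterfactual difference $\Delta_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y})$. The procedure yields the mean performance difference over the matched pairs; otherwise it returns Null, i.e., the sets could not be matched for the selected maximum distance of $\tau$.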

Experimental Setup
In this section, we describe the used dataset, the DNN-under-test, and the concrete implementation of the counterfactual calculations and their metrics.
Dataset - For our experiments, we use the dataset generated from our previous work, Gannamaneni, Houben, and Akila (2021), which contains extracted meta-information for each pedestrian visible within an image. It was generated using Carla Simulator v0.9.11 (Dosovitskiy et al. 2017) and contains 23 classes following a mapping similar to Cityscapes (Cordts et al. 2016). It consists of images with traffic scenes, corresponding semantic segmentation ground truth, and pedestrian meta-information. This way, and using additional computer vision postprocessing, we obtain the set $S$ of our semantic features encompassing {dist, visibility, num pixels, size, asset id, $x_{\min}$, $y_{\min}$, $x_{\max}$, $y_{\max}$}. Here, dist refers to the Euclidean distance of the pedestrian from the ego vehicle, visibility to the percentage of the pedestrian that is unoccluded, num pixels to the (absolute) number of pixels belonging to the pedestrian, and asset id gives an identifier for the 3D model used by the simulator. The coordinates $x, y$ belong to the bounding box of the pedestrian, and size provides its respective area. With this setup, we generated 7 394 images (all from the "Town02" map in Carla) and trained the semantic segmentation model DeepLabv3+ (Chen et al. 2017) on them. We follow the same training setup and data pre-processing as our prior work (Gannamaneni, Houben, and Akila 2021). To investigate the performance of the said model, we added the individually achieved IoU (Intersection-over-Union) to the collection of our per-pedestrian metadata. In the remainder, all analysis will be performed on this resulting table of performance and meta-data descriptions containing a total of 24 424 entries.

Implementation of Counterfactual Similarity -
To build counterfactual datasets as given by eq. (4), we need to specify a distance among the semantic dimensions. For this, we use the Euclidean metric, where we used one-hot encoding for the categorical asset id and re-scaled all numerical dimensions to the unit range $[0, 1]$. If not stated otherwise, the cut-off parameter $\tau$, see eq. (6), is chosen as 0.2. In the cases where $\mathcal{X}$ and $\mathcal{Y}$ do not intersect, we can calculate counterfactual pairings (more) efficiently using a ball-tree algorithm to find the nearest points. More concretely, the problem can be seen as a k-NN classification for $k = 1$, where we are interested only in the performance of the nearest point in $\mathcal{Y}$ w.r.t. a sample from $\mathcal{X}$.
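The encoding and distance described above can be illustrated with a short Python sketch. It assumes a hypothetical table layout (a list of dicts) and made-up values; the helper names are illustrative only. A brute-force search stands in for the ball-tree lookup mentioned in the text, which returns the same nearest neighbor but scales worse with the number of points.

```python
import math


def min_max_scale(values):
    """Rescale a numeric column to the unit range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode(rows, numeric_keys, categorical_key=None):
    """Turn rows (dicts) into feature vectors: scaled numerics plus one-hot category."""
    scaled = {k: min_max_scale([r[k] for r in rows]) for k in numeric_keys}
    categories = sorted({r[categorical_key] for r in rows}) if categorical_key else []
    vectors = []
    for i, row in enumerate(rows):
        vec = [scaled[k][i] for k in numeric_keys]
        vec += [1.0 if row[categorical_key] == c else 0.0 for c in categories]
        vectors.append(vec)
    return vectors


def nearest(query, candidates):
    """Brute-force 1-NN under the Euclidean metric; returns (index, distance)."""
    dists = [math.dist(query, c) for c in candidates]
    idx = min(range(len(dists)), key=dists.__getitem__)
    return idx, dists[idx]


# Hypothetical per-pedestrian rows; the values are invented for illustration.
rows = [
    {"dist": 10.0, "visibility": 1.0, "asset_id": 9},
    {"dist": 30.0, "visibility": 0.5, "asset_id": 23},
    {"dist": 50.0, "visibility": 0.0, "asset_id": 9},
]
vectors = encode(rows, ["dist", "visibility"], "asset_id")
idx, d = nearest(vectors[0], vectors[1:])
```

Note that with one-hot encoding two points from different categories are always at least $\sqrt{2}$ apart, which is well above a cut-off such as $\tau = 0.2$. Under this encoding, a categorical dimension included in the distance therefore acts as a hard constraint rather than a soft similarity.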

Results
We conducted three different experiments. First, we evaluate the expressive power of semantic features in the similarity search. These results provide insight into the noise level of the performance values and to which degree they depend on the known semantic attributes. In the second subsection, the results of the counterfactual analysis for a semantic-dimension-under-test are provided and discussed. Finally, using results from the counterfactual analysis, we show that interesting subsets of the slices can be discovered, which can be used by the developer to further narrow down which data dimensions to investigate.

Evaluating expressive power of semantic features
Our counterfactual analysis assumes that the semantic features are expressive enough to provide, on average, some indication of the performance of the DNN-under-test, i.e., that more similar points are more likely to have a similar performance. We perform the following check to validate this assumption: Setting $\mathcal{X} = \mathcal{Y} = \mathcal{D}$, we can evaluate the statistics of $A^{\infty}_{\mathrm{cf}}(\mathcal{D}, \mathcal{D})$, see eq. (6), while including an increasing number of semantic dimensions into $d$. The special case of $d = \{\}$ corresponds to a purely random pairing of the elements. To provide a baseline, assuming a flat i.i.d. distribution of the performance values in the full range of $[0, 1]$, $A_{\mathrm{cf}}$ would follow a triangular distribution with element-wise differences ranging from $-1$ to $+1$. The standard deviation of such a triangular distribution would be $\sigma_{\mathrm{tri}} = 1/\sqrt{6} \approx 0.41$.
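The baseline value can be checked directly: the difference of two independent uniform variables on $[0, 1]$ has variance $1/12 + 1/12 = 1/6$. The following Python sketch compares this closed form with a small simulation; the function names, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random


def triangular_difference_std():
    """Std of U1 - U2 for independent U1, U2 ~ Uniform[0, 1]."""
    # Each uniform has variance 1/12; variances add for independent variables.
    return math.sqrt(1 / 12 + 1 / 12)


def simulated_difference_std(n=100_000, seed=0):
    """Empirical std of randomly paired differences of uniform values."""
    rng = random.Random(seed)
    diffs = [rng.random() - rng.random() for _ in range(n)]
    m = sum(diffs) / n
    return math.sqrt(sum((d - m) ** 2 for d in diffs) / n)


sigma_tri = triangular_difference_std()
sigma_sim = simulated_difference_std()
```

Both values should agree to about two decimal places with the quoted $1/\sqrt{6} \approx 0.41$.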

Any decrease of $\sigma[A^{\infty}_{\mathrm{cf}}]$ from this threshold indicates a deviation from pure randomness. We provide an overview of the development of $\sigma$ on the r.h.s. of Figure 1 for increasing numbers of used semantic properties. The random pairing ($|d| = 0$) is still close to the threshold with $\sigma[A_{\mathrm{rnd}}] \approx 0.35$. Taking a single feature into account ($|d| = 1$) already leads to a decrease of (depending on the feature selected) up to 13%, while including multiple features leads to stronger but saturating decay. The l.h.s. of Figure 1 provides a graphical representation of this statement by showing histograms of $A^{\infty}_{\mathrm{cf}}$ for some selected $d$. We can interpret the decay of the standard deviation further by comparison to a toy experiment. For this, we consider two uniform distributions, each with standard deviation $\sigma_{\mathrm{sing}}$, that are shifted against one another by an offset $\Delta\mu$. We can see the membership of an element to either of these distributions as a semantic property of this toy example. If we are unaware of it and investigate elements that are equally likely to be drawn from either of the distributions, we will naturally observe a larger spread of $\sigma_{\mathrm{both}}$. If we identify this spread with the standard deviation we have seen for the case of $|d| = 0$ and likewise use the $|d| = 5$ case as value for $\sigma_{\mathrm{sing}}$, we can estimate (in a rough fashion) the scale of potential shifts $\Delta\mu$. For this, we use the relation between $\sigma_{\mathrm{both}}$, $\sigma_{\mathrm{sing}}$, and $\Delta\mu$ in this setting, which holds for the toy model only. It, however, suggests a value of $\Delta\mu \approx 0.2$ as an approximate range.
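To make the toy experiment concrete, the sketch below uses the standard variance of an equal-weight mixture of two shifted distributions, $\sigma_{\mathrm{both}}^2 = \sigma_{\mathrm{sing}}^2 + (\Delta\mu/2)^2$. This is an assumption about how the spreads are related, applied to the drawn values themselves; it is not necessarily the exact relation used for the estimate above, and the numbers are purely illustrative.

```python
import math
import random


def mixture_std(sigma_single, delta_mu):
    """Std of an equal-weight mixture of two distributions shifted by delta_mu."""
    return math.sqrt(sigma_single ** 2 + (delta_mu / 2) ** 2)


def shift_from_spreads(sigma_both, sigma_single):
    """Invert the mixture relation to estimate the shift between the components."""
    return 2 * math.sqrt(sigma_both ** 2 - sigma_single ** 2)


def simulate_mixture_std(sigma_single, delta_mu, n=200_000, seed=0):
    """Empirical std when sampling from either uniform component with equal odds."""
    rng = random.Random(seed)
    half_width = sigma_single * math.sqrt(3)  # Uniform[-w, w] has std w / sqrt(3).
    samples = []
    for _ in range(n):
        centre = 0.0 if rng.random() < 0.5 else delta_mu
        samples.append(centre + rng.uniform(-half_width, half_width))
    m = sum(samples) / n
    return math.sqrt(sum((s - m) ** 2 for s in samples) / n)


# Illustrative values only, not taken from the experiments.
sigma_both = mixture_std(0.1, 0.2)
recovered_shift = shift_from_spreads(sigma_both, 0.1)
sigma_both_sim = simulate_mixture_std(0.1, 0.2)
```

The point of the sketch is the direction of the argument: an unobserved attribute that shifts the mean inflates the observed spread, so the reduction in spread after conditioning on known attributes bounds the size of shifts one can expect to find.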

Counterfactual analysis
Having established that using all five semantic features is beneficial in the similarity search, we focus on the left-out asset id to identify if there are indeed systematic weaknesses present for slices in this semantic feature (semantic-feature-under-test). To identify interesting $(\mathcal{X}, \mathcal{Y})$ combinations, we make use of the average performance of the assets for the entire training data as shown on the left in Figure 2.
We consider the highest performing assets (26, 24, 4, 8) and the lowest performing assets (9, 23) as candidates for counterfactual analysis. The results for different combinations of these assets are shown in Table 1. For each combination, we provide the conventional difference in performance, $\Delta_{\mathrm{con}}$, see eq. (1), the counterfactual difference in performance, $\Delta_{\mathrm{cf}}$, eq. (5), and the random pairing difference, $\Delta_{\mathrm{rnd}}$, eq. (2). The latter we provide as a sanity check and find that it, as expected according to eq. (3), approximately follows the conventional difference. Additionally, the observed performance differences roughly abide, in their maximal values, by the scale of $\Delta\mu$ from the previous section. However, when comparing counterfactual and conventional performance differences, we find, in some cases, strong discrepancies. In the case of the weakly performing asset 9, the counterfactual difference to all strong performing assets is negligible, suggesting that the decreased performance of this asset is due to the presence of pedestrians that are challenging due to some properties other than their membership to asset 9, i.e., this asset (in itself) does not constitute a systematic weakness. Hence, just generating more (training) data for asset 9 would be an ineffective way to increase its performance. Yet, the circumstances leading to the decreased performance in the slice, even if unrelated to the asset type, should be investigated further; see the results from the next experiment. In contrast to asset 9, for asset 23 (the other weak candidate), the counterfactual differences rather emphasize the performance discrepancy to all well-performing assets, making it a likely candidate for an actual systematic weakness. Therefore, generating more data for asset 23 could be useful. This is also supported by investigating the number of samples of the different assets in the training data and the average performance per asset as shown on the right in Figure 2. We can see that assets 4, 9, and 24 have a similar amount of samples. However, asset 23 has only half as many samples as asset 26 (the asset with the highest performance).

Discovering residual subsets
As discussed above, asset 9 is underperforming, but asset membership does not seem to constitute the true cause of the issue. Hence, just generating more data for this asset is unlikely to resolve the problem efficiently, which is also supported by the fact that there is already a relatively high number of data available for asset 9 (see right-hand side of Figure 2). For example, looking at asset 8, we see that it has much less data, but significantly higher performance than asset 9. In the following, we intend to identify, purely based on the semantic description, the data causing this effect. When building counterfactual pairs among the samples from $\mathcal{X}, \mathcal{Y}$, only a subset of the larger set (here w.l.o.g. named) $\mathcal{Y}$ might be used.
In such a case, this provides a way to slice $\mathcal{Y}$ into two subsets, i.e., a paired subset and a residual set of samples that are never used for pairing,

$$\mathcal{Y}_{\mathrm{NN}} = \{y \in \mathcal{Y} \mid \exists\, x \in \mathcal{X} : (x, y) \in C_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y})\}, \qquad \mathcal{Y}_{\mathrm{res}} = \mathcal{Y} \setminus \mathcal{Y}_{\mathrm{NN}}.$$

Intuitively, we expect that the average performance of the paired subset should be closer to the one of the reference set, $\mu[P(\mathcal{X})]$, while the residual set shifts in the opposite direction. We demonstrate this approach by contrasting the weakly performing asset 9 with the four highest-performing assets. In the three cases where those asset sets contain more samples, we use the samples from asset 9 to split those sets into paired and residual sets, respectively. As seen from Table 2, the paired sets have a performance that is comparable to the one of asset 9. The other way around, the residual sets are performing better than the slices as a whole. Using the smaller but equally well-performing set of asset 8, we can also split the slice of asset 9 into two, of which the paired one shows high performance. Importantly, although these sub-divisions have a strong impact on the observed performance, they are
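The split into paired and residual subsets described in this subsection can be sketched as follows. This is a hedged Python illustration, not the authors' code: the data layout (lists of `(feature_vector, performance)` tuples), the function name, and the toy values are hypothetical, the nearest neighbor is found by brute force, and no distance cut-off is applied.

```python
import math
from statistics import mean


def split_paired_residual(X, Y):
    """Split Y into elements matched by some x in X and never-matched elements."""
    used = set()
    for features_x, _ in X:
        # Index of the element of Y closest to x in the semantic feature space.
        j = min(range(len(Y)), key=lambda k: math.dist(features_x, Y[k][0]))
        used.add(j)
    paired = [Y[j] for j in sorted(used)]
    residual = [Y[j] for j in range(len(Y)) if j not in used]
    return paired, residual


# Toy data: one scaled feature per sample; performance values are made up.
reference = [((0.20,), 0.30), ((0.28,), 0.35)]  # small, weakly performing slice
larger = [((0.2,), 0.30), ((0.3,), 0.40), ((0.8,), 0.90), ((0.9,), 0.95)]

paired, residual = split_paired_residual(reference, larger)
paired_mean = mean(p for _, p in paired)
residual_mean = mean(p for _, p in residual)
```

In this toy case the paired subset inherits the low performance of the reference slice, while the residual subset performs better than the larger slice as a whole, which is the pattern the text describes for the asset slices.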

Conclusion
In this work, we have motivated a more detailed investigation of performance differences to identify systematic weaknesses in DNNs. This analysis is based on slicing, where semantic features of the data are used to form meaningful subsets. Within our approach, these features stem from the data generation process but, in theory, can also be obtained through other methods. It is, however, impossible to attribute all potential performance-influencing dimensions, which forms a limitation of many approaches that aim to find weak slices, as non-annotated or non-discovered characteristics can contain unresolved weaknesses. Nonetheless, even with a limited amount of information, it is computationally challenging to identify semantic weaknesses correctly, as accounting for the interaction between them leads to a "combinatorial explosion." However, it is also not straightforward to simply attribute performance loss to specific semantic features when studying them in isolation. This is caused, for instance, by inhomogeneities and sparseness of the data in high dimensions. For this reason, we propose a method inspired by counterfactual explanations where we identify neighboring points between two slices of data based on their semantic similarity. As these points are neighbors in the $(n - 1)$-dimensional feature space, i.e., all features except the semantic-feature-under-test that defines the respective slices, the influence of these additional features is reduced. This allows us to study whether the semantic-feature-under-test constitutes an actual weakness of the model while avoiding prohibitive computational costs of considering the interactions of all features. In our experiment from the autonomous driving domain, we could thus show that when considering the pedestrian asset type as semantic-dimension-under-test, for only one of the two weakest performing assets is the asset membership the likely factor for the degraded performance. Such insight is valuable for further improvement of the DNN, as the generation of additional (training) data to mitigate weaknesses becomes costly for complex tasks such as object detection or semantic segmentation. In a second experiment, as an extension to the one before, we investigate one asset in further detail, where the asset itself was not the cause of the performance degradation. Here, we demonstrate that counterfactual matching can be used to sub-partition existing slices based on their semantic features such that the weakly performing subset is carved out. This allows a more refined analysis and can help find the actually relevant semantic dimensions among the remaining $n - 1$ features more easily, given that the sub-partition forms a smaller dataset.

Figure 1: Left: The histogram depicting the difference in performance for samples between two datasets using three matching techniques: using a single feature with nearest neighbor matching (red), using five features with nearest neighbor matching (green), and random matching (blue) (best seen in color). Right: The reduction in the standard deviation of the performance difference distributions as more semantic features are used in the NN search.

Figure 2: Left: Comparison of the average performance of the different digital assets. The numbers on the y-axis represent the unique asset IDs. Right: Comparison of the average performance of the different assets and the count of samples in the training data. The numbers next to a point represent the asset ID.

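Table 1: For different $(\mathcal{X}, \mathcal{Y})$ combinations, the difference in mean performance using the conventional, random, and nearest neighbor pairing is shown. For convenience, $\mathcal{Y}$ denotes the weaker performing set. Columns: asset slices $(\mathcal{X}, \mathcal{Y})$, $\Delta_{\mathrm{con}}(\mathcal{X}, \mathcal{Y})$, $\Delta_{\mathrm{cf}}(\mathcal{X}, \mathcal{Y})$, $\Delta_{\mathrm{rnd}}(\mathcal{X}, \mathcal{Y})$.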

Table 2: The average performance of residual and nearest neighbor subsets of $\mathcal{Y}$.