Evaluating the Stability of Semantic Concept Representations in CNNs for Robust Explainability

Analysis of how semantic concepts are represented within Convolutional Neural Networks (CNNs) is a widely used approach in Explainable Artificial Intelligence (XAI) for interpreting CNNs. A motivation is the need for transparency in safety-critical AI-based systems, as mandated in various domains like automated driving. However, to use the concept representations for safety-relevant purposes, like inspection or error retrieval, these must be of high quality and, in particular, stable. This paper focuses on two stability goals when working with concept representations in computer vision CNNs: stability of concept retrieval and of concept attribution. The guiding use-case is a post-hoc explainability framework for object detection (OD) CNNs, towards which existing concept analysis (CA) methods are successfully adapted. To address concept retrieval stability, we propose a novel metric that considers both concept separation and consistency, and is agnostic to layer and concept representation dimensionality. We then investigate impacts of concept abstraction level, number of concept training samples, CNN size, and concept representation dimensionality on stability. For concept attribution stability we explore the effect of gradient instability on gradient-based explainability methods. The results on various CNNs for classification and object detection yield the main findings that (1) the stability of concept retrieval can be enhanced through dimensionality reduction via data aggregation, and (2) in shallow layers where gradient instability is more pronounced, gradient smoothing techniques are advised. Finally, our approach provides valuable insights into selecting the appropriate layer and concept representation dimensionality, paving the way towards CA in safety-critical XAI applications.


Introduction
Advancements in deep learning in the last decade have led to the ubiquitous use of deep neural networks (DNNs), in particular CNNs, in computer vision (CV) applications like object detection. While they exhibit state-of-the-art performance in many fields, their decision-making logic stays opaque due to their black-box nature [4,44]. This raises concerns about their safety and fairness, which are required in fields like automated driving or medicine. These demands are formalized in industrial standards and legal regulations. For example, the ISO 26262 [1] automotive functional safety standard recommends manual inspectability, and the General Data Protection Regulation [13] as well as the upcoming European Union Artificial Intelligence Act [43] both demand algorithm transparency. These concerns are the subject of XAI.
XAI is a subfield of AI that focuses on revealing the inner workings of black-box models in a way that humans can understand [37,5,27]. One approach involves associating semantic concepts from natural language with internal representations in the DNN's latent space [37]. In computer vision, a semantic concept refers to an attribute that can describe an image or image region in natural language (e.g., "pedestrian head", "green") [10,23]. These concepts can be associated with vectors in the CNN's latent space, also known as concept activation vectors (CAVs) [23]. Post-hoc CA involves acquiring and processing CAVs from trained CNNs [23,30,2]. The CAVs can be used to quantify how concepts attribute to CNN outputs, which in turn can be applied to the verification of safety [35] or fairness [23]. However, in the literature two paradigms of post-hoc CA have so far been considered separately, even though they need to be combined to fully compare CNN learned concepts against prior human knowledge. These paradigms are: supervised CA, which investigates pre-defined concept representations [23,10,35], and unsupervised CA, which retrieves learned concepts [49,11] and avoids expensive labeling costs. Furthermore, current XAI approaches are primarily designed and evaluated for small classification and regression tasks [35,2], whereas more complex object detectors as used in automated driving require scalable XAI methods that can explain specific detections instead of just a single classification output.
Besides adaptation to object detection use-cases, high-stakes applications like safety-critical perception have high demands regarding the quality and reliability of verification tooling [19, Chap. 11]. A particular problem is stability: One should obtain similar concept representations given the same CNN, provided concept definitions, and probing data. Unstable representations that vary strongly with factors like CA initialization weights [31] or imperceptible changes of the input [40] must be identified and used only very cautiously. Stability issues may arise both in the retrieval of the concept representations and in their usage. Retrieval instability was already identified as an issue in the base work [23], and may lead to concept representations of different quality or even different semantic meaning for the same concept. Instability in usage may especially occur when determining local concept-to-output attribution. In particular, the baseline approach proposed by Kim et al. [23] uses sensitivity, which is known to be brittle with respect to slight changes in the input [41,40].
This work tackles the aforementioned problems of OD-ready supervised and unsupervised CA, and of measurement and improvement of stability in CA retrieval and attribution. Concretely, we propose an XAI framework based on supervised and unsupervised CA methods for ODs. The unsupervised method is used to automatically mine concept samples, which are jointly used for supervised concept analysis with manually labeled concepts. Furthermore, stability metrics are suggested and tested. The respective main contributions of our work are:
- Proposal of two metrics and a methodology for testing concept retrieval stability and concept attribution stability in CA;
- Experimental study of stability influence factors in six diverse CNN models with different backbones, with the main findings that CAV dimensionality reduction may improve stability, and that gradient smoothing may be beneficial for concept attribution stability in shallow layers;
- Adaptation of supervised and unsupervised concept-based analysis methods for CA on common ODs;
- Introduction of a post-hoc, label-efficient, concept-based explainability framework for classifiers and ODs allowing for concept stability estimation (Fig. 1).
In the following, we will first take a look at related work on concept analysis in Sec. 2. Our approaches for combining supervised and unsupervised CA, for CA in OD, and for stability measurement are then detailed in Sec. 3. Our experimental setup can be found in Sec. 4 with results detailed in Sec. 5.

Supervised Concept Analysis
There are two primary paradigms in supervised CA methods: scalar-concept representation [25,34,6] and vector-concept representation [3,10,23]. Scalar-concept representations refer to disentangled deep neural network (DNN) layer representations with a one-to-one correspondence between neurons and distinct semantic concepts. A prominent example and base work are Concept Bottleneck Models [25] (CBM). These introduce an interpretable bottleneck layer to DNNs by assigning each neuron to a specific concept, i.e., scalar-concepts. An extension, CBM-AUC [34], enhances the model's capability by automatically learning unsupervised concepts (AUC) that describe the residual variance of the feature space. In contrast to the previous examples, Concept Whitening [6] is a post-hoc approach towards scalar-concepts. It transforms the feature space of a layer and reduces redundancy between neurons, making it more likely for each neuron to correspond to a single concept. IIN [9] is another post-hoc approach that trains an invertible neural network to map a layer output to a disentangled version, using pairwise labels. However, standard CNNs are typically highly entangled [22]. Hence, such scalar-concept approaches have to enforce the disentangled structure during training or utilize potentially non-faithful proxies [29]. Furthermore, they are limited to explaining a single layer.
Vector-concepts, on the other hand, associate a concept with a vector in the latent space. The base work in this direction still disregarded the distributed nature of CNN representations: The Network Dissection approach [3] aims to associate each convolutional filter in a CNN with a semantic concept. Its successor Net2Vec [10] corrects this issue by associating a concept with a linear combination of filters, resulting in a concept being globally represented by a vector in the feature space, the concept activation vector (CAV) [23]. A sibling state-of-the-art method for associating concepts with latent space vectors is TCAV [23], which also uses a linear model attached to a CNN layer to distinguish between neurons (in contrast to filters as in Net2Vec) relevant to a given concept and the rest. TCAV also proposes a gradient-based approach that allows for the evaluation of how sensitive a single prediction or complete class is to a concept. The concept sensitivity (attribution) for a model prediction is calculated by taking the dot product between the concept activation vector and the gradient vector backpropagated for the desired prediction. These vector-concept baselines for classification (TCAV) and segmentation (Net2Vec) of concepts have been extended heavily over the years, amongst others towards regression concepts [14,15], multi-class concepts [21], and locally linear [46,47] and non-linear [21] CAV retrieval. However, the core idea remained untouched.
While the TCAV paper already identifies stability as a potential issue, it resorts to significance tests over large series of experiments, leaving open a thorough analysis of stability (both for concept retrieval and concept attribution) as well as the investigation of improvement measures. Successor works tried to stabilize the concept attribution measurement. For example, Pfau et al. [30] do not use the gradient directly, but the average change of the output when perturbing the intermediate output towards the CAV direction in latent space to different degrees. This gradient stabilization approach follows the idea of Integrated Gradients [41], but no other approaches like Smoothed Gradients [40] have been tried. Other approaches also suggest improved metrics for global concept attribution [15]. However, to our knowledge, a thorough analysis of stability has remained unexplored so far.
We address this gap by utilizing TCAV as a baseline global concept vector representation for the stability estimation. Moreover, as a gradient-based method, it can be adapted to estimate concept attributions in other model types, such as ODs (see Sec. 3.2). It is important to note that our stability assessment method is not limited to TCAV and can potentially be applied to evaluate the stability of other global concept representations.

Unsupervised Concept Analysis
Unsupervised methods for analyzing concepts are also referred to as concept mining [36]. These methods do not rely on pre-defined concept labels, but the acquired concepts are not always meaningful and require manual revision. There are two main approaches to concept mining: clustering and dimensionality reduction. Clustering methods, such as ACE [12] and VRX [11], group latent space representations of image patches (superpixels), obtained through segmentation algorithms. The resulting clusters are treated as separate concepts and can be used for supervised concept analysis. Invertible Concept Extraction (ICE) [49] is a dimensionality reduction method based on non-negative matrix factorization. It mines non-negative concept activation vectors (NCAVs) corresponding to the most common patterns from sample activations in intermediate layers of a CNN. The resulting NCAVs are used to map sample activations to concept saliency maps, which show concept-related regions in the input space.
Fig. 1: The framework for estimation of CAV stability and concept attribution stability. The proposed solution utilizes unsupervised ICE to aid concept discovery and labeling, while supervised TCAV is used for the generation of concept representations.
To reduce the need for concept labeling, we opted to use ICE for unsupervised concept mining due to (1) its superior performance regarding interpretability and completeness of mined concepts compared to clustering [49], and (2) its simpler and more straightforward pipeline with fewer hyperparameters. Unlike ACE, it does not rely on segmentation and clustering results as an intermediate step, which makes it easier to apply.

Concept Analysis in Object Detection
There are only a few existing works that apply concept analysis methods to object detection, due to scalability issues. In [35] the authors adapt Net2Vec for scalability to OD activation map sizes, which is later used to verify compliance of the CNN behavior with fuzzy logical constraints [38]. Other TCAV-based works apply lossy average pooling to allow large CAV sizes [14,7], but do not test OD CNNs. However, these methods are fully supervised and require expensive concept segmentation maps for training, resulting in scalability issues regarding concept label needs. In order to reduce the need for concept labels, we propose adapting and using a jointly supervised and unsupervised classification approach for object detection, and investigate the impact of CAV size on stability. This also closes the gap that, to our knowledge, no unsupervised CA method has been applied to OD-sized CNNs so far.

Proposed Method
The overall goal targeted here is a CA framework that allows stable, label-efficient retrieval and usage of interpretable concepts for explainability of both classification and OD backbones. To address this, we introduce a framework that combines unsupervised CA (for semi-automated enrichment of the available concept pool) with supervised CA (for retrieval of CAVs and CNN evaluation) together with an assessment strategy for its stability properties. An overview of the framework is given in the following in Sec. 3.1, with details on how we adapted CA for OD in Sec. 3.2. Sec. 3.3 then presents our proposal of CAV stability metrics. Lastly, one of the potential influence factors on stability, namely CAV dimensionality and parameter reduction techniques, is presented in Sec. 3.4.

Stability Evaluation Framework
The framework depicted in Fig. 1 aims to efficiently combine supervised and unsupervised CA methods for use in explainability or evaluation purposes, like our CA stability evaluation. To achieve this, it (1) builds an extensible Concept Pool containing human-validated Mined Concepts extracted from the trained Model Under Test, and (optionally) existing manually Labeled Concepts; and it (2) uses these concepts to obtain CAVs and, e.g., conduct CAV Stability and Concept Attribution Stability tests on object detection and classification models.
Concept Pool Creation/Extension. In some CV domains, it can be challenging to find publicly available datasets with high-quality concept labels. In order to streamline the manual annotation process and speed up concept labeling, we utilize unsupervised concept mining. The left side of Fig. 1 depicts the process of creating the Concept Pool (or extending it, if we already have an initial set of Labeled Concepts) by employing the Concept Miner. A concept in the concept pool is represented by a set of images or image patches showing the concept. To extract additional Mined Concepts, the Concept Miner identifies image patches that cause common patterns in the CNN Image Activations. The activations are extracted from the layer of interest of the Backbone of the Model Under Test for Input Images from the mining set. In our work, we utilize ICE [49] as the Concept Miner to obtain the image patches. The workflow of ICE is as follows: (1) it first mines NCAVs; then, for each NCAV and each sample from a test set, (2) it applies NCAV inference, i.e., obtains a (non-binary) heatmap of where the NCAV activates in the image, and (3) masks the input image with the binarized heatmap. For details see Sec. 2.2 and [49]. The sets of mined image patches, alias concepts, next undergo Manual Concept Validation: A human annotator assigns a label to each Mined Concept. These Interpreted Concepts, if meaningful, can either directly be added to the set of Labeled Concepts or be utilized in Synthetic Concept Generation to obtain more complex synthetic concept samples (see Sec. 4.4 and Fig. 3 for more details and visual examples). It should be noted that the Concept Pool, once established, is model-agnostic and can be reused for other models, and that the ICE concept mining approach can be exchanged for any other suitable unsupervised CA method that produces concept heatmaps during inference.
Concept Stability Analysis. Now that the Concept Pool is established, we can perform supervised CA to obtain CAVs for the concepts in the pool. The CAV training is done on the Concept Activations, i.e., CNN activations of concept images from the Labeled Concepts in the Concept Pool. Given CAVs, we can then calculate per-sample concept attribution using, e.g., backpropagation-based sensitivity methods [23]. The resulting CAVs and Concept Sensitivity Scores can then be used for local and global explanation purposes. To ensure their quality, this work investigates their stability (CAV Stability and Concept Attribution Stability) for OD use-cases, as detailed in Sec. 3.3.
For supervised CA we use the base TCAV [23] approach: A binary linear classifier is trained to predict the presence of a concept from the intermediate neuron activations in the selected CNN layer. The classifier weights serve as the CAV, namely the vector that points in the direction of the concept in the latent space. The CAVs are trained in a one-against-all manner on the labeled concept examples from the Concept Pool. For concept attribution, we adopt the sensitivity score calculation from [23]: the sensitivity score for a sample is the directional derivative of the CNN output in the direction of the concept, which is calculated as the dot product between the CAV and the gradient vector in the CAV layer. In this paper, we are interested in the stability of this retrieval process for obtaining CAVs and the respective concept attributions.
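The sensitivity score described above reduces to a single dot product once the gradient at the CAV layer is available. The following is a minimal sketch in plain Python, assuming flattened activations; the function name `concept_sensitivity` and the toy values are hypothetical and only serve illustration, they are not taken from the TCAV implementation.

```python
def concept_sensitivity(gradient, cav):
    """Directional derivative of the output along the CAV: dot(gradient, cav)."""
    if len(gradient) != len(cav):
        raise ValueError("gradient and CAV must have the same length")
    return sum(g * v for g, v in zip(gradient, cav))


# Toy example for a flattened 4-dimensional layer (hypothetical values).
cav = [0.5, -0.5, 0.0, 1.0]        # weights of the linear concept classifier
gradient = [0.2, -0.4, 1.0, 0.1]   # d(output) / d(layer activation)
score = concept_sensitivity(gradient, cav)

# A positive score means that moving the activation towards the concept
# increases the selected output; a negative score means it decreases it.
print(score > 0)
```

Only the sign and relative magnitude of the score are interpreted; the CAV is not required to be normalized in this sketch.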

Concept Analysis in Object Detectors
The post-hoc concept stability assessment framework described above, in particular the used TCAV and ICE methods, is suitable out of the box for use with classification models. However, object detection networks pose additional challenges: besides larger sizes, they have different prediction heads and employ suppressive post-processing of the output.
Multiple Predictions. Unlike classification models that produce a single set of predictions per sample, object detectors may produce multiple predictions, requiring adaptations to TCAV and ICE.
For ICE, the concept weights and importance estimation component require adjustments. The pipeline assesses the effect of small modifications to each concept on the final class prediction. For classification, this estimation is performed on a per-sample basis. For object detection, we switch that calculation to a per-bounding-box basis.
The TCAV process of calculating CAVs remains unchanged. However, TCAV employs the gradients backpropagated from the corresponding class neuron together with the concept CAV to assess the concept sensitivity of the desired output class. In object detectors, concept sensitivity can be computed for each prediction, or bounding box, by starting the backpropagation from the desired class neuron of the bounding box.
It is important to note that some object detection architectures predict an objectness score for each bounding box, which can serve as an alternative starting neuron for the backpropagation [24]. Nonetheless, we only use class neurons for this purpose in our experiments.
Suppressive Post-processing. Another challenge in object detection is the explanation of False Negatives (FNs), which refer to the absence of a detection for a desired object. Users may be especially interested in explanations regarding FN areas, e.g., for debugging purposes. While the raw OD CNN bounding box predictions usually cover all image areas, post-processing may filter out bounding boxes due to low prediction certainty or suppress them during Non-Maximum Suppression (NMS). To still evaluate concept sensitivity for FNs, we compare the list of raw unprocessed bounding boxes with the desired object bounding boxes specified by the user. We then use Intersection over Union (IoU) to select the best raw bounding boxes that match the desired ones, and these selected bounding boxes (i.e., their output neurons) are used for further evaluation.
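The IoU-based selection of raw bounding boxes for False Negatives can be sketched as follows. This is a minimal illustration, assuming boxes in `(x1, y1, x2, y2)` corner format and a non-empty list of raw predictions; the helper names `iou` and `match_raw_boxes` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_raw_boxes(desired_boxes, raw_boxes):
    """For each desired box, return (index, IoU) of the best raw box.

    Assumes raw_boxes is non-empty (raw predictions before NMS).
    """
    matches = []
    for desired in desired_boxes:
        best = max(range(len(raw_boxes)), key=lambda i: iou(desired, raw_boxes[i]))
        matches.append((best, iou(desired, raw_boxes[best])))
    return matches


# Hypothetical raw predictions and one user-specified box.
raw = [(0, 0, 10, 10), (20, 20, 40, 40), (22, 18, 41, 39)]
desired = [(21, 19, 40, 40)]
matches = match_raw_boxes(desired, raw)
```

The returned index identifies the raw prediction whose class neuron would serve as the starting point of the backpropagation. A minimum IoU threshold could be added to reject poor matches; the text does not specify one.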

Evaluation of Concept Stability
Concept Retrieval Stability. We are interested in concepts that are both consistent and separable in the latent space. However, these two traits have not been considered jointly in previous work. Thus, we define the generalized concept stability metric $S^{C}_{L_k}$ for a concept $C$ in layer $L_k$, applicable to a test set $X$, as
\[ S^{C}_{L_k}(X) := \mathrm{separability}^{C}_{L_k}(X) \cdot \mathrm{consistency}^{C}_{L_k}, \tag{1} \]
where $\mathrm{separability}^{C}_{L_k}(X)$ represents how well the tested concepts are separated from each other in the feature space, and $\mathrm{consistency}^{C}_{L_k}$ denotes how similar the representations of the same concept are when obtained under different initialization conditions.
Separability. The binary classification performance of each CAV reflects how effectively the concept is separated from other concepts, when evaluated in a concept-vs-other manner rather than a concept-vs-random approach. In the concept-vs-other scenario, the non-concept class consists of all other concepts, whereas it is a single randomly selected other concept in the concept-vs-random scenario [23]. We choose the separability from Eq. 1 for a single concept $C$ on the test set $X$ as
\[ \mathrm{separability}^{C}_{L_k}(X) := f1^{C}_{L_k}(X) := \frac{1}{N} \sum_{i=1}^{N} f1(X; C, L_k, i), \tag{2} \]
where $f1(X; C, L_k, i)$ is the F1 score on $X$ of the CAV classifier for concept $C$ in layer $L_k$ obtained in the $i$-th of $N$ runs.
Consistency. In TCAV, a limited number of concept samples during CAV training may lead to model underfitting and to significant inconsistency between CAVs obtained for different training samples and initialization conditions [23]. Since cosine similarity was shown to be a suitable similarity measure for CAVs [23,10], we set the consistency measure to the mean cosine similarity between the CAVs in layer $L_k$ of $N$ runs:
\[ \mathrm{consistency}^{C}_{L_k} := \cos^{C}_{L_k} := \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{i-1} \cos\left(\mathrm{CAV}^{C}_{i,L_k}, \mathrm{CAV}^{C}_{j,L_k}\right), \tag{3} \]
where cos(−, −) is the cosine similarity, here between CAVs of the same concept $C$ and layer $L_k$ obtained during different runs $i$, $j$.
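The retrieval stability computation can be sketched in a few lines of plain Python. This is a hedged illustration rather than the authors' implementation: it assumes that separability is the mean F1 score over runs, that consistency is the mean cosine similarity over all unordered pairs of runs, and that both are combined by a product. The function names and toy values are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two flattened CAVs."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency(cavs):
    """Mean cosine similarity over all unordered pairs of runs (needs >= 2 runs)."""
    pairs = list(combinations(cavs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def separability(f1_scores):
    """Mean F1 score of the per-run CAV classifiers on the test set."""
    return sum(f1_scores) / len(f1_scores)


def stability(cavs, f1_scores):
    # Assumption: separability and consistency are combined by a product.
    return separability(f1_scores) * consistency(cavs)


# Three hypothetical runs for one concept in one layer.
cavs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f1_scores = [0.9, 0.8, 1.0]
print(round(stability(cavs, f1_scores), 3))
```

Because cosine similarity and F1 are computed on flattened vectors and scalar scores, the same code applies unchanged to CAVs of any dimensionality, which is what makes the metric layer-agnostic.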
Concept Attribution Stability. Small changes in the input space may significantly change the output and, thus, the gradient values. TCAV requires gradients to calculate the concept sensitivity (attribution) of a given prediction. Hence, gradient instability may have an impact on the explanations and, in the worst case, change them from positive to negative attribution or vice versa. We want to check whether such instability of gradient values influences concept detection. For this, we compare the vanilla gradient approach against a stabilized version using the state-of-the-art gradient stabilization approach SmoothGrad [40]. It diminishes or negates the gradient instability in neural networks by averaging vanilla gradients obtained for multiple copies of the original sample augmented with minor random noise. For comparison purposes, first the vanilla gradient is propagated backward with respect to the detected object's class neuron. This neuron is remembered and then used for the gradient backpropagation for the noisy copies of SmoothGrad. TCAV concept attributions can naturally be generalized to SmoothGrad, defining them as:
\[ attr^{grad}_{C}(x) := \nabla f_{L_k \to}\left(f_{\to L_k}(x)\right) \cdot \mathrm{CAV}_{C}, \qquad attr^{SG}_{C}(x) := \frac{1}{N} \sum_{i=1}^{N} \nabla f_{L_k \to}\left(f_{\to L_k}(x + \epsilon_i)\right) \cdot \mathrm{CAV}_{C}, \tag{4} \]
where $attr^{*}_{C}$ is the attribution of concept $C$ in layer $L_k$ for vanilla gradient ($* = grad$) or SmoothGrad ($* = SG$) for a single prediction for sample $x$, $\epsilon_i$ is the random noise added to the $i$-th of the $N$ noisy copies, $\cdot$ denotes the dot product, $f_{\to L_k}$ is the CNN part up to $L_k$, and $f_{L_k \to}$ is the mapping from $L_k$ representations to the score of the selected prediction and class.
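The difference between vanilla and SmoothGrad concept attribution can be illustrated on a toy model. The sketch below is only an assumption-laden stand-in: `to_layer` and `head_gradient` replace the CNN parts before and after the CAV layer with simple analytic functions, and `sigma` is an absolute noise standard deviation rather than a percentage of the input range.

```python
import math
import random


def to_layer(x):
    """Toy stand-in for the CNN part up to the CAV layer."""
    return [math.tanh(v) for v in x]


def head_gradient(a):
    """Gradient of the toy head score sum(a_i ** 2) w.r.t. the activations a."""
    return [2.0 * v for v in a]


def attr_grad(x, cav):
    """Vanilla attribution: gradient at the layer, dotted with the CAV."""
    grad = head_gradient(to_layer(x))
    return sum(g * c for g, c in zip(grad, cav))


def attr_smoothgrad(x, cav, n_copies=50, sigma=0.1, seed=0):
    """SmoothGrad attribution: mean vanilla attribution over noisy input copies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_copies):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        total += attr_grad(noisy, cav)
    return total / n_copies


x = [0.3, -0.5, 1.0]        # hypothetical input sample
cav = [1.0, 0.0, -1.0]      # hypothetical CAV
vanilla = attr_grad(x, cav)
smoothed = attr_smoothgrad(x, cav)
```

In a real detector, `head_gradient` would be obtained by backpropagating from the remembered class neuron of the selected bounding box, so that the same output neuron is used for all noisy copies.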
Acc. As one approach, for each tested layer we build a confusion matrix for multiple test samples and the bounding boxes therein, where $y_{true} = \mathrm{sign}(attr^{grad}_{i})$ and $y_{predicted} = \mathrm{sign}(attr^{SG}_{i})$, so that the signs of the concept attributions for vanilla gradient and SmoothGrad are compared. On this, accuracy (Acc) is used to show the fraction of cases where SmoothGrad and vanilla gradient concept attributions have the same sign, i.e., where gradient instability has no impact.
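CAD. As a second approach, to quantitatively evaluate the difference between the concept attribution of SmoothGrad and the vanilla gradient in the tested layer, we introduce the Concept Attribution Deviation (CAD) metric. It shows the average absolute attribution value change for all used concepts $C$ and $N$ runs, and, thus, describes the impact of gradient instability on concept attribution in a layer:
\[ \mathrm{CAD}_{L_k}(x) := \frac{1}{|\mathcal{C}| \, N} \sum_{C \in \mathcal{C}} \sum_{i=1}^{N} \left| attr^{SG}_{C,i}(x) - attr^{grad}_{C,i}(x) \right|, \tag{5} \]
where $\mathcal{C}$ is the set of used concepts and $i$ indexes the runs.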
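Both attribution stability measures can be computed from paired lists of attributions. The sketch below assumes that CAD is a plain mean absolute difference without normalization, which is one reading of the description above; the function names and toy values are hypothetical.

```python
def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def sign_agreement_accuracy(grad_attrs, sg_attrs):
    """Fraction of cases where vanilla and SmoothGrad attributions share a sign."""
    same = sum(1 for g, s in zip(grad_attrs, sg_attrs) if sign(g) == sign(s))
    return same / len(grad_attrs)


def concept_attribution_deviation(grad_attrs, sg_attrs):
    # Assumption: unnormalized mean absolute difference over all pairs.
    diffs = [abs(s - g) for g, s in zip(grad_attrs, sg_attrs)]
    return sum(diffs) / len(diffs)


# Hypothetical attributions collected over concepts, runs, and bounding boxes.
grad_attrs = [0.5, -0.2, 0.1, -0.4]
sg_attrs = [0.4, 0.1, 0.2, -0.3]
acc = sign_agreement_accuracy(grad_attrs, sg_attrs)
cad = concept_attribution_deviation(grad_attrs, sg_attrs)
```

A high Acc together with a low CAD would indicate that gradient instability has little effect in the tested layer; a sign flip, as in the second pair above, lowers Acc regardless of how small the absolute change is.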

CAV Dimensionality
The stability can be greatly affected by the number of CAV parameters, which is especially important in object detectors with large intermediate representations. Moreover, a larger CAV size leads to increased memory and computation requirements. The original TCAV paper proposes using 3D-CAV-vectors [23]. However, alternative translation-invariant 1D- [10,49] and channel-invariant 2D-CAV-representations, which have fewer parameters, are possible. If the dimensions of a 3D-CAV for an arbitrary intermediate layer of an OD are C × H × W, then the dimensions of the 1D- and 2D-CAV are C × 1 × 1 and 1 × H × W, respectively, where C, H, and W denote the channel, height, and width dimensions (see Fig. 2). During inference, the 1D-CAV provides one presence score per channel and possesses the property of translation invariance. This implies that only the presence or absence of a concept in the input space matters, rather than its size or location. In contrast, the 2D-CAV concentrates solely on the location of the concept, providing one presence score for each activation map pixel location. This can also be advantageous in certain circumstances (e.g., for the concepts "sky" or "ground"). During inference, the 3D-CAV provides a single concept presence score for the complete image, depending on location, size, and filter distribution of the concept. However, it comes with the disadvantage of larger size and higher computational requirements.
Original 3D-CAVs do not require special handling of the latent space. But for the evaluation of 1D- and 2D-CAVs, we preprocess incoming latent space vectors to match the CAV dimensionality by taking the mean along the width and height dimensions, or along the channel dimension, respectively, as already successfully applied in previous work [7,14]. In other words, for the calculation of CAVs with reduced dimensions, we aggregate activations and gradients along certain dimensions. CAV dimension size is a hyperparameter, which may impact CAV memory consumption, CAV stability, the overall performance of concept separation, CAV training speed, and subsequent operations with CAVs (e.g., evaluation of the concept attribution). Thus, we also propose using our stability metrics for the selection of the optimal CAV dimension size.
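The aggregation that maps a C × H × W activation (or gradient) tensor to the 1D- and 2D-CAV input shapes can be sketched as follows. This is a minimal plain-Python illustration using nested lists; the helper names `to_1d` and `to_2d` are hypothetical, and a tensor library would normally replace the list comprehensions.

```python
def to_1d(act):
    """Mean over height and width: C x H x W -> C values (one per channel)."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in act]


def to_2d(act):
    """Mean over channels: C x H x W -> H x W values (one per location)."""
    c, h, w = len(act), len(act[0]), len(act[0][0])
    return [[sum(act[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]


# Toy activation tensor with C = 3, H = 2, W = 2.
act = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0
    [[0.0, 0.0], [2.0, 2.0]],   # channel 1
    [[4.0, 4.0], [4.0, 4.0]],   # channel 2
]
act_1d = to_1d(act)   # 3 parameters instead of 12
act_2d = to_2d(act)   # 4 parameters instead of 12
```

The same aggregation has to be applied to the gradients before the dot product with a reduced CAV, so that both vectors have matching shapes.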

Experimental Setup
We use the proposed framework to conduct the following experiments for OD and classification models: 1) evaluation of concept representation stability via the selection of representation dimensionality; 2) inspection of the impact of gradient instability in CNNs on concept attribution. The process of concept analysis in classifiers can be carried out using the default approaches proposed in the original papers [23,49], and it does not require any special handling.
In the following subsections, we describe selected experimental datasets and concept data preparation, models, model layers, and hyperparameter choices. Experiment results and interpretation are described later in Section 5.

Datasets
Object Detection. For unsupervised concept mining in object detectors and experiments with ODs, we use the validation set of the MS COCO 2017 [26] dataset, containing 5000 real-world images with 2D object bounding box annotations, including many outdoor and urban street scenarios. We mine concepts from bounding boxes of the person class with an area of at least 20,000 pixels, so that the mined concept images have a reasonable size and can be visually analyzed by a human. The resulting subset includes more than 2679 bounding boxes of people in different poses and locations extracted from 1685 images.
Classification. For concept stability experiments with classification models, we use the BRODEN [3] and CycleGAN Zebras [50] datasets. BRODEN contains more than 60,000 images with image- and pixel-wise annotations for almost 1200 concepts of 6 categories. CycleGAN Zebras contains almost 1500 images of zebras suitable for supervised concept analysis.

Models
To evaluate the stability of semantic representations in the CNNs of different architectures and generations, we selected three object detectors and three classification models with various backbones.

Layer Selection for Concept Analysis
To identify any influence of the layer depth on extracted concept stability, we must analyze the latent space of DNNs across multiple layers. To accomplish this, we extract intermediate representations and concepts from ten intermediate convolutional layers of ODs and seven intermediate convolutional layers of classifiers. These layers are uniformly distributed throughout the backbones of the CNNs. The names of the selected layers for each network are listed in Tab. 1 and Tab. 2, where each layer is identified by a symbolic name in the format $l_x$, where $x$ denotes the relative depth of the layer in the backbone (i.e., layers from $l_1$ to $l_7$ for classifiers and from $l_1$ to $l_{10}$ for ODs).

Synthetic Concept Generation and Concept Selection
Object Detection. To conduct concept analysis experiments with object detectors, we generate synthetic concept samples using concept information extracted from MS COCO (see Fig. 1 and Sec. 3.1). We used ICE [49] to mine concept-related superpixels (image patches) from MS COCO bounding boxes of the person class that have an area of at least 20,000 pixels. Then, we visually inspected 30 mined concepts (10 for each of the following YOLO5 layers: 8.cv3.c, 9.cv1.c, and 10.c; see caption of Tab. 2 for notations) and selected 3 concepts semantically corresponding to the labels "legs", "head", and "torso". Interestingly, we found that several concepts (e.g., "head", "legs") were present in more than one layer. We only picked one of the concepts of the same type based on the subjective quality. For each selected concept, we save 100 concept-related superpixels using a concept mask binarization threshold of 0.5.
Examples of the MS COCO synthetic concepts can be seen in Figure 3. To generate a synthetic concept sample of a size of 640 × 480 pixels, 1 to 5 concept-related superpixels are selected and placed on a background of random noise drawn from a uniform distribution (alternatively, images of natural environments can be used as a background). Additionally, random scaling is applied to the superpixels before placement with a random factor between 0.9 and 1.1.
Classification. We use the labeled concepts "stripes", "zigzags", and "dots" from the BRODEN dataset to analyze the stability of concept representation and attribution in classification models on the examples of zebra images from the CycleGAN dataset.
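The synthetic sample generation can be sketched in plain Python as below. Several simplifications are assumptions of this sketch and not taken from the described pipeline: patches are rectangular single-channel grids rather than masked RGB superpixels, scaling uses nearest-neighbour sampling, and the helper names `scale_patch` and `synth_sample` are hypothetical.

```python
import random


def scale_patch(patch, factor):
    """Nearest-neighbour rescaling of a 2D patch given as a list of rows."""
    h, w = len(patch), len(patch[0])
    new_h, new_w = max(1, round(h * factor)), max(1, round(w * factor))
    return [[patch[min(h - 1, int(i / factor))][min(w - 1, int(j / factor))]
             for j in range(new_w)] for i in range(new_h)]


def synth_sample(patches, width=640, height=480, seed=0):
    """Place 1 to 5 randomly scaled patches on a uniform-noise background."""
    rng = random.Random(seed)
    canvas = [[rng.random() for _ in range(width)] for _ in range(height)]
    for _ in range(rng.randint(1, 5)):
        patch = scale_patch(rng.choice(patches), rng.uniform(0.9, 1.1))
        ph, pw = len(patch), len(patch[0])
        top = rng.randint(0, height - ph)
        left = rng.randint(0, width - pw)
        for i in range(ph):
            # Overwrite one row segment; the row length stays unchanged.
            canvas[top + i][left:left + pw] = patch[i]
    return canvas


# One hypothetical 20 x 30 patch with a constant value outside the noise range.
patches = [[[2.0] * 30 for _ in range(20)]]
canvas = synth_sample(patches)
```

Patches may overlap in this sketch, since the text does not state whether overlaps are prevented.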

Experiment-specific Settings
Experiment 1: CAV Stability and Dimensionality. We conduct CAV stability experiments for 1D-, 2D-, and 3D-CAVs (see Sec. 3.4) with YOLO5, RCNN, SSD, ResNet, SqueezeNet, and EfficientNet models to measure the potential concept retrieval stability in different networks and setups. For stability measurement, the number N of CAV retrieval runs with different initialization parameters is set to 15, which is similar to the ensemble size in [31], as we observed it to be a good trade-off regarding computational speed. In each run, we utilize 100 samples per concept, dividing them into 80 for concept extraction and 20 for validation (estimation of f1).
To further examine the influence of the number of concept training samples on CAV stability, we also test three additional setups with 20, 40, and 60 training concept samples. The test has been conducted for all six networks.
Experiment 2: Gradient Stability in Concept Detection. For gradient stability experiments, ResNet and YOLO5 are selected as the models with the best CAV stability from Experiment 1. Moreover, we validate setups with 1D- and 3D-CAVs to see how gradient instability affects concept attribution in CAVs of different dimensionality. For the computation of SmoothGrad, we use the hyperparameter values recommended in [40]: the number of noisy copies N is set to 50, and the amount of applied Gaussian noise is set to 10%.

CAV Stability and Dimensionality
The CAV stability results for 1D-, 2D-, and 3D-CAVs in different layers of the YOLO5, RCNN, SSD, ResNet, SqueezeNet, and EfficientNet networks are presented in Tabs. 3 to 8. In addition, Figs. 4 to 9 visualize the impact of the number of training concept samples on the overall stability of 1D-, 2D-, and 3D-CAVs.
3D-CAVs achieve a concept separation (f1) that can sometimes even outperform that of 1D-CAVs. This is typical for classifiers, where, for instance, f1 of 3D-CAVs is the highest in all layers of ResNet (Tab. 6). However, for all models they exhibit mediocre CAV consistency (cos), possibly due to the larger number of parameters and the relatively small number of training concept samples. Overall, 3D-CAVs are less stable than 1D-CAVs but can still be used for CA. In contrast, 2D-CAVs exhibit relatively high consistency (e.g., in Tab. 6, layers l_5, l_6, and l_7 have the top cos values for 2D-CAVs), but they have the worst concept separation (f1), as observed in all tables. As a result, the overall 2D-CAV stability is the worst in all models. In 2D-CAVs, no distinction is made between different channels in the latent space due to the 3D-to-2D aggregation. The noticeable reduction of concept separation (f1) in 2D-CAVs reinforces the assumption made in other works (e.g., [3,10]) that concept information is encoded in different convolutional filters or their linear combinations.
1D-CAVs achieve the best overall CAV stability due to their (mostly) best consistency (cos) and good concept separation (f1). Moreover, 1D-CAVs have the advantage of fast computation since they have fewer parameters. These properties make 1D-CAVs highly stable even in shallow layers, where other CAVs may experience low stability. For example, in Tab. 3, the stability of 1D-CAVs in layer l_1 (S_{L_k} = 0.732) is substantially higher than that of 2D- and 3D-CAVs, which reach only 0.223 and 0.199, respectively. Based on our empirical findings, we recommend using 1D-CAVs as the default representation for most applications due to their superior overall stability. However, for safety-critical applications, we advise using our stability assessment methodology prior to CA.
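The difference between the three representations can be made concrete with a small sketch. It assumes an activation tensor of shape C × H × W stored as nested lists and uses mean aggregation, which is an assumption made for illustration; the aggregation actually used is defined in Sec. 3.4. The helper names are hypothetical.

```python
def to_1d(act):
    """C x H x W -> C: average each channel over its spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in act]


def to_2d(act):
    """C x H x W -> H x W: average over channels at each spatial position."""
    c, h, w = len(act), len(act[0]), len(act[0][0])
    return [[sum(act[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]


def n_params(act):
    """Number of CAV parameters for the 1D, 2D, and 3D representations."""
    c, h, w = len(act), len(act[0]), len(act[0][0])
    return {"1D": c, "2D": h * w, "3D": c * h * w}


# Toy activation with 2 channels of size 2 x 2.
act = [[[1, 2], [3, 4]],
       [[5, 6], [7, 8]]]
swapped = act[::-1]  # same activation with the two channels exchanged

channel_view = to_1d(act)   # one value per channel
spatial_view = to_2d(act)   # one value per spatial position
# Swapping channels leaves the 2D view unchanged but changes the 1D view,
# which is the loss of channel information described in the text.
unchanged_2d = to_2d(swapped) == spatial_view
changed_1d = to_1d(swapped) != channel_view
```

The parameter counts returned by `n_params` also show why 3D-CAVs need more training samples per parameter than 1D-CAVs.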
Concept Abstraction Level Impact. In OD models, experiments are conducted with concepts of medium-to-high levels of abstraction (complex shapes and human body parts), which are usually detected in middle and deep layers of the network [45]. Thus, worse concept separation (f1) is expected in shallow layers, and this has indeed been observed across all CAV dimensionalities (as shown in Tabs. 3 to 8). However, this observation does not always hold for 2D-CAVs, for which concept separation drops in some deeper layers. For instance, in Tab. 4, l_4 and l_7 have f1 values of 0.420 and 0.448, while for l_1 it is 0.530. Tab. 4 also shows that the increase of f1 for 2D-CAVs is not as high as it is for 1D- and 3D-CAVs. The range of f1 for 2D-CAVs is 0.420 to 0.659, whereas for 3D-CAVs it is 0.536 to 0.941. These findings further support [...]
However, labeling concepts is a time-consuming and expensive process. Therefore, we recommend using at least 40 to 60 concept-related samples for training each CAV. In most cases, the stability obtained with 80 samples is only marginally better than that obtained with 40 (see Fig. 8) or 60 samples (see Fig. 4 and Fig. 6).
The CAV stability differences among the inspected architectures can also be observed in Figs. 4 to 9. For example, for 1D-CAVs of ResNet (Fig. 7) and 1D- and 3D-CAVs of SqueezeNet (Fig. 8), the stability quickly reaches its optimal value in the first one or two layers and remains similar in deeper layers. In other cases, such as 3D-CAVs of SSD (Fig. 6) or all CAV dimensionalities of RCNN (Fig. 5), stability gradually increases with the relative depth of the layer. Finally, the stabilities of 1D- and 3D-CAVs of YOLO5 (Fig. 4) and of 1D- and 3D-CAVs of EfficientNet (Fig. 9) grow until an optimal layer in the middle and slowly decrease after it.

Gradient Stability in Concept Detection
Based on the experimental results, it can be concluded that the negative impact of gradient instability on concept analysis using TCAV is minimal. The results presented in Tabs. 9 and 10 are based on 1500 concept attribution predictions (see Eq. 4) for 500 images and 3 concepts per image, for each tested layer of ResNet with 1D- and 3D-CAVs, respectively. Similarly, Tabs. 11 and 12 are built for each tested layer of YOLO5 with 1D- and 3D-CAVs, respectively, using 2136 concept attribution predictions for 712 bounding boxes and 3 concepts per bounding box.
SmoothGrad Impact. In Tabs. 9 to 12, the relative depth of the CNN backbone layers increases from left to right, while the gradient backpropagation depth from the outputs to the CAV layer increases from right to left. As expected, the gradient becomes more unstable with backpropagation depth [40], resulting in higher CAD values in shallow layers than in deeper layers. A higher number of concept attribution sign flips is also observed in shallow layers (see Sec. 3.3), where the accuracy (Acc) values are low. These observations confirm the negative correlation between CAD and Acc: CAD increases as Acc decreases. This suggests that gradient smoothing techniques, such as SmoothGrad, can have a higher impact on concept attribution values in shallow layers, where the gradient instability is higher.
Despite the negative correlation between CAD and Acc values, the overall accuracy values remain at or above 0.9 for all layers in the provided tables. The lowest accuracy value for ResNet, Acc = 0.90, is observed in Tab. 10 for l_1. For YOLO5, the lowest value, Acc = 0.91, is obtained in l_5 (Tab. 12). This indicates that the sign of concept attribution changes for only a minority of predictions across all tested networks and configurations. However, it is worth noting that CAD values can be high in shallow layers, for instance, CAD = 31.3% at layer l_1 in Tab. 10, resulting in a higher rate of concept attribution sign flipping compared to deeper layers. The use of SmoothGrad comes at a higher computational cost than the vanilla gradient: it is more than N times as expensive, where N is the number of noisy copies, and it mostly impacts concept attribution in shallow and middle layers of networks. Therefore, it is advisable to use SmoothGrad when conducting concept analysis in shallow layers of networks with large backbones such as ResNet101 or ResNet152.
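The sign-flip reading of Acc can be illustrated as follows. The sketch assumes that concept attribution is the TCAV-style directional derivative (dot product of the gradient at the CAV layer with the CAV) and that Acc is the fraction of predictions whose attribution sign is the same with vanilla gradients and with SmoothGrad. The exact definitions of Acc and CAD are given in Sec. 3.3 and Eq. 4 and are not reproduced here; CAD is deliberately left out. The function names and toy values are hypothetical.

```python
def attribution(gradient, cav):
    """TCAV-style concept attribution: directional derivative along the CAV."""
    return sum(g * v for g, v in zip(gradient, cav))


def sign(value):
    return (value > 0) - (value < 0)


def sign_agreement(vanilla, smoothed):
    """Fraction of predictions whose attribution sign is unchanged."""
    if len(vanilla) != len(smoothed):
        raise ValueError("both lists must hold one attribution per prediction")
    same = sum(1 for a, b in zip(vanilla, smoothed) if sign(a) == sign(b))
    return same / len(vanilla)


# Toy usage with made-up attribution values (not results from the paper):
# the second prediction flips its sign after smoothing.
vanilla_attr = [0.5, -0.2, 0.1, -0.4]
smoothed_attr = [0.4, 0.1, 0.2, -0.3]
acc = sign_agreement(vanilla_attr, smoothed_attr)
```

In the setting of the text, each list would hold one attribution per (image or bounding box, concept) pair for a given layer.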
CAV Dimensionality Impact. The use of 1D-CAV representations generally results in lower CAD values than 3D-CAVs, typically with a difference of 2-3%. This behavior can be attributed to the higher stability of 1D-CAVs, which is in turn caused by the lower number of parameters. The observation is consistent across all layers of ResNet and the majority of YOLO5 layers, as shown in Tabs. 9 to 12. However, the dimensionality of CAV does not affect the behavior of gradient instability in other regards: CAD remains higher and Acc lower in shallow layers regardless of the CAV dimensionality.

Conclusion and Outlook
This study proposes a framework and metrics for evaluating the layer-wise stability of global vector representations in object detection and classification CNN models for explainability purposes. We introduced two stability metrics: concept retrieval stability and concept attribution stability. We also proposed adaptation methodologies for unsupervised CA and supervised gradient-based CA methods for combined, labeling-efficient application in object detection models.
Our concept retrieval stability metric jointly evaluates the consistency and the separation in the feature space of semantic concept representations obtained across multiple runs with different initialization parameters. We used the TCAV method as an example to examine factors that affect stability and found that aggregated 1D-CAV representations offer the best performance. Furthermore, we determined that a minimum of 60 training samples per concept is necessary to ensure high stability in most cases.
The second metric, concept attribution stability, assesses the impact of gradient smoothing techniques on the stability of concept attribution. Our observations suggest that 1D-CAVs are more resistant to gradient instability, particularly in deep layers, and we recommend using gradient smoothing in shallow layers of deep network backbones.
Our work provides valuable quantitative insights into the robustness of concept representation, which can inform the selection of network layers and concept representations for CA in safety-critical applications. For future work, it will be interesting to apply the proposed approaches and metrics to alternative global concept vector representations and perform comparative analysis.

Fig. 3: Examples of synthetic concept samples generated using concept superpixels obtained from MS COCO.

Fig. 4: Impact of number of concept samples on CAV stability for YOLO5

Fig. 6: Impact of number of concept samples on CAV stability for SSD

Fig. 8: Impact of number of concept samples on CAV stability for SqueezeNet

Table 1: Shorthands l_i of selected classification CNN intermediate layers for Concept Analysis (l=layer, b=block, f=features, s=squeeze).

Table 2: Shorthands l_i of selected OD CNN intermediate layers for Concept Analysis (b=block, f=features, e=extra, c=conv).

Table 3: Stability of generated CAVs of different dimensions for YOLO5.

Table 4: Stability of generated CAVs of different dimensions for RCNN.

Table 5: Stability of generated CAVs of different dimensions for SSD.

Table 6: Stability of generated CAVs of different dimensions for ResNet.

Table 7: Stability of generated CAVs of different dimensions for SqueezeNet.

Table 8: Stability of generated CAVs of different dimensions for EfficientNet.
Architecture Impact. From Tabs. 3 to 8 we see that the top CAV stability (S_{L_k}) values achieved by ODs and classifiers for CAVs trained on the same concept datasets are very similar. However, due to architectural differences, the top stability values are achieved at different relative layer depths. For example, the top stabilities for 1D-CAVs in the YOLO5, RCNN, and SSD object detectors are achieved in layers l_6, l_9, and l_6, respectively, with corresponding values of 0.915, 0.882, and 0.909 (see Tabs. 3, 4, and 5). Similarly, the top stability values for 1D-CAVs in the ResNet, SqueezeNet, and EfficientNet classifiers are achieved in layers l_5, l_7, and l_6, respectively, with corresponding values of 0.900, 0.876, and 0.885 (Tabs. 6, 7, and 8). The same tables show that the layers with top stability values may differ between CAV dimensionalities even within the same model (e.g., in Tab. 3, the YOLO5 top stabilities for 1D-, 2D-, and 3D-CAVs are obtained in layers l_6, l_9, and l_8, respectively).

Table 9: Gradient stability in layers of ResNet for 1D-CAV.

Table 10: Gradient stability in layers of ResNet for 3D-CAV.

Table 11: Gradient stability in layers of YOLO5 for 1D-CAV.