
1 Introduction

Deep neural networks (DNNs), in particular convolutional DNNs (CNNs), have achieved remarkable success in computer vision [55], but are known to be vulnerable to adversarial evasion attacks (AAs) [1, 28, 34, 52]: for a given input, an attacker can easily find small, targeted perturbations that are seemingly irrelevant [6] or even imperceptible [28] to humans, but strongly change the CNN output [52], possibly causing erroneous predictions. CNN vulnerability to AAs has raised significant concerns regarding their reliability and robustness [23]. While the impact of AAs on model predictions has been extensively studied [1, 23, 32, 34], their effect on the internal representations learned by these models remains largely unexplored. Understanding how AAs influence the learned representations is crucial for developing robust CNNs, as well as effective defenses against adversarial threats.

Recent advancements in XAI have provided valuable tools for probing the internal representations learned by deep neural networks [18, 30, 47, 49]. In particular, XAI methods for unsupervised post-hoc concept embedding analysis [48] have proven successful in identifying frequent patterns in latent representations of trained CNNs that are used for the CNN’s decision-making [20, 22, 54]. These recurring patterns, represented as linear components, are referred to as concepts.

In this work, we perform an in-depth analysis of the influence of strong targeted AAs on the usage of concepts learned by CNNs using concept-based XAI techniques. Inspired by the perspective of Ilyas et al. [28] that “adversarial examples are no bugs, they are features”, we hypothesize that AAs exploit and manipulate the learned concepts within latent spaces of these models. Through a comprehensive set of experiments across various network architectures and targeted strong white-box AA types, we unveil several key findings that shed light on the nature of AAs and their impact on learned representations:

  (1) Adversarial attacks substantially alter the concept composition learned by CNNs, introducing new concepts or modifying existing ones.

  (2) Adversarial perturbations can be decomposed into linear components, only a subset of which is primarily responsible for the attack’s success.

  (3) Different attacks comprise similar components, suggesting they nudge the CNN intermediate outputs towards specific common adversarial feature space directions, albeit with varying magnitudes.

  (4) The learned adversarial concepts are mostly specific to the target class, agnostic of the class the attack starts from. This implies that attacks abuse target-specific feature space directions.

These insights into AAs’ impact on learned representations open up new directions for the future design of more robust models and adversarial defenses.

The remainder of the paper is structured as follows: After setting this paper in the context of related work (Sect. 2), we introduce the necessary background on the AA and XAI techniques used (Sect. 3), and the common experimental setup (Sect. 4) for investigating our hypothesis. Section 5 then presents the experimental studies and their results. We conclude in Sect. 6 with a discussion of implications for the understanding and handling of AAs.

2 Related Work

Adversarial Attacks. AAs on CNNs have received increased attention [1] since the first description of the phenomenon in 2014 [23, 52]. These attacks generate perturbations to input data that can lead to erroneous model predictions with high confidence [23]. Numerous attack methods have been proposed, ranging from digital to physically implemented manipulations [15], and from white-box attacks against fully known models to purely query-based ones [1]. Commonly used strong AAs, i.e., ones achieving the largest output deviation at minor input changes, are the white-box attacks considered here [1, 8]. The first of its kind was the gradient-based Fast Gradient Sign Method (FGSM) [23], refined to the Basic Iterative Method (BIM) [32], and Projected Gradient Descent (PGD) [34]. These were later superseded by the optimization-based Carlini and Wagner (C&W) [8] attack. Another common type of attack is the adversarial patch [6], which aims to simulate real-world scenarios by restraining perturbations to be local. This allows them to be physically realized, e.g., via stickers [15].

Considerable efforts were invested to develop defenses against adversarial attacks [1, 3]. However, many of these are still vulnerable to adaptive attacks [2, 7], highlighting the ongoing arms race between attacks and defenses. Meanwhile, the underlying nature and causes of adversarial vulnerability have not been satisfactorily clarified: hypotheses include that they originate from invisible, class-correlated features in the data [28], high-frequency image features [53], high dimensionality [21] and curvature [13] of CNN latent spaces, and error amplification along singular vulnerable latent features [33] (which we here even find to be superlinear, see Subsect. 5.1). But none could be proven to cover all cases of adversaries, leaving the nature of AAs an open question.

Concept-Based XAI Methods. While it might be more desirable to use fully transparent models [43], black-box CNNs are hard to replace for vision tasks [55]. Fortunately, post-hoc XAI methods allow many aspects of a trained black-box CNN to be explained without changing it [47], including, e.g., the importance of input features for decisions [44], more general model behavior (using simplified surrogates) [10, 41], and internal information flow [25]. The question of what information is encoded in CNN intermediate outputs is tackled by the subfield of concept-based XAI methods: here, CNN latent space structures are associated with human-understandable symbolic concepts [46]. Early works started by explaining the meaning of single CNN units [4, 38]. Kim et al. [30] soon after found that, more generally, vectors, i.e., directions, in CNN latent space are a favorable target for explaining the information encoded in CNN intermediate outputs (in contrast to non-linear approaches [14]). Linear approaches nowadays either work supervised, i.e., find concept vectors for user-pre-defined concepts [18, 30]; or, as considered here, unsupervised, by extracting main linear components from CNN intermediate outputs, which are then visualized and interactively interpreted [22, 54]. Note that, unlike in supervised concept-based analysis, these patterns, i.e., concepts, are not always human-interpretable. The concept discovery is commonly achieved through techniques such as clustering of activations [22] or activation pixels [40], and, more effectively, through matrix decomposition methods like NMF [16, 54], PCA [16, 54], ICA [16], or RCA [16]. This study focuses on NMF and PCA, as the former provides more interpretable concepts, while the latter offers greater numerical efficiency [54]. A component is visualized by highlighting the spatial input regions where it is prominent in the CNN activation map. That way, unsupervised concept-based approaches allow us to gain insights into the semantic features the CNN has learned to use, enabling us to observe changes in the used knowledge at the semantic level. We leverage this here to assess the influence of adversarial attacks on CNN knowledge.

Adversarial Attacks and XAI. While adversarial attacks and XAI techniques have been extensively studied, their intersection has received limited attention. Recent interdisciplinary works have only explored the use of feature importance XAI methods for the detection of adversarial samples [9, 17, 19, 29, 42, 50]; and showed that both feature importance [13, 21] and concept representations [5] can be attacked, i.e., modified in a targeted manner via input perturbations. However, when targeting CNN outputs, the impact of AAs on the learned representations and concepts within deep neural networks remains largely unexplored. First investigations into the nature of AAs considered only their origins in input features [28]. Later ones, going inside the CNN, were restricted to overall structural properties like dimensionality [21], curvature [13], and decision boundary visualization [45], or the impact of singular neurons in amplifying errors [33]. To our knowledge, we are the first to take a look at the interplay between AAs and learned concepts in CNN latent spaces.

3 Background

In this section, we provide details of adversarial attacks (Sect. 3.1), concept discovery (Sect. 3.2), and concept comparison methods (Sect. 3.3) used in this work.

3.1 Adversarial Attacks

Given an input image \(x\in \mathbb {R}^{h\times w}\) and a classification model \(f:\mathbb {R}^{h\times w}\rightarrow Y\) with original prediction \(y = f(x)\in Y\), the primary objective of an adversarial attack is to discover or generate a perturbation \(\delta \) of maximum size \(\Vert \delta \Vert \le \epsilon \) such that the output of f on the adversarial example \(x+\delta \) changes (\(f(x+\delta )\ne y\), respectively the output difference \(\Vert f(x+\delta )-f(x)\Vert \) is large for a chosen metric \(\Vert \cdot \Vert \)). The size constraint ensures that x and its perturbed version \(x+\delta \) are similar, or even indistinguishable, to humans.

In this work, we aim to investigate the impact of adversarial attacks on the level of concepts, and in particular relative to the AA’s impact on the DNN output. As this requires fine-grained control on the severity and direction of attacks, we here focus on targeted white-box optimization-based attacks. Targeted attacks aim to manipulate the model’s prediction from the true class y to a specific pre-defined target class \(y'\), thus ensuring \(f(x + \delta ) = y'\). White-box attacks utilize model internal information like gradients to generate the perturbation, and optimization-based ones optimize from x to \(x+\delta \) in a controllable step-wise manner. Our utilized attacks cover a wide range of common white-box attack techniques explained in the following: BIM [32], PGD [34], C&W [8] and adversarial patch [6], visualized in Fig. 1.

Fig. 1. Examples of BIM, PGD, C&W, and Patch Attack adversarial samples: “fire truck” attacked with target “banana”.

BIM. The Basic Iterative Method (BIM) [32] is an iterative variant of the original FGSM attack [23]. In BIM, adversarial perturbations are iteratively computed by changing x in small steps in the direction of the gradient until \(x+\delta \) is misclassified (for targeted attacks: classified as the target class \(y'\)). Formally:

$$\begin{aligned} x^{t+1} &= \text {clip}_{\epsilon }\left( x^{t} + \alpha \cdot \text {sign}(\nabla _x J(f(x^{t}), y'))\right) , \end{aligned}$$
(1)

where t represents the optimization step, \(J(\cdot )\) denotes the cost function (e.g., cross-entropy), \(\nabla _x\) the gradient at x, \(\alpha \) is a step size hyperparameter, \(\text {sign}(\cdot )\) is the element-wise signum function, and \(\text {clip}_{\epsilon }(\cdot )\) clips \(x+\delta \) element-wise such that it lies in the \(L_\infty \) ball of radius \(\epsilon \) around x.
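To make the update in Eq. (1) concrete, the following minimal PyTorch sketch implements a targeted BIM loop. It is an illustration under assumptions not stated in the paper: inputs are normalized to [0, 1], hyperparameters are placeholders, and in the targeted setting the loss w.r.t. the target label is descended, so the step sign differs from the untargeted formulation.

```python
import torch
import torch.nn.functional as F

def bim_targeted(model, x, y_target, eps=8/255, alpha=2/255, steps=10):
    """Minimal targeted BIM sketch (cf. Eq. (1)); hyperparameters are illustrative."""
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # step towards the target class (descend its loss); untargeted BIM would ascend instead
        x_adv = x_adv.detach() - alpha * grad.sign()
        # clip element-wise to the L_inf eps-ball around x and to the valid input range
        x_adv = torch.clamp(x_adv, x_orig - eps, x_orig + eps).clamp(0, 1)
    return x_adv
```

For the \(L_\infty \) ball, this element-wise clipping coincides with the projection used by PGD (Eq. (2)), which in addition typically starts from a randomly perturbed point within the \(\epsilon \)-ball.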

PGD. Another powerful iterative attack is Projected Gradient Descent (PGD) [34]. Unlike the value clipping approach used in BIM, the PGD attack projects each iterate back onto the \(L_\infty \) \(\epsilon \)-ball around the original image:

$$\begin{aligned} x^{t+1} &= \text {proj}_{\epsilon }\left( x^{t} + \alpha \cdot \text {sign}(\nabla _x J(f(x^{t}), y'))\right) , \end{aligned}$$
(2)

where \(\text {proj}_{\epsilon }\) denotes the projection onto the \(\epsilon \)-ball centered at x.

C&W. The Carlini & Wagner (C&W) [8] attack aims to find the smallest perturbation \(\delta \) that leads to misclassification. It formulates the attack as an optimization problem, using the \(L_p\) norm (typically \(L_2\)) to measure the size of the perturbation. Formally, given a helper function \(h(\cdot )\) fulfilling \(h(x')\le 0 \Leftrightarrow f(x')=y'\), they optimize

$$\begin{aligned} \min _{\delta } \Vert \delta \Vert _p + \beta \cdot h(x + \delta ) , \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _p\) denotes the \(L_p\) norm, and \(\beta \) is a hyperparameter that controls the trade-off between the size of the perturbation and the induced output change. \(h(\cdot )\) can be set to, e.g., \(h(x')=1-J(f(x'),y')\), where \(J(\cdot )\) is the cross-entropy.
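A minimal sketch of the optimization in Eq. (3) is given below, using a hinge-style surrogate for \(h(\cdot )\) and plain Adam on \(\delta \). This is a simplification for illustration only, with illustrative hyperparameters: the original C&W attack additionally uses a change of variables to enforce box constraints and a binary search over the trade-off constant.

```python
import torch

def cw_like_attack(model, x, y_target, beta=1.0, steps=200, lr=0.01):
    """Simplified sketch of the C&W objective (Eq. (3)) with an L2 norm."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model((x + delta).clamp(0, 1))
        target_logit = logits.gather(1, y_target[:, None]).squeeze(1)
        # hinge surrogate h(x'): positive as long as some other class outscores the target class
        others = logits.clone()
        others.scatter_(1, y_target[:, None], float('-inf'))
        h = (others.max(dim=1).values - target_logit).clamp(min=0)
        loss = delta.flatten(1).norm(p=2, dim=1).sum() + beta * h.sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach().clamp(0, 1)
```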

Adversarial Patch. An adversarial patch [6] is the most common real-world attack [1, 6, 15]. Here, \(\delta \) is constrained to be local to a region \(\textit{loc}\), i.e., only change pixels within \(\textit{loc}\). The induced input changes may be visible to humans, but are easily overlooked as being irrelevant, and can be physically implemented (e.g., as stickers [15]), cf. Fig. 1. The choice of the patch \(\delta ^{loc}\) and its location loc on the image can vary depending on the specific objectives of the attack [1].
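Applying a patch attack amounts to overwriting the pixels within loc, e.g., with a pre-trained patch as used later in Sect. 4.2. A minimal sketch (patch tensor and location are hypothetical inputs):

```python
import torch

def apply_patch(x, patch, top, left):
    """Paste a (C, ph, pw) patch into a batch of images x at location loc = (top, left)."""
    x_adv = x.clone()
    ph, pw = patch.shape[-2:]
    x_adv[..., top:top + ph, left:left + pw] = patch
    return x_adv
```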

3.2 Concept Discovery with Matrix Factorization

In this work, we aim to investigate the impact of AAs on sample embeddings in the DNN latent space, particularly on the occurrence of learned semantic concepts. For this, concept discovery techniques are needed. Their goal is to identify primary components (the concepts) in latent representations occurring during DNN decision-making. One desirable property is that these components can be represented as linear combinations of convolutional filters [18, 30], which is also the underlying assumption for concept discovery with matrix factorization [54]: concept-related information is distributed across the channels of sample activation maps. Zhang et al. [54] show that existing approaches to linear concept discovery, including ones based on clustering [22], principal component analysis (PCA), and non-negative matrix factorization (NMF), can be modeled as matrix factorization problems, which we describe in the following.

For the formalization assume that we are given a set X of b probing input images with activations \(A=f_{\rightarrow L}(X) \in \mathbb {R}^{(b \times h \times w) \times c}\) in a selected layer L of the tested CNN f, where b, h, w, and c are the batch, height, width, and channel dimensions, respectively. The decomposition techniques aim to find a matrix \(M \in \mathbb {R}^{k \times c}\) of k components/concepts that allows each activation map pixel \(a\in \mathbb {R}^c\) in A to be written (approximately) as a linear combination of the components in M. Formally, one optimizes the concepts M and weights \(W\in \mathbb {R}^{(b \times h \times w) \times k}\) such that the reconstruction error is minimized with respect to the Frobenius norm:

$$\begin{aligned} \min _{W, M}\Vert A - W M\Vert . \end{aligned}$$
(4)

To visualize how concepts are activated in new images, one can project concept information back to the input: Assume we already have concept components M from optimization on a concept probing set A. For a new activation map \({A}'\in \mathbb {R}^{(1\times h\times w)\times c}\) we want to find which image region activated which concepts. To do so we follow [16, 37, 54] and obtain \(W'{:}{=}A' \cdot M^\top \). The resulting \({W}'\in \mathbb {R}^{(1\times h\times w)\times k}\) holds, for each of the k components, a concept saliency map of size \(h\times w\) that indicates how strongly the component contributes to each activation map pixel vector. Using the spatial alignment of activation map pixels with the input, one can scale such concept saliency maps from any layer to match the input and, e.g., visualize them as overlays (cf. Fig. 3).

PCA. Here, one uses mean-centered activations \(A_\odot = A - \mu _{A}\), where \(\mu _{A}\) is the mean of A. The component matrix \(M^{\text {PCA}}\) consists of the top-k eigenvectors of the covariance matrix of \(A_\odot \), i.e., the directions of maximum variance in the data. The factorization reads:

$$\begin{aligned} A_\odot &\approx W M^{\text {PCA}}. \end{aligned}$$
(5)

NMF. A fundamental constraint of NMF is the non-negativity of the activations, i.e., it works on \(A^+\) (with \((\cdot )^+{:}{=}\texttt {ReLU}(\cdot )\), where \(\texttt {ReLU}(z) = \max (0, z)\)), and yields components and weights with non-negative entries (\(M=M^+, W=W^+\)). This constraint aligns with the interpretation that the k components should represent additive contributions. Formally:

$$\begin{aligned} A^+ \approx W^+ M^+ \;. \end{aligned}$$
(6)
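As a concrete illustration of Eqs. (4)–(6) and of the saliency projection \(W' = A'M^\top \), the following sketch uses scikit-learn's PCA and NMF on channel-last reshaped activations. Function names and the library choice are illustrative and not prescribed by [54].

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

def discover_concepts(A, k=5, method="nmf"):
    """A: activations of shape (b, c, h, w) from the chosen layer L.
    Returns the concept matrix M of shape (k, c), cf. Eq. (4)."""
    b, c, h, w = A.shape
    A_flat = A.transpose(0, 2, 3, 1).reshape(-1, c)   # activation map pixels as c-dim vectors
    if method == "nmf":
        A_flat = np.maximum(A_flat, 0)                # non-negativity constraint, Eq. (6)
        solver = NMF(n_components=k, init="nndsvda", max_iter=500)
    else:
        solver = PCA(n_components=k)                  # mean-centers internally, Eq. (5)
    solver.fit(A_flat)
    return solver.components_                         # M in R^{k x c}

def concept_saliency(A_new, M):
    """Project a new activation map (c, h, w) onto concepts: W' = A' M^T."""
    c, h, w = A_new.shape
    A_flat = A_new.transpose(1, 2, 0).reshape(-1, c)
    W = A_flat @ M.T                                  # (h*w, k)
    return W.reshape(h, w, -1)                        # one h x w saliency map per concept
```

The per-concept maps can then be upscaled to the input resolution (e.g., via bilinear interpolation) and overlaid on the image, as in Fig. 3.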

3.3 Concept Comparison

Comparison of concepts is important for estimating the similarity of the knowledge they represent. Concepts within the same feature space can be effectively compared using various vector operations: cosine similarity [30, 35], vector arithmetic [18], and distance-based metrics [11, 36, 40, 54]. In this work, due to the nature of the concepts obtained from decomposition-based concept discovery methods (linear combinations of convolutional filters), we utilize cosine similarity for comparing any two discovered concepts u, v:

$$\begin{aligned} \text {sim} ( u, v ) = \frac{u\cdot v}{\Vert u\Vert \cdot \Vert v\Vert }. \end{aligned}$$
(7)

This ensures that only the direction of vectors is considered for the comparison.

When comparing concepts obtained from different layers and/or models, i.e., from distinct latent spaces, one needs to project the concept information to a common space, such as the input or output. One output-based approach is to use metrics that estimate the attribution of concepts to the CNN outputs. These metrics include gradient-based approaches [11, 30], saliency-based methods [37], or those based on concept similarity ranking [37]. We here compare concepts quantitatively using the Jaccard index (IoU) of their concept saliency maps after scaling them to match the input size, as proposed in [37].

4 Experimental Setup

In this section we present the selection of models (Sect. 4.1), the data used in the experiments (Sect. 4.2), and the selection of layers for the analysis of concepts and internal representations (Sect. 4.3).

4.1 Models

In our experiments we use classification models of different architectures pre-trained on the ImageNet-1k [12] dataset, taken from the PyTorch model zoo (see footnote 1); a loading sketch follows the list:

  • VGG-11 [51] (VGG),

  • Compressed SqueezeNet1.1 [27] (SqueezeNet),

  • Inverted residual MobileNetV3-Large [26] (MobileNet),

  • Residual ResNet18 and ResNet50 [24]. Two networks of the same architecture are used to investigate the impact of the model size.
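The models can be obtained directly from torchvision; the sketch below assumes a recent torchvision version with the weights-enum API (exact identifiers may differ across versions).

```python
import torchvision.models as tvm

# ImageNet-1k pre-trained classifiers used in the experiments (the `weights` argument
# follows the torchvision >= 0.13 API; older versions use pretrained=True instead)
models = {
    "VGG":        tvm.vgg11(weights="IMAGENET1K_V1"),
    "SqueezeNet": tvm.squeezenet1_1(weights="IMAGENET1K_V1"),
    "MobileNet":  tvm.mobilenet_v3_large(weights="IMAGENET1K_V1"),
    "ResNet18":   tvm.resnet18(weights="IMAGENET1K_V1"),
    "ResNet50":   tvm.resnet50(weights="IMAGENET1K_V1"),
}
for m in models.values():
    m.eval()   # inference mode for attacks and concept analysis
```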

4.2 Data

We conduct our experiments using diverse classes from a validation subset of the widely used ImageNet-1k (ILSVRC2017) image classification dataset [12]. To facilitate an assessment of the transferability of adversarial attacks and their impact across distinct domains, we selected classes from multiple diverse semantic supercategories. Specifically, we selected four classes from the vehicle supercategory (taxi, fire truck, garbage truck, pickup truck), two classes from the animal supercategory (horse, zebra), and two classes from the fruit supercategory (orange, banana).

For each selected class, we randomly chose 50 images, resized them to a resolution of \(224 \times 224\), and for each network subjected them to targeted cross-attacks using the gradient-based BIM [32], PGD [34], and C&W [8] attacks (see Sect. 3.1). Gradient attacks were executed using the torchattacks (see footnote 2) library [31]. Additionally, we used the ImageNet-Patch (see footnote 3) dataset [39], which comprises pretrained adversarial patches across 10 categories, to implement network-agnostic patch attacks (see Sect. 3.1). Specifically, we applied patches from the banana category to all images.
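For reference, targeted gradient attacks can be generated with torchattacks roughly as follows. The hyperparameters are illustrative, `model`, `images`, and `target_labels` are placeholders, and the exact call for switching to targeted mode varies between library versions.

```python
import torchattacks

# illustrative hyperparameters; see the torchattacks documentation for defaults
attacks = {
    "BIM": torchattacks.BIM(model, eps=8/255, alpha=2/255, steps=10),
    "PGD": torchattacks.PGD(model, eps=8/255, alpha=2/255, steps=10),
    "CW":  torchattacks.CW(model, c=1.0, steps=100, lr=0.01),
}
for name, atk in attacks.items():
    atk.set_mode_targeted_by_label()        # targeted mode (method name is version-dependent)
    x_adv = atk(images, target_labels)      # images in [0, 1]; target_labels: desired class ids
```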

As a result, for our experiments, we utilized 400 clean images, 400 patch-attacked images, and a total of \(42000=(8 \times 7)\times 50 \times 3 \times 5\) (\(\text {class pairs}\times \text {images/attack}\times \text {attack types}\times \text {models}\)) gradient-attacked images (see Fig. 1).

Note that the concept discovery is done unsupervised, so no additional segmentation labels are necessary.

4.3 Layer Selection

For our experiments, we defined two groups of layers: layers for latent space comparison and layers for concept discovery. The selected layers for all tested networks are listed in Table 1.

Table 1. Selected CNN layers for latent space comparison and concept discovery (layer identifiers taken from implementations (see footnote 1), abbreviating f = features, l = layer).

Layers for comparing latent space representations (see Sect. 5.1) are evenly distributed along the depth of the model. In these layers, we assess the impact of adversarial attacks on the internal representations of the models.

For the concept discovery in adversarial attacks (see Sect. 5.2, Sect. 5.3) we use another set of layers, which are located deeper in the networks. This allows us to extract and compare high-level abstract concepts. The selection is based on a subjective qualitative assessment of the concepts extracted in candidate layers.

5 Experimental Results

In this section, we examine the influence of AAs on sample representations within the latent space (Sect. 5.1). Subsequently, leveraging this analysis, we employ concept discovery (mining) to quantitatively and qualitatively assess the alteration of concepts before and after the attacks (Sect. 5.2). Finally, we analyze the components of adversarial perturbation through concept discovery (Sect. 5.3).

Fig. 2. Mean and standard deviation values of cosine similarities for original and attacked activation maps of test samples for several attacks.

5.1 Adversarial Attacks Impact on Latent Space Representations

Research Question 1: What is the impact of adversarial attacks on sample embeddings within the feature space?

In particular, we would expect to see an amplification of errors introduced in the input along the information flow through layers. To assess this, we evaluate the cosine similarities between attacked and non-attacked samples across each origin-target class pair. Figure 2 shows the mean curves and standard deviation intervals for the \(\texttt {garbage truck} \rightarrow \texttt {banana}\), \(\texttt {orange} \rightarrow \texttt {taxi}\), and \(\texttt {pickup} \rightarrow \texttt {zebra}\) attacks across selected layers (Table 1) of all tested networks.
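A sketch of this measurement using forward hooks is given below; layer names follow Table 1, and the helper names are our own, not part of the original setup.

```python
import torch
import torch.nn.functional as F

def layerwise_cosine_similarity(model, layer_names, x_clean, x_adv):
    """Mean cosine similarity between clean and attacked activations per selected layer."""
    acts, hooks = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            acts[name] = output.detach().flatten(1)   # (batch, flattened activation map)
        return hook

    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(x_clean)
        clean = dict(acts)                            # keep references to the clean activations
        model(x_adv)
    for h in hooks:
        h.remove()
    return {n: F.cosine_similarity(clean[n], acts[n], dim=1).mean().item() for n in layer_names}
```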

Across all attack types, we find support for the error amplification hypothesis. In particular, we observe a “snowball effect” of cosine similarity decline: the similarity diminishes superlinearly, approximately exponentially, as we move closer to the deeper layers, indicating a higher impact of attacks on internal representations within these layers. This effect is particularly pronounced in deeper networks containing more non-linear layers. In the final layers of ResNet50 and MobileNet, cosine similarity drops to nearly 0.2 for BIM and PGD attacks. In comparison to other attacks, the C&W attack typically induces smaller perturbations, aligning with its original intention of seeking the minimally sufficient perturbation. The perturbation observed in the initial layers of the adversarial patch attack (PATCH) (\(\texttt {garbage truck} \rightarrow \texttt {banana}\)) resembles that of BIM and PGD attacks, yet the decline in similarity is less steep, ending in the last layer at approximately the same level as C&W. Based on these findings, we proceed with the following concept discovery experiment in the deep layers of the network, where activation maps are most substantially perturbed.

Summary: With increasing depth, adversarially attacked latent space representations deviate more and more from the original representations with respect to cosine similarity. This error amplification over depth is superlinear and holds across all networks, attacks, and origin-target class pairs.

5.2 Concept Discovery in Adversarial Samples

Research Question 2: What is the impact of adversarial attacks on the main components (concepts) present in latent representations?

In Fig. 3, we showcase qualitative and quantitative outcomes of concept discovery using ICE [54] and saliency-based concept comparison [37] in layer4.0 of ResNet18. We focus on original fire truck samples and samples attacked by BIM and C&W targeting taxi (PGD results omitted due to their similarity to BIM; patch attack results omitted as they, as expected, lead to the discovery of the banana patches as a distinct concept).

Fig. 3. Concept mining results for BIM and C&W (\(\texttt {fire truck} \rightarrow \texttt {taxi}\)) attacks in layer layer4.0 of ResNet18. Top: pairs of top-2 most relevant prototypes of discovered concepts cX with rank X and concept weights (importances); Bottom: discovered concept similarities (Sect. 3.3) for original vs. BIM (left) and original vs. C&W (right).

Qualitatively, (1) AAs modify the concept information, resulting in concept saliency map changes; e.g., the windshield concept (original \(\texttt {c4}\), BIM \(\texttt {c0}\), and C&W \(\texttt {c4}\)) highlights different areas in the 1st prototypes. (2) AAs may introduce new concepts; e.g., BIM \(\texttt {c4}\) is a new spurious concept that cannot be interpreted. (3) Additionally, a change in the most similar concept prototypes can be observed: e.g., the 2nd prototypes of windshield (original \(\texttt {c4}\), BIM \(\texttt {c0}\), and C&W \(\texttt {c4}\)) differ.

Quantitatively, we observe (1) changes of values in the concept similarity matrices (Fig. 3, bottom); (2) alterations of concept importance (e.g., for the concept cabin bottom: original \(\texttt {c0: -7.61}\), BIM \(\texttt {c3: -24.97}\), and C&W \(\texttt {c0: -27.45}\)), which may result in (3) concept rank shuffling (e.g., for cabin bottom: original \(\texttt {c0: {1st}}\), BIM \(\texttt {c3: {3rd}}\), and C&W \(\texttt {c0: {3rd}}\)).

To find similar concepts (counterparts), the columns of the concept similarity matrix are permuted to maximize the sum of the matrix diagonal. The diagonal values indicate the magnitude of similarity between these counterparts. Stronger attacks result in more pronounced perturbations, leading to the emergence of new concepts or the disappearance of old ones. Some concepts may have only very weak counterparts, which can be interpreted as having no counterpart, i.e., the concept was changed: e.g., BIM \(\texttt {c4}\) (Fig. 3, bottom left matrix).
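Finding such counterparts is an assignment problem; one possible way to obtain the diagonal-maximizing permutation is the Hungarian algorithm, sketched below (function names are illustrative).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_concepts(M_orig, M_attacked):
    """Pair concepts (rows of the (k, c) concept matrices) of original and attacked samples
    such that the sum of cosine similarities on the matrix diagonal is maximized."""
    def l2_normalize(M):
        return M / np.linalg.norm(M, axis=1, keepdims=True)
    S = l2_normalize(M_orig) @ l2_normalize(M_attacked).T   # cosine similarity matrix, Eq. (7)
    rows, cols = linear_sum_assignment(-S)                  # maximize by minimizing the negative
    return cols, S[rows, cols]                              # counterpart indices and diagonal values
```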

Fig. 4. Mean numbers of “concept changes” with 99% confidence intervals for threshold values 75, 50, and 25 in all tested models for tested adversarial attacks.

To measure this, for all discovered concepts and every origin-target attack pair, we threshold the matrix diagonal values and estimate the mean and 99% confidence intervals of the number of cases in which the value falls below the threshold. These results for all networks and threshold values of 75, 50, and 25 are depicted in Fig. 4. They represent the average number of “concept change” occurrences under the adversarial attacks, serving as a measure of attack strength. Numerically, we observe similar behavior between BIM and PGD. C&W, which induces the smallest perturbations among the tested attacks, results in the lowest number of concept changes across all networks.

In some rare instances, depending on the chosen threshold, no concepts may be modified. However, the impact of the AA can still be observed through the evaluation of changes in concept weights, concept ranks, or concept prototypes. For instance, in Fig. 3, each C&W attack concept (bottom right matrix) corresponds to a counterpart for which the concept weight is known. To estimate changes in weights and concept ranks, Spearman’s rank correlation and Pearson’s correlation can be measured. For the C&W attack and the setup illustrated in Fig. 3, these correlations are 0.6 and 0.76, respectively, enabling the detection of concept changes.

Summary: Comparing the concept composition in clean sample representations to that in their adversarially attacked counterparts shows: AAs modify concept saliency maps, concept weights and rankings, and concept similarities; strong attacks may even replace concepts entirely.

5.3 Concept Analysis of Adversarial Perturbation

Denote by \(f = f_{L \rightarrow }\circ f_{\rightarrow L}\) the decomposition of the CNN into the part up to layer L and the part from L onwards. The representation of an adversarial perturbation \(\delta \) for a sample x in L’s latent space can be defined as

$$\begin{aligned} \tilde{\delta } = \delta _{x,L} {:}{=}f_{\rightarrow L}(x + \delta ) - f_{\rightarrow L}(x). \end{aligned}$$
(8)

Information Distribution in Adversarial Perturbations. Here, our goal is to discover whether the information within adversarial perturbations \(\tilde{\delta }\) from a given AA type has dominant linear directions and/or an imbalanced distribution. In other words:

Research Question 3: Can the adversarial perturbations in latent space that a given AA type causes be efficiently represented using shared linear components that are globally valid, i.e., valid for all samples of the AA source class? What is the minimum number of components needed to reproduce the effect of a targeted AA type on any source class sample?

For this, we quantify the information distribution across the channel dimension (see Sect. 3.2) of \(\tilde{\delta }\) with a PCA decomposition and assess the cumulative variance explained by the obtained components. We determine the minimum fraction of PCA components (relative to the total number of components in a layer) needed to cumulatively preserve a defined fraction of the variance. The total number equals the number of convolutional filters in each discovery layer (see Table 1).
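The corresponding computation is sketched below: a PCA is fitted to the channel vectors of the latent perturbations \(\tilde{\delta }\) (Eq. (8)), and the cumulative explained variance is inspected. Function names and the input layout are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def component_ratio_for_variance(delta_latent, targets=(0.50, 0.90, 0.99)):
    """delta_latent: latent perturbations of shape (b, c, h, w) for one attack and class pair.
    Returns, per variance target, the fraction of components (relative to c) needed to reach it."""
    b, c, h, w = delta_latent.shape
    D = delta_latent.transpose(0, 2, 3, 1).reshape(-1, c)   # channel vectors of all map pixels
    pca = PCA().fit(D)                                      # as many components as the rank allows
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return {t: (np.searchsorted(cumulative, t) + 1) / c for t in targets}
```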

In Table 2, we present the mean and standard deviations of these ratios, expressed as percentages (%). Across the selected layers (Table 1, “Concept Discovery”) of all networks and attack types, we observe an uneven distribution of information among components: depending on the model and attack type, (1) 0.3% to 11.5% of components are sufficient to retain 50% of the whole variance; and (2) preserving 7.6% to 58.3% of components, we can explain 90% of the variance. (3) In selected layers with a larger number of filters, as in MobileNet (960) and ResNet50 (2048), 99% of the variance is explained by 27.1% (C&W on ResNet50) to 77.9% (BIM, PGD on MobileNet) of components, whereas in models with fewer filters (VGG, SqueezeNet, ResNet18: 512 filters in all selected layers) 99% of the variance is explained by 66.9% (PATCH on VGG) to 91.5% (BIM, PGD on ResNet18) of components.

Table 2. Mean and standard deviation ratios (in %) of preserved PCA components relative to the total number of components in layer (#filters) required to retain a specified amount of AA variance in latent space.

BIM and PGD attacks create larger perturbations and consistently require the highest portions of components to explain the \(\tilde{\delta }\) variance properly. In contrast, the adversarial patch (PATCH) attack necessitates the smallest portion of components. The behavior of the C&W attack varies across models: in some cases, its results align with those of BIM and PGD, while in others, they resemble the outcomes of the PATCH attack.

Summary: The majority of the latent perturbation \(\tilde{\delta }\) information is concentrated in a small subset of globally valid linear components (concepts), i.e., perturbations \(\tilde{\delta }\) caused by a given AA for a chosen source and target class are built from a few meaningful directions in the feature space.

Concept Discovery in Adversarial Perturbations. From the previous experiments we know that latent space perturbations originating from a given attack give rise to a global linear decomposition. For the discovery of concepts in \(\tilde{\delta }\), we next utilize NMF, as it learns more meaningful directions than other decomposition methods [54]. To ensure the non-negativity constraint of NMF, as in ICE [54], we apply \(\tilde{\delta }^+ = \texttt {ReLU}(\tilde{\delta })\). The application of ReLU resulted in a negligible information loss (2–12%) across different models and attack types, ensuring that the extracted concepts capture the majority of the perturbation information. Following the application of NMF (see Sect. 3.2) to \(\tilde{\delta }^+\), we obtain \(M^+ \in \mathbb {R}^{k \times c}\). For ease of notation let \(m_i={M_i^+}/{\Vert M_i^+\Vert }\) be the normalized i-th concept vector. Given the concepts, our next question is:

Research Question 4: Is a single linear component of an attack’s adversarial perturbations in latent space sufficient to reproduce the attack on any source class sample? In other words: can a linear translation in latent space describe an adversarial attack type for a given source and target class?

Fig. 5. The dependency of original and target class probabilities on the adversarial perturbation magnitude \(\gamma \), averaged over all test samples. The original adversarial perturbation is compared to the perturbation projected onto 3 concepts discovered with NMF. (Color figure online)

Recall that the m-component of \(\tilde{\delta }\) for a vector m (i.e., the amount by which \(\tilde{\delta }\) points into direction m) is:

$$\begin{aligned} \text {proj}_{m} \big ( \tilde{\delta } \big ) {:}{=}\big (\tilde{\delta } \cdot m\big )\, m. \end{aligned}$$
(9)

We would now like to compare, at the output level, the effect of slowly applying \(\tilde{\delta }\) with the effect of slowly applying any of its components \(m_i\). Slow application is here realized by linearly interpolating between \(\tilde{x}{:}{=}f_{\rightarrow L}(x)\) and \(\tilde{x} + \tilde{\delta } = f_{\rightarrow L}(x+\delta )\) via \(\tilde{x} + \gamma \tilde{\delta }\), \(\gamma \in [0,1]\). To ensure comparability of the linear interpolations, we interpolate each \(m_i\) via \(\tilde{x}+\gamma \,\text {proj}_{m_i}(\tilde{\delta })\). This ensures that at most \(\tilde{\delta }\) is applied to \(\tilde{x}\). Finally, we estimate and compare the influence of the interpolations on the prediction confidences of the original/target class \(\text {cls}\) for components \(m_i, i\le k\) at the same \(\gamma \in [0,1]\):

$$\begin{aligned} f_{L\rightarrow }\left( \tilde{x} + \gamma \,\tilde{\delta }\right) _{\text {cls}} \quad \text {versus}\quad f_{L\rightarrow }\left( \tilde{x} + \gamma \,\text {proj}_{m_i}(\tilde{\delta })\right) _{\text {cls}}. \end{aligned}$$
(10)

When \(\gamma =0\), the prediction is made for the original image, and at \(\gamma =1\), for the image attacked by the full perturbation or its full i-th component, respectively.
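The comparison of Eq. (10) can be sketched as follows. We assume the projection of Eq. (9) is applied per activation map pixel along the channel dimension, consistent with how concept vectors are defined in Sect. 3.2; `f_head` stands for the remaining network part \(f_{L\rightarrow }\) applied to activation maps, and all names are illustrative.

```python
import torch

def class_confidence_along_direction(f_head, x_tilde, delta_tilde, m, cls, gammas):
    """Confidence of class `cls` when interpolating from x_tilde towards the full latent
    perturbation delta_tilde and towards its m-component (Eqs. (9) and (10)).
    x_tilde, delta_tilde: (c, h, w) activation maps; m: (c,) concept vector."""
    m = m / m.norm()
    coeff = torch.einsum("chw,c->hw", delta_tilde, m)   # amount of m at each spatial location
    proj = torch.einsum("hw,c->chw", coeff, m)          # proj_m(delta_tilde)
    full_conf, proj_conf = [], []
    with torch.no_grad():
        for g in gammas:
            full_conf.append(f_head((x_tilde + g * delta_tilde)[None]).softmax(-1)[0, cls].item())
            proj_conf.append(f_head((x_tilde + g * proj)[None]).softmax(-1)[0, cls].item())
    return full_conf, proj_conf
```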

Fig. 6. Similarity heatmaps of concepts (3 per attack) discovered in adversarial perturbations with NMF. Concepts are denoted as AttackType-ConceptId pairs (attack types: \( \mathrm {B = BIM, P = PGD, C = C \& W, PA = Patch Attack}\)).

In Fig. 5, we display dependency graphs depicting the true class y (solid lines) and attacker target class \(y'\) (dashed lines) output confidences, averaged over all related test samples, as a function of the magnitude \(\gamma \) of \(\tilde{\delta }\) and of the \(k=3\) NMF component projections (distinguished by color in the figure). Results are shown for each technique attacking \(\texttt {pickup} \rightarrow \texttt {banana}\) and for three models (the remainder is skipped due to space constraints; the behavior was similar).

Walking Towards Original Perturbation: In the case of the original perturbation \(\tilde{\delta }\), we observe that (1) gradient attacks, which create large perturbations (PGD, BIM), reach target confidences \(f_{L\rightarrow }(\tilde{x} + \gamma \,\tilde{\delta })_{\text {target}} \approx 1.0\) at \(\gamma = 1\) and maintain this value thereafter, while (2) the low-perturbation attack (C&W) does not reach a value close to 1.0; however, it can be further amplified by increasing \(\gamma > 1\). (3) The PATCH attack behaves similarly to C&W, but in some cases the target confidences drop again at high \(\gamma \) values (see VGG). In other models and origin-target combinations of AAs, the observed behavior was consistent.

Fig. 7. Similarity heatmaps of concepts (3 per attack) discovered in adversarial perturbations with NMF. Concepts are denoted as AttackType-ConceptId pairs (attack types: \( \mathrm {B = BIM, P = PGD, C = C \& W, PA = Patch Attack}\)).

Walking Towards Linear Perturbation Components: For \(k=3\) concepts discovered with NMF decomposition, we observe that (1) one or several NMF components (typically the two most prominent ones) in each of the demonstrated cases lead to successful attacks, i.e., they increase the target class confidences and decrease the original class ones. However, the decrease of the original and the growth of the target class confidences are slower than those for \(\tilde{\delta }\). (2) The discovered concepts behave differently: one of the components usually contributes positively to the original class y probability, reinforcing it. This may be caused by partial information loss resulting from the NMF non-negativity constraint. The observed behavior of NMF concepts is consistent across all networks and adversarial attack origin-target combinations. We observed a similar behavior for concepts discovered with PCA; however, the impact of \(\gamma \) was weaker, meaning that the decrease in y and the growth in \(y'\) were slower. Due to this similarity, we do not present visual results.

Summary: A given adversarial attack can be replaced by walking in latent space in the direction of the most prominent linear component(s) of its adversarial perturbations.

Similarity of Concepts Discovered in Adversarial Perturbations. Considering the observed similarity in the behavior of the perturbation components and of the original perturbation \(\tilde{\delta }\) across all test cases, it is reasonable to assume that we detected similar concepts across the tested AAs, possibly with varying magnitude depending on the attack technique. In other words:

Research Question 5: Do different adversarial attack techniques comprise concept vectors of similar directions in the feature space?

Fig. 8. Similarity of all concepts discovered with NMF in adversarial perturbations aiming at the “taxi” class in layer4.2 of ResNet50. Concepts are denoted as OriginClassId-AttackType-ConceptId pairs (attack types: \( \mathrm {B = BIM, P = PGD, C = C \& W}\)).

To investigate the similarity of concept vectors discovered in \(\tilde{\delta }\) across different attack techniques, we estimate cosine similarities (see Eq. 7) for each pair of vectors. In Figs. 6 and 7 we present the similarity clustermaps (see footnote 4) (clustered heatmaps) of concept vectors \(M^+\) discovered with NMF for all tested attack techniques and different origin-target attack pairs.
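Such clustermaps can be produced, e.g., with seaborn from the pooled and L2-normalized concept vectors. A minimal sketch (labels follow the AttackType-ConceptId naming; the function name is our own):

```python
import numpy as np
import seaborn as sns

def concept_clustermap(concepts, labels):
    """concepts: (n, c) array of concept vectors pooled over attack techniques.
    Plots a clustered heatmap of pairwise cosine similarities (cf. Figs. 6-8)."""
    C = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    S = C @ C.T                                   # pairwise cosine similarities, Eq. (7)
    return sns.clustermap(S, xticklabels=labels, yticklabels=labels, cmap="viridis")
```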

Similarities Across Attack Techniques: We observe that (1) concepts discovered in different attack techniques for the same origin-target pair are similar (co-directed) and form clusters (visible as groups of brighter pixels). For instance, the group PA1, C1, B1, and P1 of ResNet18 (for attacking \(\texttt {zebra} \rightarrow \texttt {banana}\)) contains one vector of each attack type. (2) At least one such group is observed per clustermap. Similar behavior was observed in other origin-target attack pairs not visualized here. (3) Adversarial patch attack vectors are the most distinct, as evident in the \(\texttt {zebra} \rightarrow \texttt {banana}\) column. This distinction can be attributed to the different nature of the attack compared to the other, gradient-based attack techniques.

Similarities Across Origin Classes: We further expand the pairwise concept vector comparison to vectors originating from any category, only fixing the attack target class. In other words, we investigate whether the learned directions of the concept vectors \(M^+\) discovered with NMF are purely target-specific, i.e., to what extent they are agnostic to the attack type and the original true class. In Fig. 8 we showcase a concept similarity clustermap for ResNet50, where the concept vectors target the taxi class (\(\texttt {any} \rightarrow \texttt {taxi}\)). Similar to the notation in Fig. 7, rows and columns are indexed as OriginClassId-AttackType-ConceptId. Here, the additional OriginClassId represents the ImageNet class ID of the AA origin. From such results of target-specific comparison of concept vectors, we observe two large groups of similar (co-directed) concepts (approximately \(20 \times 20\) bright pixels each) and several smaller groups (around \(3 \times 3\) bright pixels) along the main diagonal. The large groups include concept vectors of attacks originating from all tested supercategories (vehicle, animal, and fruit) and targeting the taxi class. Similar results were observed for different attack targets and models: at least one large cluster of concept vectors was observed.

Summary: Our results suggest that adversarial attacks comprise concept vectors of target-specific direction (albeit with varying magnitudes depending on the AA perturbation strength). In particular, AAs can be characterized by directions in feature space that are mostly agnostic to attack technique and attack origin class.

6 Conclusion and Outlook

This work conducted an in-depth analysis of the influence of AAs on the concepts learned by CNNs using concept-based XAI techniques. Through experiments, our results revealed that AAs induce substantial alterations in the concept composition within the feature space, introducing new concepts or modifying existing ones. Remarkably, we demonstrated that the adversarial perturbation itself can be decomposed into a set of global concepts, i.e., linear components shared amongst all adversarial perturbations, a subset of which is sufficient to reproduce the attack’s success. Furthermore, we discovered that different AAs learn similar concept vectors and that these vectors might only be specific to the attack target class irrespective of the origin class.

These findings provide valuable insights into the nature of AAs and their impact on learned representations, paving the way for the development of more robust and interpretable deep learning models and effective defenses against adversarial threats.

Limitations: Our experiments focused on pairwise origin-target attack scenarios and targeted white-box attacks. Extending the analysis to non-targeted attacks, further black-box attacks, and universal attacks, as well as other model types, datasets, and class combinations, is needed in the future to broaden our understanding of AAs’ impact on learned concepts. Also, in order to ensure precision when using linear components for adversary detection, one should check whether and to what extent adversarial components are shared with non-adversarial perturbations and domain shifts, such as Gaussian noise, and how well the different effects can be differentiated.

Future Work: It will be interesting to see whether the concept-level patterns and alterations induced by AAs are useful for (real-time) detection of adversaries. Additionally, our findings on how to globally represent AAs as linear shifts in latent space could inform the design of adversarial elimination techniques that aim to remove or mitigate the impact of adversarial concepts within the learned representations.

Another intriguing direction is the explainable design of AAs themselves. By leveraging our understanding of the target-specific nature of adversarial concepts, novel attack strategies that optimize for specific concept directions could be developed, potentially leading to more effective and robust adversarial examples, which in turn can serve in adversarial training for creating super-robust models.

A potential next iteration of our approach is the exploration of non-linear AAs that manipulate the interactions between concepts rather than solely targeting individual concept directions. Such attacks could potentially circumvent existing defenses and uncover important challenges to the robustness of deep learning models.

In conclusion, our study has demonstrated the value of leveraging explainable AI techniques to gain insights into the impact of AAs on learned representations and concepts within deep neural networks. By bridging the gap between adversarial robustness and CNN latent space interpretability, we hope to have paved the way towards more reliable and trustworthy AI systems capable of withstanding adversarial threats whilst providing transparent and interpretable decision-making processes.