1 Introduction

In recent years, Deep Learning has made remarkable progress on diverse challenges in the field of Computer Vision (CV) [21], such as image classification [24, 79], object detection [77, 80], and semantic segmentation [39, 50]. This success can be largely attributed to the powerful hierarchical representations learned by Deep Neural Networks (DNNs), which are capable of capturing intricate patterns and features within visual data [7]. However, a prevailing concern remains the inherent obscurity of the representations learned by these models. As a consequence, DNNs are often referred to as “black-box” systems, since their internal mechanisms are not readily interpretable or comprehensible to human observers.

While being powerful in capturing intricate patterns within the data, DNNs are susceptible to learning spurious correlations—coincidental relationships, often driven by an unobserved confounding factor, which may lead the model to identify and rely on such misleading patterns [31, 68]. Model dependence on such artifactual features could lead to poor generalization performance of the model on different datasets and pose a substantial risk in the case of safety-critical areas. As such, the identification and subsequent mitigation of these spurious correlations within models are crucial for the development of robust and trustworthy Computer Vision systems.

In this work, we propose a new method called Function-Semantic Contrast Analysis (FSCA), which aims to identify spurious correlations between neural representations (i.e., neurons), given that the target concept of each representation is known. The proposed approach is based on the idea of analyzing the contrast between two distinct metrics: the functional distance, which measures the relationships between representations based on the correlations in their activation patterns, and the knowledge-based semantic distance between the concepts that these representations were trained to encode. Hypothesizing that spurious correlations frequently arise between semantically distant classes due to the influence of an unobserved factor, FSCA identifies potentially spurious pairs as those with high disagreement between the two distance measures. FSCA offers a scalable approach that considerably reduces dependence on human supervision, thereby providing a robust means for the comprehensive evaluation of “black-box” models.

2 Related Works

To address the problem of the opacity of Deep Neural Networks given their widespread popularity across various domains, the field of Explainable AI (XAI) has emerged [27, 36, 49, 61, 62]. The primary goal of XAI is to provide insights into the decision-making processes of complex AI systems, allowing humans to comprehend, trust, and effectively manage these systems [67]. One important class of explainability approaches, known as post-hoc explainability methods, seeks to explain the decision-making processes of trained models without interfering in their training procedure [38, 70]. These methods can be broadly classified into two types based on the scope of their analysis: local explanation methods, which focus on explaining model decisions for specific inputs, and global explanation methods, which aim to interpret general decision-making strategies, allowing for audits and investigations of models across diverse populations and shedding light on the roles of various model components.

Local explanation methods often provide explanations to the decision-making process on a given data point in the form of attribution maps, distributing relevance scores among features of the input, emphasizing the most critical attributes for the prediction. Various methods, such as Layer-wise Relevance Propagation (LRP) [3], GradCAM [64], LIME [58], Integrated Gradients [73], and SHAP [45] have been introduced, and have proven to be effective in explaining the decision-making process in Computer Vision models [15, 75], including Bayesian Neural Networks [13, 19]. To tackle the interpretability issue of attribution maps, several enhancing techniques were introduced, such as SmoothGrad [71], NoiseGrad [18], and Augmented GradCam [51]. Significant focus has been devoted to examining and assessing the effectiveness of local explanation techniques [30, 33, 34]. However, the primary limitation of local explanation methods lies in their ineffectiveness in probing the unknown behaviors of the models. While they prove beneficial for examining existing, known hypotheses, they are ineffective when it comes to uncovering unknown hypotheses, including the identification of unknown spurious correlations and shortcuts [1].

Conversely, global explanation methods aim to interpret the general decision-making strategies employed by the models by shedding light on the roles of specific components such as neurons, channels, or output logits, which are often referred to as representations. Such approaches enable a more general insight into the decision-making strategies of the models, thus facilitating the discovery of unknown and unexpected behavior within these models. Methods such as Network Dissection [4, 5], Compositional Explanations of Neurons [52], and MILAN [35] have been developed to explain the functionality of these latent representations by associating them with human-comprehensible concepts. Activation Maximization Methods [25], on the other hand, aim to explain the concept behind the model’s representations by identifying the inputs that maximally activate a particular neuron or layer in the network and hence, visualize the features that have been learned by the specific representation. These activation maximization images, also referred to as signals, embody the features that the representations have learned to detect and they could be either sampled from an existing dataset [11, 17] or generated artificially through an optimization procedure [53,54,55].

2.1 Spurious Correlations in Computer Vision Models

While excelling at various Computer Vision tasks by being able to learn complex and intricate representations of the data, Deep Neural Networks are susceptible to learning spurious correlations from data. Such correlations represent apparently related variables that, upon closer inspection, reveal a connection rooted in mere coincidence or an underlying, often obscured, factor [31, 68]. This phenomenon, commonly referred to as “Shortcut learning” or the “Clever-Hans effect” [2, 42, 76], often manifests in a strong contrast between the desired and the actual learning strategy of the model. In the following discussion, we provide a general overview of these correlations, which have been identified across a range of CV tasks.

Co-occurring Objects. In the field of image classification, as well as in other CV subdomains, images often contain not only the primary object of interest but also secondary objects in the background. Deep Neural Networks (DNNs) trained on such data can establish associations between the primary objects and frequently occurring secondary objects. Examples of such learned correlations could be fingers and band-aids [69], trains on tracks, or bees and flowers [40]. [66] observed that classifiers heavily depend on the context in which objects are situated, performing poorly in less common contexts, e.g., the absence of a typical co-occurring object. In experiments using the MS COCO dataset [44] it was discovered that classifiers for specific classes, like “Keyboard”, “Mouse”, and “Skateboard”, are highly sensitive to contextual objects and exhibit poor performance when encountered outside their usual context; for instance, keyboards often go unrecognized without a nearby monitor [66].

Object Backgrounds. A prevalent type of spurious correlation arises between background features and target labels [78]. For example, a classifier may rely on a snowy background to identify huskies in images, instead of focusing on the target feature “Husky” [58]. Such correlations could stem from selection bias in training datasets, as demonstrated by the Waterbirds dataset [60], where the target label (“Waterbird” or “Landbird”) is spuriously correlated with background features (water or land) in most training images [29]. Another example involves classifying cows and camels [6], where the target label (“Cow” or “Camel”) is spuriously correlated with background features (green fields or desert) in most images. [69] identified numerous instances of background spurious features in ImageNet.

Biases and Stereotypes. Racial and gender biases stand as notable examples of undesired behavior, leading to adverse real-world consequences, particularly for marginalized groups. These biases can materialize in various ways, such as underdiagnosis in chest radiographs among underrepresented populations [65], or racially discriminatory facial recognition systems that disproportionately misidentify darker-skinned females [16]. Researchers have found instances of racial, religious, and Americentric biases embedded in the representations of the CLIP model [57]. Generative models like Stable Diffusion [59] also exhibit biases that perpetuate harmful racial, ethnic, gendered, and class stereotypes [8]. Other harmful spurious correlations have been found, such as associations between skin tone and sports or gender and professions [72, 82].

Artifacts. Spurious correlations often arise from the presence of artifacts in images across various classes. These artifacts are secondary objects that hold no semantic connection with the primary class, and their coincidental association is unnatural and irregular. For instance, in the ImageNet dataset, Chinese watermarks have been found to influence numerous classes, such as “carton”, “monitor”, “broom”, “apron” and “safe” [20], resulting in up to 26.7% drop in performance when watermarks are added to every class in the validation dataset [43]. This phenomenon has also been observed in the PASCAL VOC 2007 dataset [26], where a photographer’s watermark frequently appears in images from the “horse” class [42]. Consequently, the trained model inadvertently learns this association, affecting its overall performance. Additionally, spurious correlations caused by artifacts have been noted in medical applications, such as skin lesion detectors, where artifacts like rulers and human-made ink markings or stains are present [10]. Similarly, hospital-specific metal tokens in chest X-ray scans [28, 81] and radiologist input artifacts in brain tumor MRI classification [76] have been found to impact the accuracy of these applications.

2.2 Finding and Suppressing Spurious Correlations

The primary challenge in identifying spurious correlations stems from the lack of a concrete definition or criteria that differentiate them from “permissible” correlations. This ambiguity is reflected in the majority of methods’ reliance on extensive human oversight. Spectral Relevance Analysis (SpRAy) was designed to aid in the identification of spurious correlations by clustering local attribution maps for future manual inspection [2, 42]. However, the dependence on local explanations restricts the range of spurious correlations identified to basic, spatially static artifacts. This limitation necessitates a significant amount of human supervision and tailoring of the method’s various hyperparameters to suit the specific problem at hand, which subsequently constrains the detection of unknown, unexpected correlations. The Data-Agnostic Representation Analysis (DORA) method approaches the problem of spurious correlations from a different angle, by analyzing relationships between internal representations [17]. The authors introduced the functional Extreme-Activation distance measure between representations, demonstrating that representations encoding undesired spurious concepts are often observed to be outliers in this distance metric.

A subsequent challenge is to revise or update the model after identifying spurious correlations. The Class Artifact Compensation framework was introduced, enabling the suppression of undesired behavior through either a fine-tuning process or a post-hoc approach by injecting additional layers [2]. An alternative method involves augmenting the training dataset after uncovering an artifact, so that the artifact is shared among all data points, rendering it an unusable feature for recognition by the model [43]. To suppress spurious behavior in transfer learning, a straightforward method was proposed to first identify representations that have learned spurious concepts, and then, during the fine-tuning phase, exclude these representations from the fine-tuning process [20].

Fig. 1.

Various cases of function-semantic relationship. This figure illustrates four primary scenarios of potential relationships between functional and semantic distances. In our analysis, we mainly focus on instances where representations exhibit high functional similarity, while the concepts they were trained to detect differ semantically, illustrated in the first quadrant of the figure.

2.3 Visual-Semantic Relationship in Computer Vision

In the field of Computer Vision, both visual and semantic similarities play crucial roles in the comprehension and interpretation of images and their underlying concepts. Visual similarity refers to the resemblance between two images based on their appearances, whereas semantic similarity denotes the extent of relatedness between the meanings or concepts associated with the images. A widely accepted definition of semantic similarity takes into account the taxonomical or hierarchical relationships between the concepts [32]. There is a general observation that semantic and visual similarities tend to be positively correlated, as an increase in semantic similarity between categories is typically accompanied by a rise in visual similarity [14, 23]. DNNs trained on Computer Vision tasks demonstrate the ability to indirectly learn class hierarchies [9].

3 FSCA: Function-Semantic Contrast Analysis

In this work, we propose a novel method called Function-Semantic Contrast Analysis (FSCA). This method makes it possible to identify pairs of output representations that may possess spurious associations. FSCA capitalizes on the functional distance between representations, which can be calculated using the activations of representations on the given dataset, and the knowledge-based semantic distance between concepts, obtained from taxonomies or other knowledge databases. By examining the contrast between the two distance metrics, our primary focus lies in revealing pairs of representations that exhibit a high degree of functional similarity but whose underlying concepts are semantically very different, i.e., which are located in the first quadrant of Fig. 1. While disagreements between functional and semantic distances are often natural, as some concepts may share visual similarity while remaining semantically distinct [14], we observe that such behavior frequently results from undesired correlations present in the training data.

3.1 Method

Let us consider a neural network layer \(\mathcal {F} = \{f_1, \dots , f_k\}\), consisting of \(k\) distinct functions, \(f_i: \mathbb {D} \rightarrow \mathbb {R}, \forall i \in \{1, \dots , k\}\), referred to as neural representations, that are mappings from the data domain \(\mathbb {D}\) to the activation of the \(i\)-th neuron in the layer. We further assume that the concepts associated with each representation are known, and define a set of concepts \(\mathcal {C} = \{c_1, \dots , c_k\}\), where \(c_i\) denotes the concept underlying the representation \(f_i, \forall i \in \{1, \dots , k\}\). Thus, we can define the set \(\mathcal {P} = \{(f_1, c_1), \dots , (f_k, c_k)\} \subset \mathcal {F}\times \mathcal {C}\) as a collection of representation-concept pairs.

We consider two distance metrics, \(d_{\mathcal {F}}\) and \(d_{\mathcal {C}}\), defined on the respective sets \(\mathcal {F}\) and \(\mathcal {C}\): \( d_{\mathcal {F}}: \mathcal {F} \times \mathcal {F} \rightarrow \mathbb {R}, \quad d_{\mathcal {C}}: \mathcal {C} \times \mathcal {C} \rightarrow \mathbb {R}, \) where \(d_{\mathcal {F}}\) measures the functional distance between the representations learned by the network, and \(d_{\mathcal {C}}\) measures the semantic distance between the concepts these representations are trained to encode. Accordingly, we define two \(k \times k\) distance matrices, \(F\) and \(C\), as follows:

$$\begin{aligned} F = \begin{bmatrix} d_{\mathcal {F}}(f_1, f_1) &{} \dots &{} d_{\mathcal {F}}(f_1, f_k) \\ \vdots &{} \ddots &{} \vdots \\ d_{\mathcal {F}}(f_k, f_1) &{} \dots &{} d_{\mathcal {F}}(f_k, f_k) \end{bmatrix}, \quad C = \begin{bmatrix} d_{\mathcal {C}}(c_1, c_1) &{} \dots &{} d_{\mathcal {C}}(c_1, c_k) \\ \vdots &{} \ddots &{} \vdots \\ d_{\mathcal {C}}(c_k, c_1) &{} \dots &{} d_{\mathcal {C}}(c_k, c_k) \end{bmatrix}. \end{aligned}$$

Given two neural representations \(f_i, f_j \in \mathcal {F}\) with corresponding concepts \(c_i, c_j \in \mathcal {C},\) one can assess the contrast between functional and semantic distance by comparing \(d_{\mathcal {F}}(f_i, f_j)\) with \(d_{\mathcal {C}}(c_i, c_j).\) However, such a direct comparison might not be optimal, since functional and semantic distance measures can have different scales. To overcome this challenge, we suggest a non-parametric approach, in which the ranks of the distances within their corresponding distributions are analyzed instead.

In the following, the unique distances are collected from the upper triangular portion of each distance matrix, including the main diagonal and all elements above it:

$$\begin{aligned} F_{\varDelta } &= \left\{ d_{\mathcal {F}}(f_i, f_j) \mid \forall i \in \{1, \dots , k\}, \forall j \in \{i, \dots , k\}\right\} ,\end{aligned}$$
$$\begin{aligned} C_{\varDelta } &= \left\{ d_{\mathcal {C}}(c_i, c_j) \mid \forall i \in \{1, \dots , k\}, \forall j \in \{i, \dots , k\}\right\} . \end{aligned}$$

We define matrices \(F^*, C^*\) as

$$\begin{aligned} F^* = \begin{bmatrix} d^*_{\mathcal {F}}(f_1, f_1) &{} \dots &{} d^*_{\mathcal {F}}(f_1, f_k) \\ \vdots &{} \ddots &{} \vdots \\ d^*_{\mathcal {F}}(f_k, f_1) &{} \dots &{} d^*_{\mathcal {F}}(f_k, f_k) \end{bmatrix}, \quad C^* = \begin{bmatrix} d^*_{\mathcal {C}}(c_1, c_1) &{} \dots &{} d^*_{\mathcal {C}}(c_1, c_k) \\ \vdots &{} \ddots &{} \vdots \\ d^*_{\mathcal {C}}(c_k, c_1) &{} \dots &{} d^*_{\mathcal {C}}(c_k, c_k) \end{bmatrix}, \end{aligned}$$

where \(\forall i,j \in \{1, \dots , k\}\)

$$\begin{aligned} d^*_{\mathcal {F}}(f_i, f_j) = \textrm{cdf}^{-1}_{F_{\varDelta }}\left( d_{\mathcal {F}}(f_i, f_j)\right) , \quad d^*_{\mathcal {C}}(c_i, c_j) = \textrm{cdf}^{-1}_{C_{\varDelta }}\left( d_{\mathcal {C}}(c_i, c_j)\right) , \end{aligned}$$

and \(\textrm{cdf}^{-1}\) maps a distance to its percentile within the corresponding collection of distances, so that the transformed distances lie in \([0, 1]\).

Finally, for every pair of neural representations we define the function-semantic contrast score based on the difference between the percentile of the functional distance, and the percentile from the semantic distance between corresponding concepts.

Definition 1

Let \(\mathcal {P} = \{(f_1, c_1), \dots , (f_k, c_k)\} \subset \mathcal {F}\times \mathcal {C}\) be a collection of representation-concept pairs corresponding to the outputs of a DNN, and let \(d_{\mathcal {F}}, d_{\mathcal {C}}\) be two metrics defined on \(\mathcal {F}\) and \(\mathcal {C},\) respectively. Furthermore, let \(F_{\varDelta }, C_{\varDelta }\) be the collections of unique distances among neural representations and concepts, respectively. For \(p_i, p_j \in \mathcal {P}\), we define the contrast score as

$$\begin{aligned} \textrm{fsc}\left( p_i, p_j\right) = \textrm{cdf}^{-1}_{C_{\varDelta }}\left( d_{\mathcal {C}}(c_i, c_j)\right) - \textrm{cdf}^{-1}_{F_{\varDelta }}\left( d_{\mathcal {F}}(f_i, f_j)\right) . \end{aligned}$$

Contrast scores range from −1 to 1, with high contrast scores indicating cases where representations display significant functional similarity, while the underlying concepts are semantically distinct. This particular type of function-semantic relationship is our primary focus and is illustrated in Fig. 1.

In practice, to detect spurious correlations within the output representations, each pair of representations is assigned a contrast score, and pairs are sorted in descending order. Pairs with the highest contrast scores highlight the discrepancy between the model’s perception and the human-defined semantic distance. Subsequently, each pair can be manually investigated by a human to determine the causal reason for such contrast.
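The ranking procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `upper_triangle`, `percentile`, and `contrast_scores` are hypothetical, the percentile is taken as the fraction of collected distances that do not exceed a given value (one possible convention for handling ties), and the toy matrices are made up.

```python
from bisect import bisect_right
from itertools import combinations


def upper_triangle(matrix):
    """Collect and sort the distances on and above the main diagonal."""
    k = len(matrix)
    return sorted(matrix[i][j] for i in range(k) for j in range(i, k))


def percentile(sorted_values, value):
    """Fraction of collected distances that are <= value (assumed convention)."""
    return bisect_right(sorted_values, value) / len(sorted_values)


def contrast_scores(F, C):
    """Return (score, i, j) for all i < j, sorted by descending score."""
    f_delta, c_delta = upper_triangle(F), upper_triangle(C)
    scores = []
    for i, j in combinations(range(len(F)), 2):
        # Semantic percentile minus functional percentile, as in Definition 1.
        score = percentile(c_delta, C[i][j]) - percentile(f_delta, F[i][j])
        scores.append((score, i, j))
    return sorted(scores, reverse=True)


# Made-up toy example: representations 0 and 2 are functionally close
# (small F) but semantically distant (large C).
F = [[0.0, 0.7, 0.1],
     [0.7, 0.0, 0.6],
     [0.1, 0.6, 0.0]]
C = [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.8],
     [0.9, 0.8, 0.0]]
ranking = contrast_scores(F, C)
print(ranking[0])
```

In this toy example the pair (0, 2) receives the highest contrast score and would be the first candidate for manual inspection.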

3.2 Selecting a Distance Metric Between Representations

A crucial aspect of our proposed method’s performance lies in the choice of an appropriate distance metric for the comparison of the output representations, which must reflect the similarity in activation patterns between pairs of representations within the network. Consider the dataset \(\mathcal {D} = \{x_1, \dots , x_N\} \subset \mathbb {D},\) consisting of \(N\) independent and identically distributed data points from the data distribution. For a layer \(\mathcal {F}\) with \(k\) representations, we define the vector \(A_i = ({f_i(x_1), \dots , f_i(x_N)}) \in \mathbb {R}^N, \forall i \in \{1, \dots , k\},\) which contains the activations of the \(i\)-th representation across the dataset. We assume that all vectors \(A_i, \forall i \in \{1, \dots , k\}\) are standardized, with a sample mean of 0 and a standard deviation of 1.

Our approach permits flexibility in choosing the distance metric between representations. In this work, we utilize the Extreme-Activation (EA) distance metric, derived from the analysis of natural data [17]. Drawing inspiration from the study of Activation-Maximization signals (AMS), which are data points that maximally activate a given representation, the EA distance quantifies the extent to which two representations are activated by each other’s AMS. This provides insights into how the representations are influenced by the features present in the AMS.

To calculate the pair-wise Extreme-Activation distance, the dataset \(\mathcal {D}\) is partitioned into \(n\) disjoint blocks, each of length \(d\): \(\mathcal {D} = \bigcup _{t = 1}^{n} D_t, D_t = \left\{ x_{(t-1)d+1},..., x_{td}\right\} , \forall t \in \{1, ..., n\}\). Subsequently, for each representation \(f_i \in \mathcal {F}, \forall i \in \{1, \dots , k\}\), we define a set of natural Activation-Maximization signals (n-AMS) as \(S_i = \left\{ s^i_1,..., s^i_n\right\} ,\) where

$$\begin{aligned} s^i_t = \mathop {\mathrm {arg\,max}}\limits _{x\in D_t} f_i\left( x\right) , \forall t \in \{1, ..., n\}. \end{aligned}$$

For every two representations \(f_i, f_j \in \mathcal {F}\), we define their pair-wise representation activation vectors (RAVs) \(r_{ij}, r_{ji}\) as:

$$\begin{aligned} r_{ij} = \begin{pmatrix} \frac{1}{n}\sum _{t=1}^n f_i\left( s^i_t\right) \\ \frac{1}{n}\sum _{t=1}^n f_j\left( s^i_t\right) \end{pmatrix}, \quad r_{ji} = \begin{pmatrix} \frac{1}{n}\sum _{t=1}^n f_i\left( s^j_t\right) \\ \frac{1}{n}\sum _{t=1}^n f_j\left( s^j_t\right) \end{pmatrix}. \end{aligned}$$

Subsequently, we define the pair-wise Extreme-Activation distance between two representations as a function of the cosine of the angle between their corresponding RAVs.

Definition 2 (Extreme-Activation distance)

Let \(f_i, f_j\) be two neural representations, and \(r_{ij}, r_{ji}\) be their pair-wise RAVs. We define a pair-wise Extreme-Activation distance as

$$\begin{aligned} d_\mathcal {F}\left( f_i, f_j\right) = \frac{1}{\sqrt{2}}\sqrt{1 - \cos \left( r_{ij}, r_{ji}\right) }, \end{aligned}$$

where \(\cos (A, B)\) is the cosine of the angle between vectors \(A\) and \(B\).

Extreme-Activation distance quantifies the activation of n-AMS between two representations, offering a valuable metric for examining the relationships among intricate non-linear functions [17]. In contrast to other metrics, such as Pearson correlation, the EA distance utilizes a small subset of n-AMS for each representation, enabling a straightforward visual inspection of Activation-Maximization signals. This metric, grounded in the measure of how two representations are co-activated on their most activating signals, allows practitioners to easily discern the shared visual features between two sets of n-AMS.

Figure 2 demonstrates the EA distance between two representations, corresponding to the “Snow leopard” and “Crossword puzzle” classes, derived from an ImageNet [22] pre-trained DenseNet161 network [37]. This figure enables an effortless assessment of the functional similarity between the two representations. We can observe that the RAVs are not perpendicular, implying a functional dependence between the representations. Moreover, a visual inspection of the n-AMS for both representations reveals a similar black-and-white texture pattern that both representations have learned to detect.

Fig. 2.

Interpreting the Extreme-Activation Distance. Given two output representations from the DenseNet161 network, \(f_i\) and \(f_j\), corresponding to the “Snow leopard” and “Crossword puzzle” classes respectively, two sets of n-AMS signals were sampled (orange and blue, respectively). The left figures display the distribution of activations for both representations across the ImageNet-2012 validation dataset, with the positions of the n-AMS indicated. The right figure presents the pair-wise RAVs, alongside the activations of all data points (gray) and the representation-specific n-AMS (blue, orange). The EA distance is computed from the cosine of the angle between the RAVs. (Color figure online)

The EA distance varies between 0 and 1. Low values correspond to small angles between RAVs, indicating that both representations are highly activated on each other’s AMS. Perpendicular RAVs, which represent cases where the representations are indifferent to each other’s AMS, yield a distance equal to \(\frac{1}{\sqrt{2}} \approx 0.7071\). Higher EA distances signify situations where the n-AMS of the representations negatively affect one another, meaning that the AMS of one representation deactivates the other.
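The computation of the EA distance from two activation vectors can be sketched as follows. This is an illustrative sketch and not the reference implementation of [17]: the helper names `natural_ams`, `mean_at`, and `ea_distance` are hypothetical, blocks are indexed from 0 as is usual in Python, the dataset size is assumed to be divisible by the number of blocks, and the toy activation vectors are made up (zero mean, but not scaled to unit variance, which does not affect the angle between the RAVs here).

```python
import math


def natural_ams(activations, n):
    """Index of the maximally activating data point in each of n blocks."""
    d = len(activations) // n  # block length; assumes len(activations) == n * d
    return [max(range(t * d, (t + 1) * d), key=activations.__getitem__)
            for t in range(n)]


def mean_at(activations, indices):
    """Mean activation over the data points with the given indices."""
    return sum(activations[s] for s in indices) / len(indices)


def ea_distance(a_i, a_j, n):
    """EA distance between two representations given their activation vectors."""
    s_i, s_j = natural_ams(a_i, n), natural_ams(a_j, n)
    r_ij = (mean_at(a_i, s_i), mean_at(a_j, s_i))  # both evaluated on n-AMS of f_i
    r_ji = (mean_at(a_i, s_j), mean_at(a_j, s_j))  # both evaluated on n-AMS of f_j
    dot = r_ij[0] * r_ji[0] + r_ij[1] * r_ji[1]
    # Note: a zero-length RAV would cause a division by zero here.
    cos = dot / (math.hypot(*r_ij) * math.hypot(*r_ji))
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point rounding
    return math.sqrt(1.0 - cos) / math.sqrt(2.0)


# Made-up toy activations over four data points, using a single block (n = 1).
a = [1.0, 0.0, -1.0, 0.0]
b = [0.0, 1.0, 0.0, -1.0]   # indifferent to the AMS of a
c = [-x for x in a]         # deactivated by the AMS of a

print(ea_distance(a, a, 1), ea_distance(a, b, 1), ea_distance(a, c, 1))
```

The three toy cases reproduce the regimes discussed above: identical representations give a distance close to 0, perpendicular RAVs give \(\frac{1}{\sqrt{2}}\), and mutually deactivating representations give a distance close to 1.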

3.3 Selecting a Distance Metric Between Concepts

The choice of functional and semantic distances between representations and concepts, respectively, is critical. Semantic distance should encapsulate human-defined relationships, particularly ensuring that these distances do not rely on spurious or undesired correlations. Function-Semantic Contrast Analysis (FSCA) can utilize any concept metric, including expert-defined knowledge-based distance measures. For example, semantic distances can be derived from the WordNet database [48], which groups English words into synsets connected by semantic relationships.

In this work, we employ the Wu-Palmer (WUP) distance metric defined on the WordNet taxonomy database. The WUP distance is based on the depth of the least common subsumer (LCS), which is the most specific synset that is an ancestor of both input synsets [56]. The WUP distance computes relatedness by considering the depth of the LCS and the depths of the input synsets in the hierarchy.

Definition 3

Let \(c_i, c_j \in \mathcal {C},\) be two concepts, and let \(w_i, w_j\) be the corresponding synsets from the WordNet taxonomy database. The Wu-Palmer distance is defined as:

$$\begin{aligned} d_{\mathcal {C}}(c_i, c_j) = 1 - 2\frac{l(r, lcs(w_i, w_j))}{l(r, w_i) + l(r, w_j)}, \end{aligned}$$

where \(lcs(x, y)\) is the Least Common Subsumer [56] of two synsets \(x\) and \(y\), \(r\) is the taxonomy root, and \(l(x, y)\) is the length of the shortest path between WordNet synsets \(x\) and \(y\).
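To make Definition 3 concrete, the following sketch evaluates the formula on a small, hypothetical tree-shaped taxonomy. The taxonomy, its node names, and the helper functions are invented for illustration and do not reproduce the real WordNet graph, in which a synset can have several parents; library implementations of Wu-Palmer similarity may also count depths differently.

```python
# Hypothetical toy taxonomy (child -> parent); NOT the real WordNet hierarchy.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "feline": "animal",
    "canine": "animal",
    "leopard": "feline",
    "husky": "canine",
    "puzzle": "artifact",
}
ROOT = "entity"


def path_to_root(node):
    """Return the list [node, parent, ..., ROOT]."""
    path = [node]
    while path[-1] != ROOT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Number of edges between the root and the node, i.e. l(r, node)."""
    return len(path_to_root(node)) - 1


def lcs(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a))
    return next(n for n in path_to_root(b) if n in ancestors_a)


def wup_distance(a, b):
    """Wu-Palmer distance as in Definition 3 (undefined if a and b are both the root)."""
    return 1.0 - 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(wup_distance("leopard", "husky"))    # share the ancestor "animal"
print(wup_distance("leopard", "puzzle"))   # share only the root
```

In this toy tree, “leopard” and “husky” meet at “animal” (depth 1), giving a distance of \(1 - 2/6 = 2/3\), whereas “leopard” and “puzzle” meet only at the root, giving the maximal distance of 1.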

The WUP distance takes into account the specificity of the common ancestor, rendering it more robust to the structure of the WordNet hierarchy in comparison to other semantic distance metrics such as Shortest-Path distance or Leacock-Chodorow [17, 63]. Moreover, the Wu-Palmer distance offers a more fine-grained measure of relatedness. Figure 3 demonstrates the Wu-Palmer distance between 1000 ImageNet classes that share natural connections to WordNet synsets. The structure of the semantic distance matrix (center) aligns with the location of the primary groups of classes within the dataset, as illustrated in the figure to the right.

Fig. 3.

Illustration of functional and semantic distance matrices. From left to right: EA distance between DenseNet161 output representations, Wu-Palmer distance between 1000 ImageNet classes, and a visualization of the location of several highly interesting hyperclasses within the distance matrices.

4 Experiments

This section provides a detailed examination of various implemented experiments. These include an evaluation of the performance of the FSCA method in light of the given ground truth. Furthermore, we explore the practical application of FSCA to the widely-employed DenseNet-161 model. Finally, we conduct a broad assessment of ImageNet-trained models, focusing on the relationship between performance and the functional similarities between representations.

4.1 Evaluation Given the Ground Truth

To evaluate the effectiveness and suitability of the proposed methodology, we investigated its capability to identify instances of representation pairs previously acknowledged to exhibit spurious correlations. This analysis utilized two ImageNet-trained models, specifically GoogLeNet [74] and DenseNet-161 [37], both previously reported to possess a significant proportion of output representations susceptible to watermark text detection [20].

Consider \(\mathcal {P}_G, \mathcal {P}_D\) as collections of representation-concept pairs for the 1000 output representations of the two networks, i.e., the pre-softmax output logits. For each of these models, employing a technique akin to that described in [20], we identified subsets \(\mathcal {Z}_G \subset \mathcal {P}_G, \mathcal {Z}_D \subset \mathcal {P}_D\) of representation-concept pairs with high discriminatory capability (AUROC > 0.9) towards watermarked images, implying that such representations exhibit spurious correlations towards watermarked images and generally assign higher activations to images in which a watermark is present. For GoogLeNet, we found \(|\mathcal {Z}_G| = 21\) such output representations, including “carton”, “broom”, “apron” and others, while for DenseNet-161 we found \(|\mathcal {Z}_D| = 22\). We applied FSCA to both sets \(\mathcal {P}_G\) and \(\mathcal {P}_D\) using the functional Extreme-Activation distance, computed over \(n = 10\) n-AMS with parameter \(d = 5000\). For the semantic distance between concepts, we chose the Wu-Palmer distance, considering the inherent link between ImageNet concepts and the WordNet taxonomy. After calculating Function-Semantic Contrast (FSC) scores for each pair of representations, we compared these scores between two groups: those pairs known to be susceptible to spurious correlations and the rest. More specifically, we defined two sets:

$$\begin{aligned} \text {FSC}_G^{-} &= \left\{ \text {fsc}(p_i, p_j) \mid \forall p_i, p_j \in \mathcal {P}_G, i > j, \{p_i, p_j\} \not \subset \mathcal {Z}_G \right\} ,\end{aligned}$$
$$\begin{aligned} \text {FSC}_G^{+} &= \left\{ \text {fsc}(p_i, p_j) \mid \forall p_i, p_j \in \mathcal {Z}_G, i > j\right\} , \end{aligned}$$

where \(\text {FSC}_G^{-}\) denotes the set of FSC scores for representation-concept pairs from GoogLeNet, in which at least one representation was not identified as being susceptible to Chinese watermark detection. Conversely, \(\text {FSC}_G^{+}\) represents the FSC scores for the representation-concept pairs where both representations were recognized to be susceptible to spurious correlations. We similarly defined sets \(\text {FSC}_D^{-}\) and \(\text {FSC}_D^{+}\) for the DenseNet-161 model. For GoogLeNet and DenseNet-161, 210 and 231 pairs of representations, respectively, were flagged as exhibiting spurious correlation, among a total of 499500 pairs.
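As a quick check of these counts, the number of unordered pairs within a set of size \(m\) is \(\binom{m}{2} = m(m-1)/2\), which for the subset sizes \(|\mathcal {Z}_G| = 21\), \(|\mathcal {Z}_D| = 22\) and the 1000 output representations gives

$$\begin{aligned} \binom{21}{2} = \frac{21 \cdot 20}{2} = 210, \quad \binom{22}{2} = \frac{22 \cdot 21}{2} = 231, \quad \binom{1000}{2} = \frac{1000 \cdot 999}{2} = 499500. \end{aligned}$$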

Figure 4 visually presents the differences between the \(\text {FSC}^{-}\) and \(\text {FSC}^{+}\) distributions for both models. For each model, the FSC scores for “watermark” pairs, defined as pairs of representations where both classes were identified as susceptible to watermark detection, are consistently higher than those for other representation pairs. This observation was further corroborated by the Mann-Whitney U test [46] under a standard significance level (0.05).
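For illustration, a Mann-Whitney U test of this kind can be sketched in a few lines. The sketch assumes a one-sided alternative (scores in the first group tend to be larger), which the text does not specify, and uses the normal approximation without tie or continuity correction. The function name and the toy samples are hypothetical. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred, as the quadratic loop below is slow for several hundred thousand pairs.

```python
from math import erfc, sqrt


def mann_whitney_u_greater(x, y):
    """One-sided Mann-Whitney U test that values in x tend to exceed those in y.

    Returns (U, p), where p comes from the normal approximation
    without tie or continuity correction (large-sample use only).
    """
    n1, n2 = len(x), len(y)
    # U counts pairs (xi, yj) with xi > yj; ties contribute one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean = n1 * n2 / 2
    std = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / std
    p = 0.5 * erfc(z / sqrt(2))  # upper-tail probability of a standard normal
    return u, p


# Placeholder samples, not FSC scores from the paper.
scores_plus = [0.8, 0.9, 0.7, 0.95]
scores_minus = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
u_stat, p_value = mann_whitney_u_greater(scores_plus, scores_minus)
```

With these placeholder samples every value in the first group exceeds every value in the second, so U equals the number of cross-group pairs and the null hypothesis is rejected at the 0.05 level.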

If we constrain the FSCA analysis solely to representation pairs exhibiting substantial functional similarity, specifically those falling within the top 2.5% (\(d^*_{\mathcal {F}} \le 0.025\)), the results for GoogLeNet indicate 8 spurious pairs (out of 210) among the top 1000 pairs with the highest FSC, 38 within the top 5000, and 52 within the top 10000. Implementing the same methodology with DenseNet-161 yields no spurious pairs (out of 231) within the top 1000, 32 pairs within the top 5000, and 42 within the top 10000. This implies that by focusing exclusively on representation pairs with high functional similarity, we can recover 25% (52 pairs out of 210) and 18% (42 pairs out of 231) of the pairs displaying known spurious correlations for GoogLeNet and DenseNet-161, respectively, merely by scrutinizing 2% (10000 pairs) of the total representation pairs. Our results suggest that FSCA tends to allocate high FSC scores to pairs of representations known to be susceptible to spurious correlations, thereby lending further credibility to the proposed methodology. However, it is important to note a limitation of this experiment: while we have knowledge of spurious correlations that arise from the reliance on Chinese watermarks, we cannot rule out spurious correlations among the other pairs.
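The recovery rates above are a recall-at-\(k\) computation over the FSC ranking. A minimal sketch follows; the helper `recall_at_k` and the toy ranking are illustrative assumptions, while the final three lines reproduce the arithmetic from the counts reported in the text.

```python
def recall_at_k(ranked_pairs, known_spurious, k):
    """Fraction of known spurious pairs found among the first k ranked pairs."""
    hits = sum(1 for pair in ranked_pairs[:k] if pair in known_spurious)
    return hits / len(known_spurious)


# Toy ranking (highest FSC first) with hypothetical pair identifiers.
ranked = [(3, 1), (2, 0), (5, 4), (4, 1)]
known = {(2, 0), (4, 1)}
recall_top2 = recall_at_k(ranked, known, 2)
recall_top4 = recall_at_k(ranked, known, 4)

# Arithmetic behind the reported percentages.
googlenet_recall = 52 / 210            # approx. 0.248
densenet_recall = 42 / 231             # approx. 0.182
inspected_fraction = 10_000 / 499_500  # approx. 0.020
```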

Fig. 4. The contrast between the distribution of FSC scores among pairs of representations known to be susceptible to spurious correlations (Chinese watermarks, orange), and all other pairs (blue). The figure demonstrates that FSCA typically assigns higher FSC scores to pairs recognized as having spurious correlations. (Color figure online)

4.2 Identifying Spurious Correlations in ImageNet Trained DenseNet-161

To demonstrate the potential utility and relevance of our proposed approach, we investigated in detail the results of applying FSCA to the widely used ImageNet-trained DenseNet-161 model. Hyperparameters for the analysis were kept the same as in the previous experiment: we employed the functional Extreme-Activation distance metric with \(n = 10\), allowing us to analyze the co-activation of representations based on the 10 Activation-Maximization images and providing a straightforward method for interpreting the shared features that the representations are trained to recognize. Due to the impracticality of examining all pairs, our analysis focused solely on pairs with high functional similarity based on Extreme-Activation, specifically those within the top 1% of the smallest distances. From these, we analyzed the 1000 pairs with the highest contrast scores. We report several significant categories of correlations observed between the logit class representations of the DenseNet-161 model, found by the FSCA method.
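The two-stage selection described above (restrict to the functionally closest pairs, then rank by contrast) can be sketched as follows. This is an illustrative reading, not the authors' code: it interprets "top 1%" as a quantile over the ranked functional distances, and the function name, tuple layout, and toy values are assumptions.

```python
def select_candidates(pairs, top_fraction=0.01, n_report=1000):
    """Keep the functionally closest pairs, then rank them by FSC score.

    pairs: list of (pair_id, functional_distance, fsc_score) tuples.
    """
    by_distance = sorted(pairs, key=lambda p: p[1])
    n_keep = max(1, round(len(by_distance) * top_fraction))
    closest = by_distance[:n_keep]
    return sorted(closest, key=lambda p: p[2], reverse=True)[:n_report]


# Toy input with placeholder values (not measurements from the paper).
toy_pairs = [(i, i / 200, (i * 37 % 101) / 101) for i in range(200)]
shortlist = select_candidates(toy_pairs, top_fraction=0.1, n_report=5)
```

Under this reading, the 499500 pairs of a 1000-class model would be reduced to roughly 5000 functionally similar pairs, of which the 1000 with the highest FSC scores are passed to manual inspection.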

Fig. 5. Illustration of several representation pairs sharing natural visual features. The figure shows four different pairs of representations, with each subfigure depicting the geometry of pairwise RAVs and two n-AMS per representation. The observed functional similarities are attributed to the natural visual similarity between classes and are not considered spurious, as the representations detect features characteristic of each other.

Shared Visual Features. Since semantic distance offers a metric for evaluating the conceptual differences between entities, it is natural for some concepts, despite being semantically distinct, to share visual features with one another. Such relationships between representations could be considered natural to the image classification model.

Some of the most intriguing relationships we observed include the functional similarity between representations corresponding to the classes “geyser”, “steam locomotive”, “volcano”, and “cannon”, owing to the shared visual feature of smoke fumes. Representations for the classes “menu”, “website”, “envelope”, “book jacket”, and “packet” exhibit a high degree of functional relationship due to the shared textual feature, which could be considered a natural characteristic for such classes. Furthermore, representations for “crossword puzzle” and “snow leopard” share similar behavior in detecting black and white grid patterns (illustrated in Fig. 2), and “waffle iron” and “manhole cover” representations display a high degree of similarity due to their ability to detect specific grid patterns. Finally, “mailbox” and “birdhouse” logits demonstrate a strong degree of co-activation on each other’s n-AMS, resulting from the visual similarity of the objects. Several of the described relationships are illustrated in Fig. 5 by the pairwise RAVs between their representations, together with their n-AMS.

Co-occurring and Mislabeled Objects. This category refers to objects that frequently co-occur, which allows the network to learn associations between two objects because the classification problem is constrained to assign one class per image. Examples of such relationships can be found in the representations of “cup”, “espresso maker”, “coffeepot”, and “teapot”, all reported to frequently co-occur in each other’s image backgrounds as secondary objects. Intriguing examples include the high similarity of “plate” and “Dungeness crab” representations, as the n-AMS for the crab representation illustrate an already prepared crab on a plate, “cardoon” (flower) and “bee” representations, and “hay” and “harvester”. Additionally, we detected that functional similarity can be caused by misattribution of labels, such as between “tiger” and “tiger cat”, where we were able to determine that the latter class also contains images of tigers, even though the class description states that it is a specific breed of cat exhibiting textural patterns of black stripes, similar to tigers.

Object Backgrounds. FSCA analysis of the functionally most similar representations yielded several groups of representations that exhibit functional similarity due to the shared background only. Such a conclusion could be derived from the fact that representations are significantly co-activated with each other’s n-AMS, while the only shared feature among them is the background.

  • Snow. We can consider the snow background among the most interesting examples of such spurious correlations. This feature is shared between representations such as “snowmobile”, “ski”, and “shovel”.

  • Mountain. The commonality of the mountain background is observed across representations including “alp”, “marmot”, “mountain bike”, and “mountain tent”, with the latter two possessing descriptive references to the background within their respective names.

  • Underwater. The underwater background is shared between representations such as “snorkel”, “coral reef”, “scuba diver”, and “stingray”, which collectively share a bluish shade and describe natural marine environments.

  • Savannah. The shared background of the savannah, characterized by golden or green grasslands, is observed across representations of animal species such as “zebra”, “impala”, “gazelle”, “prairie chicken”, and “bustard”.

  • Water. The water background encompasses the view of the water surface, as well as the presence of animals or objects above the water, including “pier”, “speedboat”, “seashore”, and “killer whale”.

Artifacts. Among the reported pairs of representations yielded by FSCA, we were able to detect representations “safe”, “scale”, “apron”, “backpack”, “carton”, and “swab” that exhibited high functional similarities caused by the presence of Chinese watermarks in their n-AMS. This result is consistent with previous works that reported these classes as having a strong ability to differentiate between watermarked and non-watermarked images [20].

By employing FSCA, we were able to identify a new, previously unknown spurious correlation, manifesting in the dependence of several classes on the presence of young children in the image. A high functional similarity was reported between the “diaper” class, which naturally contains many images of young children, and several other representations, including the “rocking chair” representation. Inspection of the training dataset revealed a significant number of images of children (without diapers) sitting in a “rocking chair”. Since the ImageNet dataset does not have a specific class dedicated to children, the presence of a child represents a latent factor that accounts for the functional similarity of such classes.

Fig. 6. Discovery of a previously unknown spurious correlation between “diaper” and “rocking chair”. The high FSCA contrast score (0.35) indicates a high discrepancy between functional and semantic distances. Investigation of the training dataset revealed that such behavior could be explained by the dependency of both representations on the presence of a child in the image.

Fig. 7. Differences in the model’s predictions before and after adding an image of a child to the image.

Figure 6 illustrates the Extreme-Activation distance, alongside the n-AMS for the representations “diaper” and “rocking chair”, and several examples from the ImageNet-2012 training dataset from the class “rocking chair”. This spurious correlation was unexpected and could be considered artifactual for this class. The fact that “rocking chair” employs the presence of children as additional evidence for prediction is demonstrated in Fig. 7, where the model’s prediction shifts towards the “rocking chair” class after adding an image of a child on top of the image of the chair. Furthermore, FSCA reported the following representations to have high functional similarity with the “diaper” representation: “crib”, “bassinet”Footnote 2, “cradle”, “hamper”, “band-aid”, “bib”Footnote 3, and “bath towel”.

Another intriguing and previously unknown spurious correlation that we identified involves the dependence of several classes on images of fishermen. This correlation was observed between the “reel” class and several fish classes, namely “coho” and “barracouta”. Figure 8 provides evidence that the relationship between the “reel” and “coho” representations is primarily based on the presence of fishermen, often paired with a specific water background. This is further underscored by the model’s prediction given an image of a fisherman: the model confidently assigns a fish label to the image, despite the absence of any fish in the picture. Although this correlation bears similarity to the previously reported correlation between the “tench” class and the presence of human fingers [12], our findings show that representations like “coho” and “barracouta” display a broader dependency on the presence of a fisherman within the image. This is evidenced even in instances where human fingers are not visible in the image, as exemplified by the right-hand image in Fig. 8.

Fig. 8. Illustration of the spurious correlation between “reel” and “coho” representations, which appears to emerge due to the common latent feature of the presence of fishermen in the images. Our investigation revealed that a significant portion of the training dataset consists of images featuring fishermen. This relationship consequently leads to the possibility of the network misclassifying images of fishermen as the “fish” category.

Fig. 9. Distribution of identified causes for the correlations among the top 1000 pairs of representations with the highest reported function-semantic contrast (FSC) scores.

Summary. Our examination of the top 1000 pairs of representations, as ranked by function-semantic contrast scores, suggests that around half of the detected correlations might be explained as “unintended” correlations. These correlations can be linked to the frequent co-occurrence of objects (32%), dependencies on shared backgrounds (12.3%), or a shared unnatural factor (2.6%), as visualized in Fig. 9. Nevertheless, we recognize that such categorization might oversimplify the actual interconnections between representations. It is uncommon for a single specific factor to account for the functional similarities observed between neural representations.

4.3 Better Models Tend to Have Fewer Associations

The analysis of the DenseNet-161 model surfaced a variety of correlations, including those that might be deemed natural as well as those potentially regarded as undesired or even harmful. Subsequently, we were motivated to examine whether higher-performing models exhibited fewer correlations among their output representations. For this investigation, we gathered 78 different ImageNet classification models from the Torchvision library [47], with the weight parameters set to “IMAGENET1K_V1”. For each model, we computed the pairwise Extreme-Activation distance between output representations using the ImageNet-2012 validation dataset, leveraging parameters analogous to those in our preceding experiment. This process yielded 78 distance matrices \(F_i \in \mathbb {R}^{1000 \times 1000}, i \in [1, 78]\). To quantify the degree of correlation between output representations within models, we calculated the Frobenius norm of the difference between the Extreme-Activation distance matrix \(F_i\) and a matrix Q for each of the 78 models:

$$\begin{aligned} Q = \frac{1}{\sqrt{2}}\left( \mathbbm {1} - \mathbb {I}\right) , \end{aligned}$$

where \(\mathbbm {1}\) is a \(k \times k\) matrix with all elements equal to 1, \(\mathbb {I}\) is the identity matrix, and \(k = 1000\). Matrix Q is the distance matrix between representations in the ideal scenario of total disentanglement. Hence, the norm of the difference between \(F_i\) and Q serves as an indicator of the interconnectivity of the representations.
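As a small worked sketch of this measure, the code below builds \(Q\) for a given \(k\) and evaluates the Frobenius norm of \(F - Q\) on nested lists. The function names and the toy matrices are assumptions for illustration; with NumPy arrays, `numpy.linalg.norm(F - Q)` computes the same quantity.

```python
from math import sqrt


def ideal_distance_matrix(k):
    """Q = (1/sqrt(2)) * (ones - identity): zero diagonal, 1/sqrt(2) elsewhere."""
    off_diagonal = 1 / sqrt(2)
    return [[0.0 if i == j else off_diagonal for j in range(k)] for i in range(k)]


def frobenius_gap(F, Q):
    """Frobenius norm of the element-wise difference F - Q."""
    return sqrt(sum((f - q) ** 2
                    for row_f, row_q in zip(F, Q)
                    for f, q in zip(row_f, row_q)))


k = 3
Q = ideal_distance_matrix(k)
# A fully disentangled model reproduces Q exactly, so the gap is zero.
gap_ideal = frobenius_gap(Q, Q)
# A degenerate model whose representations are all at distance 0 from each other.
zeros = [[0.0] * k for _ in range(k)]
gap_collapsed = frobenius_gap(zeros, Q)
```

For the collapsed case each of the \(k(k-1)\) off-diagonal entries contributes \(1/2\) to the squared norm, giving \(\sqrt{k(k-1)/2}\), which equals \(\sqrt{3}\) for \(k = 3\).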

Figure 10 illustrates the relationship between the extent to which the models’ representations are correlated (top graph, y-axis) and their Top-5 performance on ImageNet (top graph, x-axis). Our observations indicate that models delivering superior performance achieve a lower norm, suggesting that enhanced performance aligns with better disentanglement and reduced correlation among output layer representations. The bottom graph in the same figure provides a visual representation of this, displaying distance matrices calculated across various networks.

Fig. 10. Better performing models achieve higher disentanglement of representations. The top figure illustrates the relationship between ImageNet top-5 validation accuracy (x-axis) and the Frobenius norm of the difference between the EA distance matrix and Q (y-axis) across the 78 models from the Torchvision library.

5 Discussion and Conclusion

In the present work, we introduce a new technique, Function-Semantic Contrast Analysis (FSCA), designed to uncover spurious correlations between representations, when target concepts are known. FSCA reduces human supervision by systematically scoring and ranking representation pairs based on the function-semantic contrast. We have demonstrated the feasibility of our approach by uncovering several potentially unrecognized class correlations as well as rediscovering known correlations.

The primary limitation of our method is its reliance on a semantic metric that, despite broadly reflecting visual similarity between objects, is not entirely accurate in assessing visual similarity between two concepts. In future work, we aim to research alternative semantic metrics, including expert-defined ones, that take visual similarity into account. Another challenge is the undefined nature of spurious correlations, which necessitates human oversight to discern whether a correlation is harmful. Nevertheless, our study found that analyzing 1000 representation pairs from the DenseNet-161 model required only around 3 human hours and uncovered previously undetected artifacts; hence, FSCA significantly reduces the demand for human supervision.

While we have demonstrated the applicability of FSCA on ImageNet-trained networks, this approach is scalable in terms of its application to other image classification problems. Since WordNet encompasses a broad range of synsets, it is often quite simple to connect classes and concepts, as shown in the example of CIFAR-100 [17, 41]. Moreover, semantic distance can be measured using other knowledge-based datasets or by relying on expert assessments.
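To illustrate how a knowledge-based semantic distance could be obtained for a custom label set, the sketch below computes a Wu-Palmer-style distance, \(1 - 2\,\text{depth}(\text{lcs}) / (\text{depth}(a) + \text{depth}(b))\), on a small hand-written taxonomy. The taxonomy, the function names, and the convention that the root has depth 1 are assumptions for the example. WordNet itself allows multiple hypernyms per synset, which this single-parent sketch does not handle.

```python
def path_to_root(node, parent):
    """Return the list [node, ..., root] following child -> parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def depth(node, parent):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node, parent))


def wu_palmer_distance(a, b, parent):
    """One minus the Wu-Palmer similarity of a and b in a single-parent taxonomy."""
    ancestors_a = set(path_to_root(a, parent))
    # The first ancestor of b that is also an ancestor of a is the deepest shared one.
    lcs = next(n for n in path_to_root(b, parent) if n in ancestors_a)
    return 1 - 2 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))


# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
toy_parent = {
    "animal": "entity",
    "artifact": "entity",
    "cat": "animal",
    "tiger": "animal",
    "chair": "artifact",
}
d_cat_tiger = wu_palmer_distance("cat", "tiger", toy_parent)
d_cat_chair = wu_palmer_distance("cat", "chair", toy_parent)
```

In this toy taxonomy “cat” and “tiger” share the ancestor “animal” and are therefore closer than “cat” and “chair”, whose only common ancestor is the root.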

As Deep Learning approaches are becoming more popular in various disciplines, it becomes increasingly imperative to audit these models for potential biases, ensuring the cultivation of fair and responsible machine learning frameworks. Our presented FSCA method offers a scalable solution for practitioners seeking to explain the often opaque and enigmatic behavior of these learning machines. By doing so, we contribute to a more transparent and ethically-grounded understanding of complex deep learning systems, promoting responsible and trustworthy AI applications across various domains.