1 Introduction

In recent years, Artificial Intelligence (AI) has made significant progress due to the availability of large datasets, powerful computing devices, and the development of sophisticated algorithms (Haenlein & Kaplan, 2019). Machine learning (ML) models are commonly used in AI systems because of their success in various fields such as image processing, time series analysis, and audio signal processing (Sarker, 2021; Purwins et al., 2019). However, traditional ML systems have a major limitation in that they rely on large-scale datasets, while real-world applications often have constraints that result in limited data availability (Ienca & Vayena, 2020; Jiang et al., 2014; Fries et al., 2017). Technical issues may limit the collection of training data, while ethical or privacy concerns may restrict data access (Ienca & Vayena, 2020). Furthermore, traditional ML systems struggle to generalize from few samples, and their performance is often better for classes with more training samples and worse for classes with fewer samples (Rahman et al., 2018). As a result, these systems are limited in their ability to expand their knowledge beyond the scope of the data they were trained on. In contrast, humans are able to quickly generalize from prior knowledge. For example, a child presented with only a few pictures of a person or animal they have never seen before can still identify the correct individual among a reasonable number of pictures portraying different subjects or animals.

To overcome these limitations, recent studies have proposed the use of few-shot learning frameworks, where an ML model must learn to classify new classes with only a limited number of labeled samples per class (Wang, 2020). Few-shot learning methods typically consider a C-way k-shot classification task, where C represents the number of classes the model must recognize, while k is the number of labeled samples per class. The set of labeled samples is known as the support set, i.e., an auxiliary set of data that serves as guidance for the classifier. The goal of few-shot learning is to predict the class of an instance when only a small number of examples of that specific class are available. In this paper, our focus is on classifying an unseen sample using just one labeled sample of that specific class during inference. This means that the model has had no exposure to the test samples during training, as we have kept the training, validation, and test sets separate. We refer to this task as C-way one-shot learning, signifying that the support set comprises a single sample for each of the C classes. Consequently, the model's objective is to correctly classify an unseen query sample given only one additional sample of its class in the support set during inference.
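To make this setup concrete, the following minimal Python sketch builds a C-way one-shot episode by drawing one labeled support sample per class and a query instance from one of those classes. The arrays X and y and the sampling routine are illustrative assumptions, not the exact pipeline used in our experiments.

```python
import numpy as np

def sample_one_shot_task(X, y, C=5, rng=None):
    """Build a C-way one-shot episode: one support sample per class plus a query.

    X: array of input samples, y: array of integer class labels (hypothetical data).
    Returns (query, query_label, support_samples, support_labels).
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(y), size=C, replace=False)
    support_idx = [rng.choice(np.flatnonzero(y == c)) for c in classes]
    # The query belongs to one of the C classes but is a different sample.
    query_class = rng.choice(classes)
    candidates = np.setdiff1d(np.flatnonzero(y == query_class), support_idx)
    query_idx = rng.choice(candidates)
    return X[query_idx], query_class, X[support_idx], classes
```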

Several algorithms have been proposed to address the challenge of few-shot metric learning, including Siamese Networks (Koch et al., 2015), Matching Networks (Vinyals et al., 2016), Relation Networks (Sung et al., 2018), and Prototypical Networks (Snell et al., 2017). These algorithms aim to learn a metric in an embedding space that can then be used to determine the similarity between unseen samples. Siamese Networks (SNs) (Koch et al., 2015) are composed of two or more identical encoding sub-networks that map inputs into an embedding space, where a distance function is applied to calculate the distance between the resulting embedded representations. A similarity score is then computed based on this distance in the latent space. To date, the few-shot learning paradigm has been applied to image processing (Koch et al., 2015; Qiao et al., 2018; Liu et al., 2019), time-series analysis (Iwata & Kumagai, 2020; Gupta et al., 2021), and audio classification (Vélez, 2018; Honka, 2019).

However, one of the main challenges of SNs is the lack of explainability. Indeed, it is challenging to comprehend how these architectures can correctly generalize on unseen samples. Understanding the reason why a model takes a specific decision is hugely important to developers, organizations, and the end-users upon whom such decisions fall. While end-users may prioritize understanding an outcome’s explanation over the outcome itself, developers can use explanations to identify potential issues with a model and repeat training procedures in a controlled environment. Also, stakeholders may need to comprehend the reasoning behind a model’s decisions before deploying it in real-world scenarios that may have a significant impact on people’s lives. Indeed, the “right to know” (Dimitrova, 2020) refers to an individual’s right to receive an explanation for a specific outcome produced by an algorithm. In recent years, researchers have examined the eXplainable Artificial Intelligence (XAI) topic from various perspectives (Guidotti, 2019; Adadi, 2018; Miller, 2019). A characterization of XAI techniques distinguishes between gradient-based and perturbation-based approaches (Guidotti, 2019; Adadi, 2018). While both aim to understand the contribution that each input feature has on a specific outcome, they solve the problem differently. Gradient-based approaches estimate feature contribution through forward and backward propagation in the network, while perturbation-based methods perturb the input and measure changes in the output relative to the original input. Although many XAI methods have received positive feedback from the research community, only a limited number are designed to explain Siamese Networks (SNs) and, to the best of our knowledge, none has been developed to work on different data types within the framework of C-way one-shot learning.

Fig. 1

sinex explanation on a 5-way 1-shot classification task on the Caltech-UCSD Birds 200 dataset. Top row: query sample x followed by the support set samples \(s_1\) to \(s_5\) from left to right. The SN similarity computed between each \((x, s_i)\) pair is presented above the corresponding \(s_i\) sample. Bottom row: explanation heatmaps \(h_1\) to \(h_5\) corresponding to the \(s_1\) to \(s_5\) support samples. Heatmaps display positive (red), negative (blue), and neutral (gray) contributions to the similarity score. Class labels from \(c_1\) to \(c_5\) are: Rose breasted Grosbeak, Eared Grebe, Summer Tanager, Carolina Wren and Canada Warbler. This layout is repeated for all explanations in this paper. Best viewed in colors (Color figure online)

In this work, we introduce a local data-agnostic explanation method for SNs in the setting of C-way one-shot learning. Our SIamese Networks EXplainer (sinex) aims to uncover the decisive features that enable the model to learn and generalize in a manner that resembles human cognition when making predictions on image, time series, and audio data. We aim to use our explainer to address questions such as: What factors is the model considering when it accurately identifies the class of two unseen objects? What makes a particular sample more similar to one object than to others? Why is the model misclassifying a specific object? sinex employs a perturbation approach to calculate the contribution of each feature based on a segment-weighted-average evaluation. Additionally, we present a coalition-based variant of sinex, called sinexc, which takes into account the interaction between different parts of the input. These contribution values can be visualized as heatmaps, providing an intuitive representation of the behavior of Siamese Networks. An illustrative sinex explanation of a 5-way 1-shot classification is presented in Fig. 1. In the top left corner, the query sample x to be classified is shown, followed by a support set consisting of 5 samples (labeled from \(s_1\) to \(s_5\)). Above each \(i^{th}\) support sample, the similarity value computed by the Siamese Network between the pair \((x, s_i)\), represented in the range [0, 1], is displayed. The bottom row displays the sinex contribution heatmaps for each \(i^{th}\) sample in the support set, denoted as \(h_1\) to \(h_5\). Positive contributions are highlighted in red, while negative contributions are shown in blue. Areas in grey are considered neutral with respect to the similarity score outcome. These heatmaps are min-max normalized within the same task and can be compared with one another. In this example, both \(h_1\) and \(h_3\) show the largest positively influencing segments overall.

We apply sinex to explain SNs in the context of 5-way one-shot learning on grayscale and RGB images, time-series, and audio data. Our results demonstrate the effectiveness of sinex in identifying positive and negative contributing areas to the network outcome. Furthermore, our approach uncovers limitations in SNs, including erroneous dependence on specific colors (RGB images) or pixels (grayscale images), which can result in classification errors. This paper extends the conference version (Fedele et al., 2022) in several aspects. First, we introduce a perturbation methodology for feature contribution allocation that computes the importance of specific features by excluding them instead of solely considering them. Second, we generalize the applicability of sinex to any data type by extending the handled inputs to images and time series in addition to audio data. Third, we extend the explainer’s compatibility to the widely used 3-branched Siamese Network architectures (Hoffer, 2015), which, in turn, accommodate distance-based SN approaches in addition to the similarity-based ones previously addressed. Fourth, we introduce the \(\epsilon\)-LRP explainability technique as an additional baseline in our performance assessment. Fifth, we investigate the impact of shifted support sets on both the SN performance and the explainer itself. Finally, we perform a dependence analysis of positive and negative contribution features by introducing two novel metrics.

The rest of the paper is organized as follows. In Sect. 2, we review relevant studies. Section 3 outlines the problem we address, while in Sect. 4 we illustrate our proposed solutions. The results of our experiments are presented in Sect. 5 and extensively discussed in Sect. 6. Finally, Sect. 7 summarizes our contributions and suggests future research avenues.

2 Related works

In this section, we review existing works that aim to explain the behavior of Siamese Networks (SNs). In Zhang (2019) the authors focus on visualizing and sonifying the input patterns that activate certain neurons in the network using the activation maximization approach (Erhan et al., 2010). The authors suggest that by visualizing the patterns that activate random neurons from each layer, it is possible to get an idea of what the network considers important. However, the limited consideration of random neurons may result in an isolated effect with respect to the overall function of the SN. Additionally, it is not possible to determine whether the returned features have a positive or negative impact on the outcome. A similar approach is described in Acconcjaioco (2020), where the system is developed for one-shot learning to identify bird species. The authors visualize how audio spectrograms (Flanagan, 2013) are decomposed by each layer and consider the last one as the explanation layer. Both in Zhang (2019) and Acconcjaioco (2020), the explanation is only derived from the convolutional encoders and does not take into account the core similarity scoring layer of the SN.

In Utkin (2020), the authors use a special auto-encoder as the core of their explanation algorithm, similarly to Guidotti (2019); Looveren (2021); Naudé (2020). The encoder is trained to reconstruct the input instances of the training set using the embedded representation provided by the SN’s built-in embedder. The decoder is then trained to reconstruct the original input based on the hidden representation. Once the auto-encoder is trained, it receives a pair of inputs to explain, which are also given to the SN. The vectors produced by the SN’s encoders are then perturbed on what the authors refer to as important features. These features are chosen based on the smallest distance between the two inputs if they are semantically similar, or the largest distance if they are dissimilar. The perturbed vectors are then passed through the decoder, which maps them back to the original input space. The embedded vectors are randomly perturbed, and the mean contribution value of each feature is calculated as the difference between its value in the reconstructed input after perturbation and its value in the reconstructed input without perturbation. The proposal suffers from various weaknesses, such as a large number of parameters, the need for a large amount of data to train the auto-encoder, the required access to the training set, and the requirement of training an additional ML model. In addition, also in this case, the SN’s distance function and similarity score are bypassed by the explainer.

In Chen (2021), another post-hoc explanation approach for SNs is introduced. The authors highlight the concern that existing perturbation-based XAI methods might become overly sensitive to irrelevant perturbations when dealing with unseen instances in the support set. To address this issue, they find global, invariant salient features for individual objects using self-supervision. Then, they formulate an optimization problem to adapt the global salient features to explain a SN prediction for an input pair. The adaptation balances the conformity to the invariance and the local flexibility when comparing a query to different references. The authors design a gradient descent algorithm to solve a constrained optimization problem with KL-divergence regularization. However, Chen (2021) also does not account for the similarity scoring layer and only operates on the embedded representation of the data.

In Ye (2020), a different approach is taken, where a class-to-class SN (C2C-SN) is trained to understand both the similarities and differences between classes. The authors show how the C2C-SN can be used for explanation purposes through prototypical case finding and contrasting cases. However, the proposal of Ye (2020) differs from our work as it does not query the model on unseen classes.

A very recent work (Tummala & Suresh, 2023) introduces a novel class activation mapping, which highlights critical regions for classification. The saliency map is crafted by employing the \(l_1\) distance lambda layer as a weight vector, which is then multiplied with the final convolutional layer responsible for query sample embedding.

Finally, in Fiaidhi et al. (2022), the authors exploit SNs to identify ulcerative colitis from a few training samples, while enhancing the network’s explainability by combining it with an LSTM model that provides relevant textual captions for a specific outcome. SNs are trained on RGB images using a triplet loss (Schroff et al., 2015) and are asked to classify eight different classes. With the help of expert-provided textual captions, the authors are able to train an LSTM model on textual input and form an explanation that combines the SN prediction with a textual caption learned from a field expert. Differently from the other existing explainers for SNs, this work shows that explanations might come in different forms and that SNs can be embedded into bigger architectures instead of being treated as standalone systems.

The explanation approach presented in this paper differs from those described in Zhang (2019), Acconcjaioco (2020), and Chen (2021), as they utilize gradient-based methods that primarily focus on the encoder component of SNs. These works only examine the last convolutional layer of the network, ignoring the overall architecture of the SN. Moreover, our approach eliminates the need for training additional models and focuses on assessing differences in the SN outcome through input space perturbations, unlike Utkin (2020). While sinex perturbs samples in the input space and measures differences in the SN outcome score, the approach of Utkin (2020) perturbs the latent space after the CNN encoding and generates heatmaps for perturbed samples using a pre-trained decoder. Consequently, we decided not to use it for comparison, since our goal is to evaluate the contribution of input features in the input space.

Additionally, our approach differs from Fiaidhi et al. (2022) as we aim to explain the SN as it is, without the need for training samples or external knowledge. Lastly, our explanation method can be directly applied during prediction time. Our aim is to provide a comprehensive explanation of how a SN works, with a particular emphasis on the layer that generates the final similarity or distance score. Additionally, our approach addresses the issue of explaining SNs within the context of C-way one-shot learning, an area that has not been explored in previous works.

3 Problem formulation

In few-shot learning, a dataset \(D = \langle X, y \rangle\) is composed of n input samples, \(X = \{x_1, \dots , x_n\}\), and their corresponding class labels, \(y = \{y_1, \dots , y_n\}\). The class labels indicate which \(y_i\) class each \(x_i\) input sample belongs to, where \(y_i \in [0, \dots , L-1]\). L is the total number of classes, which in few-shot learning is higher than in traditional multi-class problems. In our experiments, L ranges from a minimum of 50 to a maximum of 60. In C-way one-shot learning, the support set S contains C input samples, each belonging to a different class and only appearing once in S. Typically, C is much smaller than L. In our experiments, C is set to 5. In this framework, a SN is a deep learning model f that takes as input a support (or reference) set \(S = \langle \{s_1, \dots , s_C\}, \{y_1, \dots , y_C\} \rangle\) and a query instance x with an unknown label, and predicts the class label of x by comparing it to each input sample \(s_i\) in the support set. Each \(y_i\) present in the support set represents the class label of the sample \(s_i\). This comparison can be done using either a similarity function, sim (Koch et al., 2015), or a distance function, dis (Hoffer, 2015), selecting the highest similarity or the lowest distance score, i.e.,

$$\begin{aligned} y_i = f(x, S) = \underset{\forall s_i, y_i \in S}{\arg \max }\ \; \text {sim}(x, s_i) \;\;\;\;\; y_i = f(x, S) = \underset{\forall s_i, y_i \in S}{\arg \min }\ \; \text {dis}(x, s_i). \end{aligned}$$

In this context, a correct classification occurs when the class \(y_i\) of the support sample \(s_i\), which yields the highest/lowest similarity/distance score, matches the class of the query input x. Otherwise, an incorrect classification occurs.
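A minimal sketch of this decision rule is shown below; the `score_fn` callable stands in for the trained SN's similarity or distance scoring, which is an assumption of this illustration rather than part of the formulation above.

```python
import numpy as np

def classify_one_shot(x, support_samples, support_labels, score_fn, use_distance=False):
    """Assign to x the label of the support sample with the highest similarity
    (or the lowest distance), following the rule above. `score_fn(x, s_i)` stands
    in for the trained SN's similarity or distance scoring."""
    scores = np.array([score_fn(x, s_i) for s_i in support_samples])
    best = scores.argmin() if use_distance else scores.argmax()
    return support_labels[best], scores
```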

Fig. 2

Left: RGB image of Rose breasted Grosbeak bird. Center: grayscale image representing a time-series formed by taking the height profile of a written word. Right: log-mel spectrogram of an audio recording featuring a speaker saying “zero” out loud. Darker areas represent lower dB values, i.e., silence, while lighter pixels indicate sounds that can be heard by humans (Color figure online)

In our setting, we consider samples in the dataset X including images, time-series, or audio inputs. An image is typically represented as a matrix in \(\mathbb {R}^{p \times q \times z}\), where p and q are the width and height of the image and z is the number of color channels (usually 3 for RGB and 1 for grayscale images). Each element \(x_{i,j,k,w}\) represents the intensity value at width j, height k and color channel w for the \(i^{th}\) image. The intensity is usually expressed as a value within the range of [0, 255]. A time series is an ordered set of p real-valued observations (or time steps) in \(\mathbb {R}^{p}\). We represent a time series as a grayscale image by plotting it on a Cartesian coordinate system as \(x_{i,j,k}\), where j represents the position on the temporal axis and k represents the measurement value at that timestamp for the \(i^{th}\) time series. Finally, an audio track can be represented as a spectrogram (Acconcjaioco, 2020; Flanagan, 2013), which consists of a matrix in \(\mathbb {R}^{p \times q}\), where q is the length of the audio track and p is the number of observed frequencies. Each element of the matrix \(x_{i,j,k}\) represents the intensity value at time j of frequency k for the \(i^{th}\) audio track. The intensity is usually expressed in decibels (dB). Figure 2 shows an RGB image, a grayscale representation of a time-series, and an audio spectrogram.
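As an illustration of these representations, the following sketch loads the three data types into the matrices described above. The use of PIL, matplotlib, and librosa, as well as the file names, are assumptions for illustration purposes; the paper does not prescribe specific libraries.

```python
import numpy as np
import librosa
from PIL import Image
import matplotlib.pyplot as plt

# RGB image: matrix in R^{p x q x 3}, intensities in [0, 255] (hypothetical file).
image = np.asarray(Image.open("bird.jpg"))

# Time series: p real-valued observations, rendered as a grayscale image by plotting it.
ts = np.loadtxt("word_height_profile.txt")              # hypothetical file
fig, ax = plt.subplots()
ax.plot(ts, color="black"); ax.axis("off")
fig.canvas.draw()
grayscale = np.asarray(fig.canvas.buffer_rgba())[..., :3].mean(axis=-1)
plt.close(fig)

# Audio: waveform -> log-mel spectrogram in dB, a matrix in R^{p x q} (hypothetical file).
waveform, sr = librosa.load("zero.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
spectrogram_db = librosa.power_to_db(mel, ref=np.max)   # values roughly in [-80, 0] dB
```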

The main goal of this paper is to identify what a pre-trained SN f considers among the characteristics of the records in the support set S when assigning class \(y_i\) to a query instance x. To do so, we propose a local data-agnostic post-hoc explanation method g, which takes as input f, a set of support samples S, and a query instance x and returns an explanation E. More formally, our objective is to define a function g such that \(E = g(f, x, S)\). The explanation E is formalized as a set of heatmaps \(E = \{h_1, \dots , h_C \}\), where each \(h_i\) is the heatmap of the support sample \(s_i, y_i \in S\) and represents the importance/saliency for each feature of its matrix value. For RGB images, the value \(h_{i,j,k,w}\) indicate the importance of the pixel at width j and height k for the \(w^{th}\) channel color. The same definition applies for grayscale images, regardless of whether they were generated from a time-series input or not. For audio tracks, the value \(h_{i,j,k}\) indicates the importance/saliency of the \(k^{th}\) frequency at time j for the \(i^{th}\) support set spectrogram.

4 Siamese networks explainer

In this section, we describe sinex, a local data-agnostic SIamese Networks EXplainer for C-way one-shot learning. sinex is our proposal for implementing the function g to explain a SN f w.r.t. a query instance x and a support set S, as described in the previous section. In particular, sinex unveils the output of the SN by generating an explanation based on the final layers that measure either the similarity or the distance score. We design sinex as a perturbation-based method that measures the prediction of similarity in 2-branched SNs (2SN) or distance in 3-branched SNs (3SN) after various input perturbations and repeated queries of the SN. In our C-way one-shot setting where the query input x is classified based on the instances in the support set S, we choose to segment and perturb the instances in S and observe how the outcome estimation between x and \(s_i\) changes when parts of \(s_i\) are hidden from the network.

We introduce two perturbation approaches, and use sinex\(_\omega\) to distinguish them. The first approach, with \(\omega {=} \mathtt {\lnot }\) and named sinex\(_\lnot\), involves measuring the contribution of a specific segment by keeping it active while “silencing” all the other segments, i.e., replacing them with non-informative values. The second approach, with \(\omega {=} \texttt{1}\) and named sinex\(_\texttt{1}\), measures the contribution of a specific segment by “silencing” it, while keeping all other segments active. Examples of both perturbation procedures are illustrated in Fig. 3. In the following, we first elucidate the segmentation process that precedes the perturbation runs. Then, we present sinex and sinexc, a coalition-based variant of sinex that extends the perturbation procedure to multiple segments rather than single ones. After that, we delve into the explanations generated by sinex and sinexc, providing details on how to interpret and compare them. Finally, we describe how 2-branched SNs (2SNs) differ from 3-branched SNs (3SNs) and explore how our explainers can provide support for both.
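The two perturbation modes can be summarized with the following sketch, where `segments` is an integer label map over the sample (as produced by the segmentation step described next) and `c` is the data-type-dependent replacing value; the names and the string encoding of \(\omega\) are illustrative.

```python
import numpy as np

def perturb(sample, segments, target_id, omega, c):
    """Perturb one segment of a support sample, following the two sinex modes.

    segments: integer label map with the same spatial shape as `sample`.
    omega="not" -> keep only `target_id` active and silence everything else.
    omega="1"   -> silence only `target_id` and keep the rest active.
    c: non-informative replacing value (data-type dependent)."""
    z = sample.copy()
    mask = segments == target_id
    if omega == "not":
        z[~mask] = c
    else:
        z[mask] = c
    return z
```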

Fig. 3

First column: support set sample on the top row, and its segmentation into 56 regions in the bottom row. First row: \(\omega {=} \lnot\) perturbation on the 24th and 35th segments respectively. Second row: \(\omega {=} \texttt{1}\) perturbation on the same segments. Best viewed in colors (Color figure online)

4.1 SINEX segmentation approach

Input matrices, whether representing images, time series, or audio spectrograms, offer various avenues for perturbation. A common approach is the window-occlusion-based technique (Zeiler, 2014). However, this method faces several limitations. Fixed-size occlusion windows could yield inaccurate results, as the contribution of the features can vary significantly depending on the window size. Additionally, determining the correct window size can be difficult, as the same size may produce different outcomes even for different instances of the same class. Furthermore, dividing the input into fixed-length windows assumes that the two axes are independent, which is a rare assumption for inputs like images and spectrograms. For example, relying solely on time segmentation for a spectrogram means assuming that every sound event starts and ends at the same moment in all recordings of a given class, which is unrealistic in real-world scenarios.

Therefore, in order to provide a consistent approach for any type of data, aligning with the widely recognized data-agnostic explainer LIME (Ribeiro, 2016), we suggest segmenting each input using techniques commonly employed for image inputs. Examples of these techniques include the Felzenszwalb approach (Felzenszwalb, 2004), which uses minimum spanning tree-based clustering to segment an image, and SLIC (Achanta, 2012), which segments images through k-Means clustering. Additional examples include an extension of SLIC called MaskSLIC (Irving, 2016), that generates superpixels in specific regions of interest, Quickshift (Vedaldi & Soatto, 2008), a local mode-seeking algorithm based on a kernelized mean-shift approximation, as well as Watershed (Beucher, 1992), which identifies watershed basins in images flooded from user-given markers.

sinex does not use a default segmentation algorithm, hereafter referred to as \(seg\), as its selection depends on the data type and the context. However, it is flexible as it technically supports all the algorithms described earlier, allowing users to choose based on their specific needs. For a detailed discussion on the suggested procedure to choose the appropriate \(seg\), please refer to Appendix C.
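For reference, the sketch below shows how such segment label maps can be obtained with scikit-image; the chosen algorithm and parameters are illustrative (the value 56 matches the example in Fig. 3) and should be tuned per dataset as discussed in Appendix C.

```python
from skimage.segmentation import felzenszwalb, quickshift, slic

# `sample` is assumed to be a (p, q) grayscale or (p, q, 3) RGB array.
segments = slic(sample, n_segments=56, compactness=10,
                channel_axis=-1 if sample.ndim == 3 else None)  # k-Means-based superpixels
# Alternatives, depending on data type and context (see Appendix C):
# segments = felzenszwalb(sample, scale=100)     # minimum-spanning-tree clustering
# segments = quickshift(sample, kernel_size=3)   # kernelized mean-shift approximation (RGB only)
```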

Algorithm 1

sinex(f, x, S)

4.2 SINEX basic approach

In the following, the term “similarity” refers to the outcome of a 2SN, meaning that the function f to explain is a 2SN. Conversely, “distance” is used when f is a 3SN. In both the pseudo-code and the subsequent description, “similarity/distance” is used to generalize, as the supported f can be either a 2SN or a 3SN. Details on the support for these different architectures will be provided later in this section. The pseudo-code of the basic version of sinex is reported in Algorithm 1 and detailed as follows. For each sample in the support set S (lines 2–11), sinex calculates the initial similarity/distance v returned by the application of the SN f between the query input x and the support sample \(s_i\) (line 3). Then, it segments the support sample \(s_i\) using the \(seg\) segmentation algorithm (line 4), resulting in a set of segments R with at least \(\gamma\) segments. For each segment \(r_j\), its contribution is stored in the saliency map \(h_i\), which represents the importance of different areas in the support sample \(s_i\), and is computed as follows. The saliency map is initially set to a matrix with all values equal to 0, with the same dimensionality as the support set sample (line 5). The support set sample \(s_i\) is then perturbed (line 7) based on the selected perturbation approach \(\omega\), which can assume two values. If the \(\omega {=} \mathtt {\lnot }\) perturbation approach is used, everything except the region \(r_j\) is obscured. In this case the notation of line 7 becomes \(s_i[\lnot r_j] \leftarrow c\), indicating that the support sample \(s_i\) assumes value c in all regions except \(r_j\), i.e., we obtain the same algorithm described in Fedele et al. (2022). Otherwise, if \(\omega {=} \texttt{1}\), only the region \(r_j\) is obscured. In this case, the notation at line 7 becomes \(s_i[r_j] \leftarrow c\), indicating that the support sample \(s_i\) assumes the value c only in the \(r_j\) region. The value c in Algorithm 1, line 7, symbolizes the replacing value to be used when perturbing, which varies according to the sample data type. For RGB images, any color might serve as a replacing value, and it should be carefully chosen so as not to create out-of-distribution samples. In our study we use gray, which is commonly used for c in this case. Grayscale images are instead limited to values on the grayscale (from white to black). For this kind of input, we use white as the replacing color for black pixels. When dealing with audio spectrograms, a reasonable replacing value for c might be \(-80\), as it is in many cases the smallest value on the dB scale. After perturbation, the new similarity/distance score u is calculated by applying the SN f to the query sample x and the perturbed support set sample \(z_i\) (line 8), and the difference \(\delta\) between the starting similarity/distance score v and the new score u is determined (line 9). Finally, the difference is weighted according to the size of the current segment \(|r_j|\) and updated in the corresponding saliency map \(h_i\) (line 10).
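A minimal Python sketch of Algorithm 1 follows. It assumes f is a callable returning the SN similarity/distance for a (query, support) pair and seg returns an integer label map; dividing \(\delta\) by the segment size is one plausible reading of the size weighting at line 10, not necessarily the exact implementation.

```python
import numpy as np

def sinex(f, x, S, seg, omega="1", c=0.0):
    """Minimal sketch of Algorithm 1. f(x, s) returns the SN similarity/distance,
    S is the list of support samples, and seg(s) returns an integer segment label map."""
    explanation = []
    for s_i in S:                                  # lines 2-11
        v = f(x, s_i)                              # initial score (line 3)
        segments = seg(s_i)                        # segmentation (line 4)
        h_i = np.zeros_like(s_i, dtype=float)      # empty saliency map (line 5)
        for r_j in np.unique(segments):            # one perturbation per segment
            mask = segments == r_j
            z_i = s_i.copy()
            if omega == "not":
                z_i[~mask] = c                     # silence everything except r_j
            else:
                z_i[mask] = c                      # silence only r_j
            u = f(x, z_i)                          # perturbed score (line 8)
            delta = v - u                          # score difference (line 9)
            h_i[mask] = delta / mask.sum()         # size-weighted contribution (line 10)
        explanation.append(h_i)
    return explanation
```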

Algorithm 2

sinexc(f, x, S)

4.3 SINEX with coalitions

The basic version of sinex may suffer from well-known problems associated with perturbation-based methods. First, perturbing instances may generate out-of-distribution (OOD) data points, which may not guarantee the validity of the similarity/distance measures. To address this issue, one solution is to retrain the model on a dataset that includes the perturbed data points, but this requires additional time and resources. Second, measuring the prediction changes of individual segment perturbations may help understand their contribution to the final outcome, but it may ignore the interaction between the segment and the other parts of the input (isolated effect). To tackle these challenges, we take into account the few-shot learning context and the fact that we evaluate the model on new classes, which makes the similarity networks robust to OOD inputs, since unseen samples might be considered OOD themselves. However, there is still no guarantee that the network f is robust when comparing a query sample x to a perturbed version of the support set sample \(s_i\), irrespective of whether x and \(s_i\) belong to known or never-seen-before classes. Hence, our proposal to mitigate these limitations is to evaluate the contribution of a specific segment to the final outcome by considering its weighted-average value. Drawing inspiration from Lundberg (2017), we aim for this value to take into account the interplay between the segment under analysis and the remaining ones. However, differently from SHAP, our context does not allow us to compute “baseline values” based on the training set. Nevertheless, we can adopt an approach similar to LIME (Ribeiro, 2016), where we perturb not only a single segment but also other randomly selected super-pixels in each step. We aim for our explainer not only to assess the impact of a specific segment on the final outcome, but also to consider how its interaction with other parts of the sample influences the final similarity/distance score. To do this, we introduce two parameters, \(\alpha\) and \(\beta\), which control the number of per-segment coalitions to generate and the number of additional segments that must remain active or disabled in each coalition, depending on which \(\omega\) perturbation procedure is selected. It is important to emphasize that the selection of the additional segments to target is random. Hence, we extend sinex to include coalitions in sinexc, as outlined in Algorithm 2 and detailed in the following.

Similar to the basic sinex version, sinexc calculates the initial similarity/distance v by applying the SN f between the query input x and the support sample \(s_i\) (line 3) for each sample in the support set S (lines 2–14). Then, the support sample \(s_i\) is segmented using the \(seg\) segmentation algorithm (line 4), resulting in a set of segments R with at least \(\gamma\) segments. The contribution of each segment \(r_j\) is stored in the saliency map \(h_i\), initially set to a matrix with all values equal to 0, matching the dimensionality of the support set sample (line 5). Unlike sinex, sinexc now generates additional \(\alpha\) coalitions with \(\beta\) \(\omega\)-active segments for each \(r_j\) segment (line 7). Indeed, each coalition specifies \(\beta\) random segments, different from \(r_j\), that will either remain active (\(\omega {=} \mathtt {\lnot }\)) or non-active (\(\omega {=} \texttt{1}\)) along with \(r_j\) during the following perturbation procedure. For each coalition \(\pi _k\) (lines 9–11, Algorithm 2), we perturb the support sample \(s_i\) based on the selected perturbation approach (line 10). If the \(\omega {=} \mathtt {\lnot }\) perturbation approach is selected, line 10 of Algorithm 2 becomes \(z_i \leftarrow (s_i[\{ \lnot r_j\} + \lnot \pi _k] \leftarrow c)\), indicating that everything except \(r_j\) and the additional set of segments indicated by \(\pi _k\) is obscured. Otherwise, if \(\omega {=} \texttt{1}\), line 10 becomes \(z_i \leftarrow (s_i[\{r_j\} + \pi _k] \leftarrow c)\), indicating that only \(r_j\) and the additional set of segments indicated by \(\pi _k\) are obscured. Finally, the sum of the similarity scores obtained by querying the model is stored in \(\bar{u}\). We then compute the segment contribution \(\delta\) (line 12, Algorithm 2) as the difference between the original similarity score v and the mean similarity value \(\bar{u}/ |\Pi |\) obtained from the \(|\Pi |\) coalitions. Lastly, we weight the segment-average prediction value according to the size of the segment under analysis (line 13, Algorithm 2). This process is iterated by pairing the query sample x with every sample of the support set S (lines 2–14, Algorithm 2), keeping x fixed and applying the coalition-based methodology to each element \(s_i\) of the support set S.
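The coalition-based variant can be sketched analogously; again, f, seg, and the size weighting are assumptions of this illustration, and the random coalition sampling follows the description above.

```python
import numpy as np

def sinexc(f, x, S, seg, alpha=200, beta=2, omega="1", c=0.0, rng=None):
    """Minimal sketch of Algorithm 2, the coalition-based variant of sinex."""
    rng = rng or np.random.default_rng()
    explanation = []
    for s_i in S:                                      # lines 2-14
        v = f(x, s_i)                                  # initial score (line 3)
        segments = seg(s_i)                            # segmentation (line 4)
        h_i = np.zeros_like(s_i, dtype=float)          # empty saliency map (line 5)
        ids = np.unique(segments)
        for r_j in ids:
            others = ids[ids != r_j]
            u_bar = 0.0
            for _ in range(alpha):                     # alpha random coalitions (line 7)
                pi_k = rng.choice(others, size=min(beta, others.size), replace=False)
                mask = np.isin(segments, np.append(pi_k, r_j))
                z_i = s_i.copy()
                if omega == "not":
                    z_i[~mask] = c                     # silence everything outside the coalition
                else:
                    z_i[mask] = c                      # silence r_j and the coalition segments
                u_bar += f(x, z_i)                     # accumulate perturbed scores (line 11)
            delta = v - u_bar / alpha                  # mean over the coalitions (line 12)
            region = segments == r_j
            h_i[region] = delta / region.sum()         # size-weighted contribution (line 13)
        explanation.append(h_i)
    return explanation
```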

Fig. 4

sinex explanation on a 5-way 1-shot classification task on a hand-drawn characters dataset. Top row: query sample x followed by the support set samples \(s_1\) to \(s_5\) from left to right. Bottom row: explanation heatmaps \(h_1\) to \(h_5\) corresponding to the \(s_1\) to \(s_5\) support samples. Samples belonging to the Latin (\(s_1\)), Oriya (\(s_4\)), and Hebrew (\(s_5\)) classes exhibit visual similarities. A segment in the top left corner of \(s_5\) is identified with a negative impact, as \(s_5\) would be more similar to the query x in its absence. Best viewed in colors (Color figure online)

4.4 SINEX explanation

Figure 4 illustrates an example of a sinex/sinexc explanation E that contains the saliency maps \(\{h_1, \dots , h_C\}\) (bottom row in Fig. 4) for each support set sample \(\{s_1, \dots , s_C\}\) (top row in Fig. 4). Each \(h_i\) holds the contribution value, whether positive or negative, for each segment of the support sample \(s_i\). These contribution values are normalized for comparability across all \(h_i \in E\) within each C-way one-shot task. In the case of similarity-based approaches, the scale is normalized in \([-N, +P]\), where N is the maximum magnitude among the negative contributions and P is the maximum among the positive contributions. Vice versa, the min-max scale for distance-based approaches is in \([-P, +N]\). This normalization process is intentionally tailored to each specific C-way one-shot task, enabling meaningful heatmap comparisons within the confines of that task. Thus, it allows only relative comparisons across different explanations, since each one is generated on a distinct query sample x and support set S.

A color-map must be selected to visualize E such that contribution values close to 0 appear as non-influential to the final prediction. We adopt a blue-to-red color-map where negative contributing segments range from dark blue (stronger influence) to light blue (lower influence) and positive contributions range from light red (lower influence) to dark red (stronger influence). Non-influential pixels are colored in gray. To respect the definitions of positive and negative colors, an inverse color-map is applied for the distance-based approach.

By combining this dual min-max scale for normalization with the inverse color-map for visualization, we ensure that the cognitive workload of the end-user viewing the explanation’s heatmaps is oriented towards a single semantic meaning of positively or negatively influencing segments. Thus, regardless of whether the approach is based on similarity or distance, sinex outputs the same visual result.
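A possible implementation of this joint normalization and diverging color mapping is sketched below with matplotlib; the choice of the coolwarm colormap and the sign flip used for distance-based scores (equivalent to inverting the color-map) are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

def show_explanation(heatmaps, use_distance=False):
    """Jointly normalize the heatmaps of one task and display them with a
    diverging blue-to-red colormap centered on zero contribution."""
    if use_distance:
        heatmaps = [-h for h in heatmaps]          # flip sign so red always helps the match
    stacked = np.concatenate([h.ravel() for h in heatmaps])
    norm = TwoSlopeNorm(vmin=min(stacked.min(), -1e-9), vcenter=0.0,
                        vmax=max(stacked.max(), 1e-9))
    fig, axes = plt.subplots(1, len(heatmaps))
    for ax, h in zip(axes, heatmaps):
        ax.imshow(h if h.ndim == 2 else h.mean(axis=-1), cmap="coolwarm", norm=norm)
        ax.axis("off")
    plt.show()
```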

4.5 SINEX support for 2SN and 3SN

The SN architecture comprises two primary components. The first one transforms the inputs into an embedding space, while the second component evaluates the proximity of the embedded inputs. The first component can be implemented either as a 2-input-branch (2SN) (Koch et al., 2015) or a 3-input-branch (3SN) (Hoffer, 2015) network, with the branches responsible for input embedding. Once the inputs are embedded, the second component assesses their proximity, producing either a similarity or a distance score for 2SNs and 3SNs, respectively. In our study, we use both 2SNs and 3SNs. 2SNs are explored by leveraging their similarity scoring output, which is more commonly described in the literature. We also consider 3SNs due to their superior discriminative performance compared to 2SNs, especially in challenging tasks. While 2SNs produce similarity scores, 3SNs generate distance values. Throughout the manuscript, similarity will refer to an underlying architecture in the form of a 2SN, while distance will be used for 3SNs. In 2SNs, the query input x and the support set instance \(s_i\) are provided to the network, and the changes in the similarity score are captured after each \(s_i\) perturbation run. On the other hand, 3SNs take three inputs and are typically trained using a Triplet Loss function (Schroff et al., 2015), which enforces that dissimilar pairs are separated by a certain margin compared to similar pairs. The three inputs for the network are the query instance x, a positive sample \(x_{=}\) from the same class as the query, and a negative sample \(x_{\ne }\). 3SNs are designed to measure the distance score between the \(<x, x_{=}>\) and \(<x, x_{\ne }>\) pairs independently, by means of the same distance function and prior to the application of the triplet loss function. This creates a point of attachment for our explainers within the 3SN architecture, which is not necessary for 2SNs. In order to use sinex within this context, \(x_{\ne }\) is disregarded and the support sample \(s_i\) plays the role of the positive sample \(x_{=}\) in the \(<x, x_{=}>\) pair distance scoring.
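The following sketch shows one way to expose a single pairwise scoring function from either architecture, as described above; the `embedding_network` attribute, the Keras-style prediction calls, and the Euclidean distance are assumptions standing in for the trained model's actual components.

```python
import numpy as np

def make_score_fn(model, is_triplet):
    """Wrap a trained SN into a pairwise scoring function f(x, s_i).

    For a 2SN, the (Keras-style) model maps the pair (x, s_i) to a similarity in [0, 1].
    For a 3SN, only the shared embedding branch and the distance function are reused,
    treating s_i as the positive sample x_= and discarding x_!=."""
    if not is_triplet:
        return lambda x, s: float(model.predict([x[None], s[None]], verbose=0))
    embed = model.embedding_network            # hypothetical attribute holding the shared branch
    def distance(x, s):
        ex = embed.predict(x[None], verbose=0)
        es = embed.predict(s[None], verbose=0)
        return float(np.linalg.norm(ex - es))  # Euclidean distance, before the triplet loss
    return distance
```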

5 Experiments

This section describes the experiments we conducted on five different datasets to validate sinex and sinexc, both implemented in Python. After presenting the experimental setting, we report a qualitative and quantitative evaluation of the explanations. Then, we analyze sinex and sinexc using both the \(\omega {=} \mathtt {\lnot }\) and \(\omega {=} \texttt{1}\) perturbation procedures. We also compare sinex against conventional explainability techniques to assess their capability in identifying informative segments within the broader SN architecture. Finally, we further investigate positive and negative contribution segments using the novel wpe and wne metrics.

Table 1 Left: characteristics of the datasets and their dimensions before (pre-dims) and after our processing (post-dims), with n being the total number of records, and L the number of classes used at training time

5.1 Experimental setting

We selected five diverse datasets for classification, encompassing two image datasets, namely Omniglot (OGT) and Caltech-UCSD Birds 200 (CUB), a time-series dataset known as FiftyWords (50W), and two audio datasets, AudioMNIST (AST) and ESC-50 (ESC). Our selection of these datasets was guided by the aim of developing few-shot learning models, a task that is greatly enhanced by the presence of a large number of classes during training. In Table 1 we report a comprehensive overview of the datasets, showing the total number of records, their data type, and their dimensions before and after the pre-processing. For consistency across all datasets, we utilized five classes for validation and another five distinct classes for testing. Importantly, the testing classes were entirely separate from the training classes, ensuring a robust evaluation of our models. Details about the datasets and the pre-processing we performed are available in Appendix A.

We tailored the choice of SN architecture to align with the characteristics of each dataset. For the AST, ESC, OGT and 50W datasets, following established practices (Koch et al., 2015; Honka, 2019; Acconcjaioco, 2020; Zhang, 2019), we employed a 2-branched SN with slight variations to suit each dataset’s unique features. These 2SNs consist of two convolution-based encoders, a distance layer, and a final output similarity scoring layer. In contrast, for the CUB dataset, due to its more challenging discriminative nature (3 RGB channels), we implemented a 3SN architecture, in line with Hoffer (2015). This 3SN architecture includes three encoding branches sharing the same architecture and weights. It differs from the 2SNs in terms of output and training requirements. Specifically, the 2SNs used a binary cross-entropy loss for classifying input pairs as belonging to the same class (label 1) or different classes (label 0). On the other hand, the 3SN used a Triplet Loss (Schroff et al., 2015) with a margin of 0.5 to encourage a clear separation between similar and dissimilar input pairs. More details regarding the architectures and training processes can be found in Appendix B.
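For illustration, a minimal Keras sketch of such a 2SN is reported below: two inputs pass through a shared convolutional encoder, an L1 distance layer compares the embeddings, and a sigmoid unit outputs the similarity score trained with binary cross-entropy. Layer sizes and the input shape are illustrative and do not reproduce the exact configurations detailed in Appendix B.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_2sn(input_shape=(64, 64, 1)):
    # Shared convolutional encoder (layer sizes are illustrative).
    encoder = tf.keras.Sequential([
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
    ])
    a = layers.Input(shape=input_shape)
    b = layers.Input(shape=input_shape)
    # L1 distance between the two embeddings, followed by a sigmoid similarity score.
    dist = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([encoder(a), encoder(b)])
    similarity = layers.Dense(1, activation="sigmoid")(dist)
    model = Model(inputs=[a, b], outputs=similarity)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```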

We evaluated the SN performance by conducting assessments every 100 training epochs on 300 randomly generated 5-way 1-shot tasks. In each task, the model’s objective was to classify a given query input, denoted as x, by comparing it with each support sample \(s_i \in S\) (as described in Sect. 3). Successful classification occurred when the class \(y_i\) of the support sample \(s_i\) with the highest similarity score (or lowest distance) matched the class of the query input x. Importantly, all evaluations were performed using sets that exclusively contained unseen classes, ensuring that no samples from the test classes had been encountered during the model’s training. We terminated the training procedure for all datasets when the model did not exhibit an improvement in 5-way 1-shot accuracy for 10 consecutive evaluation runs. Table 1 shows the mean 5-way one-shot accuracy for each dataset, as well as the accuracy on each individual class, to provide a complete overview of the model’s performance. Although the goal of this study is not to develop the best-performing SN in the given settings, the average accuracy we relied on can be considered satisfactory for 5-way 1-shot learning. The final mean accuracy of the SNs used in this study is in line with, and in the case of ESC better than (Honka, 2019), state-of-the-art networks in the same setting.

To maintain consistency across experiments, we used a uniform setting for the replacing value (the c parameter in Algorithm 1 and Algorithm 2). This involved selecting a silence value of \(-80\) dB for spectrograms (AST and ESC), a white background for grayscale images and converted time series (OGT and 50W), and a gray RGB color [128, 128, 128] for CUB. The segmentation algorithms, referred to as \(seg\) in Algorithm 1 and Algorithm 2, and their parameter configurations were selected according to the specific data type. Appendix C provides a detailed description of the procedure that guided us in this selection and the parameters chosen for all experiments. Lastly, we set the sinexc \(\alpha\) parameter to 200 for all experiments. Preliminary tests indicated that increasing this value further did not yield any improvement in terms of measured performance and would only negatively impact run times. Also, in Appendix G, we show how sinex explanations remain coherent across different sample augmentations, provided that the SN’s performance is not affected by the shifting.

Fig. 5

sinex\(_\texttt{1}\) explanation on a 5-way 1-shot task on the AST dataset. Class labels of the test set are composed of one female speaker (\(y = 56\)) and four male speakers. Best viewed in colors (Color figure online)

Fig. 6

sinex\(_\texttt{1}\) explanation on a 5-way 1-shot task on the 50W dataset. Each time-series represents the height profile of a written word. Best viewed in colors (Color figure online)

5.2 Qualitative evaluation

Figures 1, 5, and 6 showcase examples of sinex explanations on CUB, AST, and 50W, respectively. We remind the reader that blue areas represent segments of negative influence, while red portions indicate segments that positively affect the similarity score outcome. Grey areas are instead neutral to the SN classification process.

Concerning Fig. 1, sinex highlights that the correct classification of the Rose breasted Grosbeak class in CUB depends primarily on the bird’s red breast color, while there is a possible misclassification error towards the Summer Tanager class due to their similar red body color. This error is more likely to occur when the support set sample of the Rose breasted Grosbeak class contains a distant bird, making it difficult for the network to identify the red breast and leading it to rely more on the full red body of the Summer Tanager bird. An example of such a misclassification is illustrated in Fig. 11 in Appendix E.

Figure 5 shows an example of an explanation for AST. The query is a male-produced audio sample that scores a .93 similarity value with the correct support sample \(s_1\), while also triggering \(s_2\) and \(s_5\), which reach scores of .25 and .45, respectively. The explanation helps in understanding that \(s_1\) is mainly composed of positive segments, whose absolute influence is larger than that of the positive segments of the other two samples. In addition, samples \(s_2\) and \(s_5\) present spectrogram portions that are marked as negatively impacting the model outcome, therefore decreasing the similarity score. Using sinex, we analyzed several 5-way one-shot tasks for AST and found that correct classifications of male/female speaker recordings depend mainly on medium-high/low frequency segments respectively, while misclassifications are primarily due to segments at the opposite end of the frequency spectrum. An example of such a misclassification is illustrated in Fig. 13 in Appendix E. Additionally, unlike our previous study, where the \(\omega {=} \lnot\) perturbation procedure is discussed (Fedele et al., 2022), we found that setting \(\omega {=} \texttt{1}\) removes the issue of explanations relying on silent areas in spectrograms.

In Fig. 6, we report a sinex explanation for 50W, which reveals why the SN correctly classifies the query class labeled as 35 and highlights possible misclassifications with class 5. The support sample \(s_1\), which belongs to the same class as the query sample x, achieves a .91 similarity score thanks to a distinctive up-down-up trend at the beginning of the time series. In contrast, \(s_3\) has a more relaxed curve drop that resembles the query input, which could lead to a misclassification with class 5. This is a fortunate case for \(s_1\), since \(s_3\) reaches a very close similarity score of .90. Since the \(y = 35\) and \(y = 5\) classes are very similar, the form of the query sample is fundamental for the final classification. We found that most of the \(y = 35\) time series share the fast up-down-up trend of \(s_1\), but the variability of word outlines can make them more similar to other classes. An example of such a misclassification is illustrated in Fig. 12 in Appendix E.

5.3 Quantitative evaluation

We quantitatively evaluated the explanations generated by sinex following the methodology described in Petsiuk (2018). In particular, we calculated the insertion and deletion scores by incrementally adding or removing the most influential pixels identified by our explainers, starting from an empty or full object, respectively. We expect the insertion curve to exhibit a rapid increase after inserting only a small percentage of relevant pixels, resulting in a large insertion-area-under-curve \(iAUC\). Conversely, we expect the deletion curve to exhibit a rapid decrease after only a few pixel removals, resulting in a small deletion-area-under-curve \(dAUC\). A value close to one for \(iAUC\), or close to zero for \(dAUC\), indicates that the explainer successfully identifies the important pixels for the classification process (Petsiuk, 2018). As a preliminary experiment, we studied the impact of \(\beta\) on sinexc. Detailed results are available in Appendix D. Based on these findings, we identified, for each dataset, the \(\beta\) prioritizing a balance between high \(iAUC\) and low \(dAUC\), and used it in the following experiments. Our selections are listed in Table 4 in the same appendix. In our evaluation, we included the Gradient-weighted Class Activation Mapping (grad-cam) and Epsilon Layer-wise Relevance Propagation (\(\epsilon\)-LRP) techniques (Selvaraju, 2020; Bach et al., 2015) for comparison. Differently from sinex, grad-cam utilizes a gradient-based approach and is commonly employed to analyze how convolution-based neural networks break down matrix-like inputs (Moujahid et al., 2022; Majid et al., 2022) at different convolutional stages. In our study, we use grad-cam on the last layer of the CNN responsible for embedding, as it captures the final attention towards the input. The LRP technique explains predictions by back-propagating the outcome through the model to assign a relevance score to each layer, employing specifically designed propagation rules. In our application of the LRP technique, we utilize the \(\epsilon\) rule, referred to as \(\epsilon\)-LRP henceforth, with \(\epsilon =1\), targeting the entire CNN for enriched explainability. While grad-cam and \(\epsilon\)-LRP are established techniques in the literature, they do not provide complete explanations for the SN as sinex does, as they cannot be applied on the distance layer. Therefore, these methods should not be considered “proper baselines” in the conventional sense, but rather comparisons for XAI techniques in the context of few-shot learning. The purpose of this comparison is to assess the degree of alignment between the features emphasized by the embedding networks and those utilized by the SN in its final scoring process. We compared the performance of sinex with its coalition version, sinexc, using both the \(\omega {=} \lnot\) and \(\omega {=} \texttt{1}\) perturbation procedures. The rationale for this comparison lies in assessing how different perturbation methods affect the explanation generation process. We measured the mean \(iAUC\) and \(dAUC\) values for 150 5-way one-shot tasks per dataset, split into 30 experiments for each of the 5 test classes. The results in Table 2 compare sinex\(_\lnot\), sinex\(_\texttt{1}\), sinexc\(_\lnot\), sinexc\(_\texttt{1}\), grad-cam and \(\epsilon\)-LRP for the same one-shot tasks. Bold highlights the best scores, aiming for high \(iAUC\) and low \(dAUC\). Also, Fig. 7 presents the average \(iAUC\) and \(dAUC\) curves for each dataset, showcasing the performance of the different explainability algorithms tested.
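For clarity, the sketch below outlines how such insertion and deletion curves can be computed for a single (query, support) pair; the step granularity, the use of the replacing value c as baseline, and the averaging of multi-channel saliency are assumptions of this illustration rather than the exact evaluation code.

```python
import numpy as np

def insertion_deletion_auc(f, x, s_i, h_i, c, steps=50):
    """Sketch of the insertion/deletion curves for one (query, support) pair.

    Pixels of s_i (assumed float-valued) are ranked by the saliency map h_i;
    insertion starts from an all-c sample and progressively reveals the most
    important pixels, deletion starts from s_i and progressively replaces them with c."""
    saliency = h_i.mean(axis=-1) if h_i.ndim == 3 else h_i
    order = np.column_stack(np.unravel_index(np.argsort(-saliency, axis=None), saliency.shape))
    inserted = np.full(s_i.shape, c, dtype=float)
    deleted = s_i.astype(float).copy()
    ins_scores, del_scores = [], []
    chunks = np.array_split(order, steps)
    for chunk in chunks:
        rows, cols = chunk[:, 0], chunk[:, 1]
        inserted[rows, cols] = s_i[rows, cols]
        deleted[rows, cols] = c
        ins_scores.append(f(x, inserted))
        del_scores.append(f(x, deleted))
    xs = np.linspace(1.0 / len(chunks), 1.0, num=len(chunks))
    return np.trapz(ins_scores, xs), np.trapz(del_scores, xs)  # iAUC, dAUC
```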

Table 2 Mean \(iAUC\), \(dAUC\), and execution time (in min) for the same randomly generated 150 tasks of 5-way one-shot learning for each of the five listed datasets

Our analysis points to \(\omega {=} \texttt{1}\) as the more effective perturbation procedure. Specifically, sinex\(_\texttt{1}\) outperforms other variations, showing excellent performance on four out of five datasets. While its coalition counterpart, sinexc\(_\texttt{1}\), surpasses it marginally only in 50W, the difference is not significant. For the most part, \(\omega {=} \texttt{1}\) consistently yields satisfactory \(iAUC\) and \(dAUC\) results. Notably, it excels in CUB and AST, with commendable performance in ESC and 50W. Despite having the best \(dAUC\) among all datasets, OGT exhibits the poorest performance in terms of \(iAUC\), with the lowest value compared to other datasets. This could be attributed to the distinctive nature of OGT, featuring black characters on a white background. The immediate increase in \(iAUC\) is not attained, as, at initial insertion percentages, the majority of the sample comprises only the background color. Consequently, a substantial number of black pixel insertions is required to convey valuable information to the model. This behavior is somewhat reflected in 50W, which consists of black lines on a white background. Despite having the best \(dAUC\) among all datasets (0.14), the \(iAUC\) value for 50W is the second lowest after OGT. In general, sinexc\(_\texttt{1}\) does not show any significant improvement over sinex\(_\texttt{1}\). However, the good performance of sinexc\(_\texttt{1}\) on 50W suggests that coalition-based perturbation techniques could be useful in certain scenarios.

Turning our attention to \(\omega {=} \lnot\), we observe a decline in performance compared to the \(\omega {=} \texttt{1}\) configuration, affecting both sinex and sinexc. The insertion and deletion curves lack the desired abrupt rise and fall, suggesting a less than ideal perturbation approach. This observation is consistent with the results of our quantitative \(\beta\) analysis in Appendix D. Similarly to the findings under \(\omega {=} \texttt{1}\), we do not observe significant enhancements in \(iAUC\) or \(dAUC\) when sinexc is used in the \(\omega {=} \lnot\) setting. This suggests that, to discover a given input segment’s contribution to the final outcome, it is much more effective to deactivate the segment while keeping intact the context in which it is included (\(\omega {=} \texttt{1}\)), rather than deactivating the whole context to keep only the segment under analysis active (\(\omega {=} \lnot\)). This behavior, which aligns with human intuition when evaluating the similarity between objects, is mirrored in the \(\omega {=} \texttt{1}\) perturbation approach of our explainer.

Furthermore, when comparing to the \(\omega {=} \texttt{1}\) configuration, grad-cam and \(\epsilon\)-LRP typically demonstrate poorer performance across various datasets. Exceptions include 50W, where grad-cam performs well in both measures simultaneously, and in the cases of OGT-\(dAUC\) and CUB-\(iAUC\) separately. These results indicate that the pixel areas considered interesting for the embedding sub-network within the SN architecture, according to grad-cam and \(\epsilon\)-LRP, may not be essential for the SN’s overall scoring. This conclusion is in line with our understanding that sinex is designed to provide a comprehensive explanation of the entire SN, while grad-cam and \(\epsilon\)-LRP are not.

The runtime performance analysis of the compared explainability methods reveals that sinex outperforms its coalition version, being 12 times faster across all datasets. This observation is intuitive, given that sinexc has to navigate through the extra \(\alpha\) coalition perturbations. Notably, the perturbation procedure \(\omega\) does not impact the execution time for either sinex or sinexc. While the grad-cam technique stands out as the fastest overall, it comes at the expense of the measured performance in terms of iAUC and dAUC. This efficiency in grad-cam’s runtime is primarily attributed to its gradient-based approach, requiring only forward and backward passes through the CNN. On the other hand, sinex and sinexc are perturbation-based approaches, measuring the output difference after a set of perturbations. Such perturbation-based approaches are generally slower than gradient-based ones, as confirmed by our experiments. The slowest execution times for both sinex and sinexc are observed for the largest dataset, CUB (3 channels). Additionally, the increased execution times on the CUB dataset can be attributed to its use of a 3SN, which is computationally more intensive than a 2-branched SN. This is due to the presence of three duplicated embedding CNNs and three inputs, as opposed to two in the 2SN. Interestingly, sinex exhibits time performance similar to the gradient-based technique \(\epsilon\)-LRP. However, we remark that both sinex and sinexc have potential for time-performance improvement. Indeed, in sinex, the perturbations of individual segments (Algorithm 1, line 7) are independent and could be parallelized. Similarly, in sinexc, perturbations of single segments (Algorithm 2, line 10) could be computed in parallel once the \(\alpha\) coalitions are generated and shared (Algorithm 2, line 7). Also, it is important to note that the time performance is heavily reliant on the segmentation algorithm \(seg\). Therefore, a preliminary assessment to select an appropriate segmentation algorithm and parameter settings is recommended. In Appendix C we discuss our proposal for selecting the segmentation technique \(seg\) for different data types.

Fig. 7 Insertion (left) and deletion (right) curves for sinex\(_\texttt{1}\) on AST, ESC, OGT, CUB and sinexc\(_\texttt{1}\) on 50W in the first row. The second and third rows respectively display curves for grad-cam and \(\epsilon\)-LRP on the same datasets (Color figure online)

Table 2 shows that the optimal configurations for each dataset are sinex\(_\texttt{1}\) for AST, ESC, OGT, and CUB, and sinexc\(_\texttt{1}\) for 50W. When examining the \(iAUC\) curves of sinex or sinexc, we observe a consistent and smoother trend across all datasets, differing from both the grad-cam and \(\epsilon\)-LRP behavior. The sinex and sinexc \(iAUC\) curves display a swift rise in similarity scores, followed by stabilization. Conversely, grad-cam responds more slowly and is more susceptible to perturbations, particularly when a high percentage of pixels is inserted. This behavior may arise because the pixels highlighted by grad-cam as crucial for the embedding process might not accurately represent those essential for SN similarity scoring. For instance, consider the AST \(iAUC\) curve for grad-cam. It declines as the last 20% to 10% of remaining pixels are inserted. However, when the last 10% of pixels, i.e., those considered least important by grad-cam, are inserted, the SN's similarity score rebounds from 0.4 to 1. This suggests that the pixels grad-cam deems of marginal importance exert a mixed influence: strongly negative from the last 20% to the last 10%, and strongly positive from the last 10% to the complete picture. Insertion curves for the \(\epsilon\)-LRP technique generally rise more slowly than the grad-cam ones, but are less susceptible to perturbation at high percentages of inserted pixels. Indeed, in the insertion curves of \(\epsilon\)-LRP, there is no dataset where significant drops and subsequent rises occur after inserting more than 80% of pixels, as observed for grad-cam. However, this rise is the slowest among all three methods, mainly because \(\epsilon\)-LRP highlights as the most positively influential the pixel areas that typically lie at the border of the sample's semantic content (i.e., the drawn line in OGT and 50W, or the audible portion with high dB values in audio data). For instance, when applied to OGT, \(\epsilon\)-LRP highlights as important the background pixels at the boundaries of the written character, rather than the character itself. The entire character is considered neutral, leading to an immediate peak where approximately 0.5% of the pixels are inserted. Remarkably, even for its worst-performing \(iAUC\) curve, on the OGT dataset, sinex still outperforms both grad-cam and \(\epsilon\)-LRP.

In terms of \(dAUC\), sinex exhibits smoother trends than grad-cam and \(\epsilon\)-LRP. All dAUC curves, except for ESC, show a rapid drop even before 0.2% of pixels are removed, indicating very good performance. On the other hand, grad-cam performs well only for the 50W and OGT datasets, with poor results for the remaining three, which exhibit a much slower drop and are highly susceptible to perturbations at high percentages of removed pixels. Deletion curves for \(\epsilon\)-LRP are not satisfactory either, since they all appear to achieve the desired drop only after 50% of the pixels are removed. The best \(\epsilon\)-LRP deletion curve is on the ESC dataset, where it drops earlier than sinex and stabilizes at lower similarity scores. In this case, the relevant pixels align with the audible portions of the sample, characterized by higher dB values, while the background has minimal influence. Unfortunately, \(\epsilon\)-LRP explanations for the ESC dataset are quite sparse, with both positive and negative pixels scattered across the entire spectrum, and therefore do not convey clear semantic insights. The superior performance of \(\epsilon\)-LRP on the ESC dataset compared to sinex may be attributed to the fact that sinex relies on super-pixel perturbations, potentially overlooking smaller but equally important pixels scattered sparsely across the sample. The least favorable deletion curve among all tested explainability methods is observed for \(\epsilon\)-LRP on the AST dataset. The curve displays two valleys and one peak when removing 20% to 40% of pixels, suggesting that the importance order returned by \(\epsilon\)-LRP in this case may not align with the actual pixel importance within the samples.
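
For reference, the following sketch shows one way the insertion and deletion areas under the curve could be computed for a pixel-level importance ranking, following the general insertion/deletion scheme of Petsiuk (2018). It assumes a 2-D (grayscale) sample and a `score_fn` standing in for the SN similarity score, and it is not the exact evaluation code used in our experiments.

```python
import numpy as np

def insertion_deletion_auc(sample, support, ranking, score_fn,
                           baseline=0.0, steps=20):
    """Illustrative iAUC/dAUC computation for a 2-D (grayscale) sample.

    `ranking` lists flattened pixel indices from most to least important.
    Insertion starts from a baseline image and adds pixels in ranking order;
    deletion starts from the original image and removes them in the same
    order. Both curves are integrated with the trapezoidal rule over the
    fraction of pixels processed.
    """
    flat = sample.reshape(-1)
    n = len(ranking)
    fractions = np.linspace(0.0, 1.0, steps + 1)
    ins_scores, del_scores = [], []
    for f in fractions:
        k = int(round(f * n))
        idx = ranking[:k]
        inserted = np.full_like(flat, baseline)
        inserted[idx] = flat[idx]        # most important pixels added first
        deleted = flat.copy()
        deleted[idx] = baseline          # most important pixels removed first
        ins_scores.append(score_fn(inserted.reshape(sample.shape), support))
        del_scores.append(score_fn(deleted.reshape(sample.shape), support))
    return np.trapz(ins_scores, fractions), np.trapz(del_scores, fractions)
```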

In general, \(\epsilon\)-LRP performs worse in terms of both iAUC and dAUC compared to grad-cam. Furthermore, while grad-cam achieves acceptable results only for 50W, \(\epsilon\)-LRP consistently fails to exhibit high iAUC and low dAUC simultaneously. Our quantitative analysis indicates that sinex performs exceptionally well when combined with the \(\omega {=} \texttt{1}\) perturbation approach. This not only enhances its explanatory capabilities but also makes it a suitable choice for real-time applications requiring swift model explanations.

5.4 Dependence between positive and negative segments

We delved into an exploration of the dependence between positive and negative contributing segments within sinex. The motivation behind this choice arises from the observation that sinex demonstrates smoother \(iAUC\) and \(dAUC\) curves than \(\epsilon\)-LRP, implying a more precise assessment of the importance order of the various segments. Unlike sinex and \(\epsilon\)-LRP, grad-cam does not identify positive and negative contributing pixels and is therefore excluded from this analysis. Specifically, we aimed to discern whether positive segments derive their positive influence solely from the presence of their negative counterparts, or whether the reverse holds true. To investigate this, we introduced two novel metrics, wpe and wne, building upon the insertion and deletion processes outlined in Petsiuk (2018). wpe, short for Whole input \(\longleftrightarrow\) Positive only segments \(\longleftrightarrow\) Empty input, involves a two-step deletion process and a two-step insertion process. The deletion process commences with the entire input and progressively removes negative segments to obtain an intermediary state containing only positive segments. Subsequently, all positive segments are removed to arrive at an empty sample. In contrast, the insertion process starts with an empty input and introduces positive segments followed by negative ones to restore the original input. Conversely, wne stands for Whole input \(\longleftrightarrow\) Negative only segments \(\longleftrightarrow\) Empty input. In this scenario, the intermediary state retains solely the negative segments: the deletion process begins with the whole input, removing positive segments first and subsequently eliminating the remaining negative segments, while the insertion process starts with an empty sample and introduces negative segments followed by positive segments to reconstruct the original input. In our insertion and deletion framework, we always add or remove segments from the most to the least influential at each intermediary step, regardless of whether we are using wpe or wne. The idea is that, by analyzing the area under the curve at each intermediary step, we can better determine the dependence between positive and negative segments. The areas under the curve are referred to as \(dAUC\underline{X}_{w}\) and \(iAUC\underline{X}_{w}\), indicating the deletion/insertion of negative segments for \(\underline{X}=N\), or positive segments for \(\underline{X}=P\), with \(w \in \{{{\textsc {wpe}}}, {{\textsc {wne}}}\}\). By examining these areas, we can gain insight into the individual impact of positive and negative segments during both the deletion and insertion processes.
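
The sketch below illustrates the two-step wpe deletion order just described, under the assumption that sinex returns a mapping from segment ids to signed contribution values and that segments are encoded as a label mask; the names are illustrative, and the wne variant is obtained by swapping the two groups.

```python
import numpy as np

def wpe_deletion_curve(sample, support, contributions, seg_mask,
                       score_fn, baseline=0.0):
    """Sketch of the two-step wpe deletion process.

    `contributions` maps segment id -> signed sinex contribution. Negative
    segments are removed first (most to least influential, i.e. most negative
    first), then positive segments (most positive first), yielding the
    intermediary "positive only" state before reaching the empty input.
    """
    negatives = sorted((s for s, c in contributions.items() if c < 0),
                       key=lambda s: contributions[s])    # most negative first
    positives = sorted((s for s, c in contributions.items() if c >= 0),
                       key=lambda s: -contributions[s])   # most positive first
    current = sample.copy()
    scores = [score_fn(current, support)]
    for seg_id in negatives + positives:
        current[seg_mask == seg_id] = baseline
        scores.append(score_fn(current, support))
    # The split between dAUCN_wpe and dAUCP_wpe lies after the negatives.
    return np.array(scores), len(negatives)
```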

Table 3 Deletion and insertion area-under-the-curve (AUC) for wpe and wne for CUB, 50W and AST on both correct and incorrect 5-way one-shot classification tasks
Fig. 8 Deletion (left) and insertion (right) curves for wpe (top row) and wne (bottom row) procedures on the CUB dataset. Blue colored portions indicate the removal or insertion of negative segments; red colored areas represent positive segments. Best viewed in color (Color figure online)

Building on the best-performing sinex/sinexc settings from our prior analysis, we conducted 300 experiments for each dataset, including CUB, 50W, and AST. These experiments were divided into thirty correct and thirty incorrect 5-way one-shot tasks for each of the five test classes. We remind the reader that a correct classification occurs when the class \(y_i\) of the support sample \(s_i\) achieving the highest similarity (or lowest distance) score matches the class of the query input x; otherwise, the classification is incorrect. These experiments differ from the previous ones in two main respects. Firstly, we measure the resulting area under the curve separately for positive and negative contributing segments, following a specified order (Whole input \(\longleftrightarrow\) Positive only segments \(\longleftrightarrow\) Empty input for wpe, and Whole input \(\longleftrightarrow\) Negative only segments \(\longleftrightarrow\) Empty input for wne). In contrast, the previous experiments always considered positive segments first and then negative segments, regardless of the insertion or deletion procedure. Notably, the insertion procedure for wpe and the deletion procedure for wne remain consistent with the previous experiments. Secondly, these experiments include misclassification tasks, whereas the previous experiments focused only on correct classifications. Our results, presented in Table 3, report mean area-under-the-curve values and demonstrate consistent outcomes across datasets and between correct and incorrect classification tasks within the same dataset. This consistency implies the robustness of the segments identified by sinex, which is capable of pinpointing both truly positive and truly negative influencing segments, irrespective of the classification outcome. In Fig. 8, we present the mean wpe and wne insertion and deletion curves for CUB, reflecting the results of 300 5-way one-shot tasks. Comparable curves for 50W and AST are provided in Appendix F.
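
For clarity, a minimal sketch of how a single 5-way one-shot task is labeled as correct or incorrect is given below; it assumes the SN exposes a `score_fn` returning either a similarity (higher is better) or a distance (lower is better), with illustrative names throughout.

```python
import numpy as np

def classify_one_shot(query, support_set, support_labels, score_fn,
                      higher_is_similar=True):
    """Predict the label of `query` in a C-way one-shot task.

    `support_set` holds one sample per class; the prediction is the label of
    the support sample with the best SN score (highest similarity or lowest
    distance, depending on the network's output convention).
    """
    scores = np.array([score_fn(query, s) for s in support_set])
    best = scores.argmax() if higher_is_similar else scores.argmin()
    return support_labels[best]

# A task counts as correct when the returned label matches the query's class.
```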

Our observations for wpe reveal high \(dAUCN_{wpe}\) and low \(dAUCP_{wpe}\) values, signifying that removing only the negative segments first does not cause the curve to drop; instead, it elevates the predicted similarity scores, in line with the definition of negative segments in Sect. 4. The subsequent removal of positive segments causes the desired curve drop, reflected in the low \(dAUCP_{wpe}\). Conversely, the addition of the important positive segments results in a high \(iAUCP_{wpe}\) and an immediate curve increase. The curve typically stabilizes or experiences a minor increase after the introduction of the negative segments, as indicated by \(iAUCN_{wpe}\). In contrast, the results for wne exhibit an immediate drop, reflected in the low \(dAUCP_{wne}\), as soon as the positive segments are removed. The curve may stabilize, increase, or decrease when the negative segments are then removed, as shown by \(dAUCN_{wne}\), implying that negative segments do not exhibit a clearly defined behavior or a strong influence when they are the only segments present in the sample. This behavior is also evident in the \(dAUCN_{wne}\) curves for 50W and AST, presented in Figs. 14 and 15 in Appendix F. The low \(iAUCN_{wne}\) further supports this aspect, with the curve rising immediately as soon as the positive segments are inserted, as indicated by the high \(iAUCP_{wne}\).

In summary, these findings suggest that negative segments rely on positive ones, while positive segments operate independently of the negative ones. By combining sinex with this dependence analysis, developers can be better guided in refining the training of SNs. Since positive segments are independent and have the greatest influence on the classification, if a misclassification is caused by positive segments, it might be necessary to fine-tune the corresponding features these segments focus on, e.g., the red color of the bird's breast. Retraining on areas indicated by negative segments, however, may not be necessary, since they do not have a strong impact.

6 Discussion

The qualitative evaluation of the explanations demonstrates that sinex is effective in uncovering limitations that SNs might encounter. For example, it can reveal an erroneous dependence of the model on specific colors, in the case of RGB images, or on pixels that should not be considered important features, in the case of grayscale images. An instance of this is the model's reliance on the red color in the examples presented for the CUB dataset. To address this issue, a potential approach might consist of training the SN using color masks, selected through sinex on the training samples, to avoid such biases. Additionally, augmentation techniques, such as rotating the training samples, can be effective in preventing the development of strong associations between a class and a specific image region, like the bird's breast, which may appear in different locations across various samples. These suggestions, which stem from a post-hoc sinex-guided analysis, can be addressed in two ways. The first approach involves retraining the SN from scratch with an augmented version of the original training set. The second is a subsequent fine-tuning phase using an additional dataset that targets the limitations highlighted by sinex, such as augmented samples and color masks. In both cases, sinex can guide the choice of the augmentation technique to use.

The quantitative evaluation using \(iAUC\) and \(dAUC\) scores is an effective method for assessing the explainer's performance. This approach might reveal, for instance, potential weaknesses that SNs may encounter when working with grayscale images. The OGT and 50W datasets exhibited the lowest \(iAUC\) values, scoring 0.42 and 0.70, respectively, while both achieved the highest \(dAUC\) score of 0.14. These results suggest that the explainer can indeed identify important pixels, as removing them individually leads to a decrease in the deletion scores. However, introducing only a small portion of important pixels onto a white background still hinders the SN from achieving the expected high similarity score. This behavior may indicate that grayscale images pose a more challenging task for SNs, or that additional training may be necessary. For instance, in the OGT dataset, training samples could be augmented by generating unfilled character outlines. This augmentation phase would enable the network to learn character shapes while emphasizing the concept of a prominent white background. By doing so, the SN can not only capture the complete character shape but also learn that the background itself carries significance.
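
As an illustration of the augmentation suggested above, one possible way to derive unfilled character outlines is a simple morphological erosion of the stroke mask. The sketch assumes grayscale samples with dark strokes on a white background, normalized to [0, 1], and is only one of many ways such outlines could be generated.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def to_outline(sample: np.ndarray, thickness: int = 1) -> np.ndarray:
    """Turn a filled grayscale character into an unfilled outline.

    Assumes dark strokes (low values) on a white background, as in OGT.
    The stroke mask is eroded and subtracted from itself, leaving only a
    thin contour; everything else is set back to white (1.0).
    """
    stroke = sample < 0.5                                  # foreground mask
    interior = binary_erosion(stroke, iterations=thickness)
    contour = stroke & ~interior
    outline = np.ones_like(sample)                         # white background
    outline[contour] = sample[contour]                     # keep contour intensities
    return outline
```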

While sinex serves as a valuable tool for comprehending why SNs classify a query sample with specific labels, its explanations remain local, addressing each C-way k-shot task independently. Therefore, at its current stage, human examination is necessary to go through different sinex explanations and integrate them into a broader, more global understanding of the network's behavior. Explainability is crucial for establishing trust in few-shot learning models, ensuring their safe deployment in real-world applications. Consider, for example, the context of a few-shot learning intrusion detection system for railway video surveillance, as examined in Gong et al. (2021). Although the authors did not develop it using SNs, such an approach remains entirely feasible. In such a scenario, sinex can help evaluate the system's robustness. For instance, consider a situation where misclassifications of night-time intruders are associated with shadows in the pixel area corresponding to the sky. A sinex analysis of the system's classifications might reveal the segments of pixels related to these shadows, and upon human evaluation of the explanations, it may be decided that the system is not safe to deploy in its current state.

7 Conclusion

We have introduced sinex, a local, data-agnostic, post-hoc explainer for Siamese Networks capable of processing image, time-series, and audio inputs in the context of C-way k-shot learning. By using a perturbation-based approach, both sinex and its coalition version sinexc are able to identify the areas that are important for the SN classification, covering both positively and negatively contributing features. sinex enabled us to discover some of the limitations that SNs might encounter during their classification process, such as an undue reliance on colors (in RGB images) or on specific pixel areas (in grayscale images) that should not be considered important. Therefore, sinex provides an effective tool to highlight such limitations and guide a subsequent model re-training phase. Finally, the proposed wpe and wne metrics have allowed us to verify the independence of the positively influencing segments and the robustness of the positive and negative segments on all datasets, for both correct and incorrect classifications.

Future research will focus on several directions. First, a limitation of sinex is that it studies SN behavior locally, on each few-shot task, requiring human oversight across multiple analyses of different tasks to obtain a comprehensive understanding of the network's global behavior; inspired by Setzu et al. (2021), we therefore aim to propose a local-to-global abstraction of the logic learned by the SN. Second, by combining sinex explanations with wpe and wne, we would like to identify those tasks and behaviors that go against the verified segment dependence and might reveal misclassifications. Third, we would like to raise the level of the explanations by linking the influencing positive and negative segments on the support set with positive/negative segments on the query image. Finally, we plan to conduct an extrinsic interpretability evaluation of sinex explanations through a human decision-making task driven by these explanations, which will help us objectively evaluate their effectiveness.