Memory Wrap: a Data-Efficient and Interpretable Extension to Image Classification Models

Due to their black-box and data-hungry nature, deep learning techniques are not yet widely adopted in critical real-world domains, like healthcare and justice. This paper presents Memory Wrap, a plug-and-play extension to any image classification model. Memory Wrap improves both data-efficiency and model interpretability, adopting a content-attention mechanism between the input and some memories of past training samples. We show that Memory Wrap outperforms standard classifiers when it learns from a limited set of data, and it reaches comparable performance when it learns from the full dataset. We discuss how its structure and content-attention mechanism make predictions interpretable compared to standard classifiers. To this end, we show both a method to build explanations by examples and counterfactuals, based on the memory content, and a way to exploit them to gain insights into the decision process. We test our approach on image classification tasks using several architectures on three different datasets, namely CIFAR10, SVHN, and CINIC10.


Introduction
In the last decade, Artificial Intelligence has seen an explosion of applications thanks to advancements in deep learning techniques. Despite their success, these techniques suffer from some important problems: they require a lot of data to work well, and they act as black boxes, taking an input and returning an output without providing any explanation for that decision. The lack of transparency limits the adoption of deep learning in important domains like healthcare and justice, while the data requirement makes it harder to generalize to real-world tasks. Few-shot learning methods and explainable artificial intelligence (XAI) approaches address these problems. The former studies the data requirement, experimenting on a type of machine learning problem where the model can only use a limited number of samples; the latter studies the problem of transparency, aiming at developing methods that can explain, at least partially, the decision process of neural networks. While there is an extensive literature on each topic, few works explore methods that can be used in low data regimes and, at the same time, provide explanations about their outputs. This paper takes a small step in both directions, proposing Memory Wrap, an approach that makes image classification models more data-efficient while providing, at the same time, a way to inspect their decision process. In classical supervised learning settings, models use the training set only to adjust their weights, discarding it at the end of the training process. Instead, we hypothesize that, in a low data regime, it is possible to strengthen the learning process by re-using samples from the training set during inference. Taking inspiration from Memory Augmented Neural Networks [6,25], the idea is to store a set of past training samples (called the memory set) and combine them with the current input through sparse attention mechanisms to help the neural network's decision process. Since the network actively uses these samples during inference, we propose a method based on the inspection of the sparse content attention weights to extract insights and explanations about its predictions.
We test our approach on image classification tasks using CIFAR10 [13], Street View House Number (SVHN) [21], and CINIC10 [4], obtaining promising results. Our contribution can be summarized as follows:
• we present Memory Wrap, an extension for image classification models that uses a memory containing past training examples to enrich the input encoding;
• we show that it makes the original model more data-efficient, achieving higher accuracy in low data regimes;
• we discuss methods to make its predictions more interpretable. In particular, we show that not only is it possible to extract the samples that actively contribute to the prediction, but we can also measure how much they contribute. Additionally, we show a method to retrieve similar examples from the memory that allow us to inspect which features are important for the current prediction, in the form of explanations by examples and counterfactuals.
The manuscript is organized as follows. Section 2 reviews the existing literature, focusing on works that use methods similar to ours, and discusses the state of the art in network explainability; Section 3 introduces our approach, while Section 4 presents our experiments and their results. Finally, we discuss conclusions, limitations, and future directions.

Memory Augmented Neural Networks
Our work is inspired by current advances in Memory Augmented Neural Networks (MANNs) [6,7,14,25]. MANNs use an external memory to store and retrieve data during input processing. They can store past steps of a sequence, as in recurrent architectures for sequential tasks, or they can store external knowledge in the form of a knowledge base [5]. Usually, the network interacts with the memory through attention mechanisms, and it can also learn how to write and read the memory during the training process [6]. Differentiable Neural Computers [7] and End-To-End Memory Networks [25] are popular examples of this class of architectures. Researchers have applied them to several problems, such as visual question answering [19], image classification [2], and meta-learning [23], with strong results.
Similarly to MANNs, Matching Networks [29] use a set of never-before-seen samples to boost the learning of a new class in one-shot classification tasks. Unlike our approach, their architecture is standalone, and it applies the product of the attention weights to the labels of the sample set in order to compute the final prediction. Conversely, Prototypical Networks [24] use samples of the training set to perform metric learning and return predictions based on the distance between prototypes in the embedding space and the current input. Our approach relies on similar ideas, but it uses a memory set that contains already seen and already learned examples in conjunction with a sparse attention mechanism. While we adopt a similarity measure to implement our attention mechanism, we do not use prototypes or learned distances: the network itself learns to choose which features should be retrieved from each sample and which samples are important for a given input. Moreover, our method differs from Prototypical Networks because it is model-agnostic and can potentially be applied to any image classification model without modifications.

Explainable Artificial Intelligence
Lipton [16] distinguishes between transparent models, where one can unfold the chain of reasoning (e.g., decision trees), and post-hoc explanations, which explain predictions without looking inside the neural network. The latter category includes explanations by examples and counterfactuals, which are the focus of our method.
Explanations by examples aim at extracting representative instances from given data to show how the network works [1]. Ideally, the instances should be similar to the input and, in classification settings, predicted in the same class. In this way, by comparing the input and the examples, a human can extract both the similarities between them and the features that the network uses to return its answers.
Counterfactuals are the mirror image of explanations by examples: the instances, in this case, should be similar to the current input but classified in another class. By comparing the input to counterfactuals, it is possible to highlight differences and to extract the edits that should be applied to the current input to obtain a different prediction. While for tabular data it is feasible to get counterfactuals by changing features while respecting domain constraints [20], for images and natural language processing the task is more challenging, due to the lack of formal constraints and the extremely large range of features that could be changed.
Recent research on explanations by examples and counterfactuals adopts search methods [30,18], which have high latency due to the large search space, and Generative Adversarial Networks (GANs). For example, Liu et al. [17] use GANs to generate counterfactuals for images but, since GANs are black boxes themselves, it is difficult to understand why a particular counterfactual is or is not a good candidate.
For small problems, techniques like KNN and SVM [3] can easily compute the neighbors of the current input based on distance measures and use them as example-based explanations. Unfortunately, for problems involving neural networks and a large number of features, it becomes less trivial to find a distance metric that both takes into account the different feature importances and is effectively linked to the neural network's decision process. An attempt in this direction is the twin-system proposed by Kenny and Keane [11], which combines case-based reasoning (CBR) systems and neural networks. The idea is to map the latent space or neural weights to white-box case-based reasoners and extract explanations by examples from them. With respect to these approaches, our method is intrinsic, meaning that it is embedded inside the architecture and, more importantly, directly linked to the decision process, actively contributing to it. Our method does not require external architectures like GANs or CBR systems, and it does not add any latency.

Memory Wrap
This section describes the architecture of Memory Wrap and a methodology to extract example-based explanations and counterfactuals for its predictions.

Architecture
Memory Wrap extends existing classifiers, specialized in a given task, by replacing the last layer of the model. Specifically, it includes a sparse content-attention mechanism and a multi-layer perceptron that work together to exploit the combination of an input and a set of training samples. In this way, the pre-existing model acts as an encoder, focused on extracting input features and mapping them into a latent space. Memory Wrap stores previous examples (memories) that are then used at inference time. The only requirement for the encoder is that its last layer before the Memory Wrap outputs a vector containing a latent representation of the input. Clearly, the structure of the encoder impacts the representation power, so we expect that a better encoder architecture could further improve the performance of Memory Wrap.

Figure 2: Sketch of the system architecture. The system encodes the input and a set of training samples using a chosen neural network. Then, it generates a memory vector as a weighted sum of the memory set based on the sparse content attention weights between the encodings. Finally, the last layer predicts the input class, taking as input the concatenation of the memory vector and the input encoding.
More formally, let g(x) be the whole model, f(x) the encoder, x_i the current input, and S_i = {x^i_{m_1}, x^i_{m_2}, ..., x^i_{m_n}} a set of n samples, called the memory set, randomly extracted from the training set at the current step i. First, the encoder f(x) encodes both the input and the memory set, projecting them into the latent space and returning, respectively,

e_{x_i} = f(x_i),   m^i_j = f(x^i_{m_j}), j = 1, ..., n.   (1)

Then, Memory Wrap computes the sparse content attention weights as the sparsemax [22] of the similarity between the input encoding and the memory set encodings, thus attaching a content weight w_j to each encoded sample m^i_j. We compute the content attention weights using the cosine similarity, as in Graves et al. [7], replacing the softmax function with a sparsemax:

w_j = sparsemax_j( cos(e_{x_i}, m^i_j) ).   (2)

Since we are using the sparsemax function, the memory vector only includes information from a few samples of the memory. In this way, each sample contributes in a significant way, helping us to achieve output explainability. Similarly to [7], we compute the memory vector v_{S_i} as the weighted sum of the memory set encodings, where the weights are the content attention weights:

v_{S_i} = Σ_j w_j m^i_j.   (3)

Finally, the last layer l_f takes the concatenation of the memory vector and the encoded input and returns the final output:

o_i = g(x_i) = l_f([e_{x_i}; v_{S_i}]).   (4)

The role of the memory vector is to enrich the input encoding with additional features extracted from similar samples, possibly missing from the current input. On average, considering the whole memory set and thanks to the cosine similarity, strong features of the target class will be more represented than features of other classes, helping the network in the decision process. In our case, we use a multi-layer perceptron with a single hidden layer as the final layer, but other choices are possible (App. A.2).
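To make the mechanism concrete, the forward pass above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under our own naming (`sparsemax`, `memory_wrap_forward`), not the authors' code; the sparsemax follows the closed-form projection of Martins and Astudillo [22].

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex.

    Unlike softmax, it assigns exactly zero to sufficiently small scores.
    """
    z_sorted = np.sort(z)[::-1]                    # scores in decreasing order
    cum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cum               # indices inside the support
    k_z = k[support][-1]
    tau = (cum[support][-1] - 1.0) / k_z           # threshold
    return np.maximum(z - tau, 0.0)

def memory_wrap_forward(e_x, M, last_layer):
    """e_x: (d,) input encoding; M: (n, d) memory encodings; last_layer: callable on (2d,)."""
    # cosine similarity between the input encoding and each memory encoding
    sims = (M @ e_x) / (np.linalg.norm(M, axis=1) * np.linalg.norm(e_x) + 1e-8)
    w = sparsemax(sims)                            # sparse content attention weights
    v = w @ M                                      # memory vector: weighted sum of encodings
    return last_layer(np.concatenate([e_x, v])), w
```

Because sparsemax zeroes out the weights of dissimilar memories, only a handful of samples enter the memory vector, which is what later makes the attention weights usable for explanations.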

Getting explanations
We aim at two types of explanations: explanations by examples and counterfactuals. The idea is to exploit the memory vector and the content attention weights to extract explanations about model outputs, in a similar way to La Rosa et al. [15]. To understand how, let us consider the current input x_i, the current prediction g(x_i), and the encoding matrix M_{S_i} of the memory set, where each m^i_j ∈ M_{S_i} is associated with a weight w_j. We can split the matrix M_{S_i} into three disjoint sets: M_e = {m^i_j : g(x^i_{m_j}) = g(x_i), w_j > 0} contains the encodings of samples predicted by the network in the same class g(x_i) and associated with a weight w_j > 0; M_c = {m^i_j : g(x^i_{m_j}) ≠ g(x_i), w_j > 0} contains the encodings of samples predicted in a different class and associated with a weight w_j > 0; and M_z contains the encodings of samples associated with a weight w_j = 0. Note that this last set does not contribute at all to the decision process, and it cannot be considered for explainability purposes. Conversely, since M_e and M_c have positive weights, they can be used to extract explanations by examples and counterfactuals.
Let us consider, for each set, the sample x^i_{m_j} associated with the highest weight. A high weight w_j means that the encoding of the input x_i and the encoding of the sample x^i_{m_j} are similar. If x^i_{m_j} ∈ M_e, then it can be considered a good candidate for an explanation by example, being an instance similar to the input and predicted in the same class, as defined in Sect. 2.2. Instead, if x^i_{m_j} ∈ M_c, then it is considered a counterfactual, being similar to the input but predicted in a different class. Finally, consider the sample x^i_{m_k} associated with the highest weight in the whole set M_{S_i}. Because its weight is the highest, it will be heavily represented in the memory vector, which actively contributes to inference as input to the last layer. This means that common features between the input and the sample x^i_{m_k} are highly represented, and so they constitute a good explanation. Moreover, if x^i_{m_k} is a counterfactual, because it is partially included in the memory vector, its class is likely to be the second or third predicted class, also giving information about the "doubts" of the neural network.
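Given the attention weights and the per-sample predictions, the selection rule above can be sketched as follows (illustrative code; the helper name and signature are our own, not the authors' API):

```python
import numpy as np

def extract_explanations(w, mem_preds, input_pred):
    """Return indices of the best explanation-by-example and counterfactual.

    w: (n,) sparse content attention weights; mem_preds: (n,) predicted class
    of each memory sample; input_pred: predicted class of the current input.
    Samples with w == 0 (the set M_z) are excluded: they play no role in inference.
    """
    w = np.asarray(w, dtype=float)
    mem_preds = np.asarray(mem_preds)
    active = w > 0
    same = active & (mem_preds == input_pred)   # M_e: candidate examples
    diff = active & (mem_preds != input_pred)   # M_c: candidate counterfactuals
    example = int(np.argmax(np.where(same, w, -np.inf))) if same.any() else None
    counterfactual = int(np.argmax(np.where(diff, w, -np.inf))) if diff.any() else None
    return example, counterfactual
```

When `counterfactual` is `None`, all the active memories share the input's predicted class, which Sect. 4 interprets as the model being confident.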

Results
This section first describes the experimental setups, and then it presents and analyzes the obtained results for both performance and explanations.

Setup
We test our approach on image classification tasks using the Street View House Number (SVHN) [21], CINIC10 [4], and CIFAR10 [13] datasets. For the encoder f(x), we run our tests using ResNet18 [8], EfficientNet-B0 [28], MobileNet-v2 [9], and other architectures whose results are reported in App. A.5. We randomly split the training set to extract smaller sets of size {1000, 2000, 5000}, thus simulating a low data regime, and then train each model both on these sets and on the whole dataset. At each training step, we randomly extract 100 samples from the training set and use them as the memory set, about 10 samples per class (see App. A.7 and App. A.6 for further details about this choice). We run 15 experiments for each configuration, fixing the seeds for each run and therefore training each model under identical conditions. We report the mean accuracy and the standard deviation over the 15 runs for each model and dataset. For further details about the training setup, please consult App. A.1.

Baselines
Standard. This baseline is obtained with the classifiers f(x) without any modification, trained in the same manner as Memory Wrap (i.e., same settings and seeds).
Only Memory. This baseline uses only the memory vector as the input of the multi-layer perceptron, removing the concatenation with the encoded input. Therefore, the output is given by o_i = g(x_i) = l_f(v_{S_i}). In this case, the input is used only to compute the content weights, which are then used to build the memory vector, and the network learns to predict the correct answer based on them. Because of the randomness of the memory set and the absence of the encoded input image as input to the last layer, the network is encouraged to learn more general patterns and not to exploit specific features of the given image.

Performance
In low data regimes, our method outperforms the standard models on all the datasets, sometimes by a large margin (Table 1, Table 3, and Table 2). First, we can observe that the gain in performance depends on the encoder used: MobileNet shows the largest gap on all the datasets, while ResNet shows the smallest one, representing a challenging model for Memory Wrap. Secondly, it depends on the dataset, with the gains in each SVHN configuration always greater than those in CIFAR10 and CINIC10. The baseline that uses only the memory also outperforms the standard baseline, reaching nearly the same performance as Memory Wrap in most of the settings. However, its performance appears less stable across configurations, being lower than Memory Wrap in some SVHN and CINIC10 settings (Table 1 and Table 3) and lower than the standard models in some full-dataset scenarios and in some configurations of CINIC10. These considerations are confirmed on the other architectures reported in App. A.5. We hypothesize that the additional information captured by the input encoding allows the model to exploit additional shortcuts and to reach the best performance.
Note that it is possible to increase the gap by adding more samples to the memory, at the cost of increased training and inference time (App. A.7). Moreover, while in low data regimes the performance of standard neural networks shows high variance, Memory Wrap appears considerably more stable, with a lower standard deviation.
When Memory Wrap learns from the full dataset (Table 4), it reaches comparable performance most of the time. Hence, our approach is useful also when used with the full dataset, thanks to the additional interpretability opportunities provided by its structure (Section 3.2).

Explanations
From now on, we consider MobileNet-v2 as our base network, but the results are similar for all the considered models and configurations (App. A.4 and A.8). The first step we can take to extract insights about the decision process is to check which samples in the memory set have positive weights, i.e., the set M_c ∪ M_e. Figure 3 shows this set ordered by the magnitude of the content weights for four different inputs: each pair shares the same memory set as additional input, but each set of used samples (those associated with a positive weight) is different. In particular, consider the images in Figure 3a, where the only change is a lateral shift made to center different numbers. Despite their closeness in the input space, the samples in memory are totally different: the first set contains images of "5" and "3", while the second set contains mainly images of "1" and a few images of "7". We can infer that the network is probably focusing on the shape of the number in the center to classify the image, ignoring colors and the surrounding context. Conversely, in Figure 3b the top samples in memory are images with similar colors and different shapes, telling us that the network is wrongly focusing on the association between the background color and the color of the object in the center. This means that inspecting the samples in the set M_c ∪ M_e can give us insights about the decision process. Once we have defined the nature of the samples in the memory set that influence the inference process, we can check whether the content weight ranking is meaningful for Memory Wrap predictions. To verify that this is the case, consider the most represented sample inside the memory vector (i.e.
the sample x^i_{m_k} associated with the highest content weight). Then, let g(x^i_{m_k}) be the prediction obtained by replacing the current input with this sample and the current memory set S_i with a new one. If the sample significantly influences the decision process and can be considered a good proxy for the current prediction g(x_i) (i.e., a good explanation by example), then g(x^i_{m_k}) should be equal to g(x_i). Therefore, we define the explanation accuracy as a measure of how often the sample in the memory set with the highest weight is predicted in the same class as the current image. Table 5 shows the explanation accuracy of MobileNet-v2 in all the considered configurations. We observe that Memory Wrap reaches high accuracy, meaning that the content weight ranking is reliable. Additionally, its accuracy is very close to that of the baseline that uses only the memory, despite the latter being favored by its design, meaning that the memory content heavily influences the decision process.

Figure 4: Integrated Gradients heatmaps of the input, the explanation by example associated with the highest weight in memory, and (when present) the counterfactual associated with the highest weight. Each heatmap highlights the pixels that have a positive impact towards the current prediction.
Clearly, the same test cannot be applied to counterfactuals because, by construction, they are samples of a different class. However, we can inspect what happens when a counterfactual is the sample with the highest weight. We find (Table 6) that the model accuracy is much lower in these cases, meaning that its predictions are often wrong, and one can use this information to alert the user that the decision process could be unreliable.
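The explanation-accuracy check reduces to an agreement rate between each input's prediction and the prediction obtained for its top-weight memory sample. A minimal sketch (our own helper, not the authors' evaluation code):

```python
import numpy as np

def explanation_accuracy(input_preds, top_mem_preds):
    """Fraction of inputs whose top-weight memory sample is predicted in the same class.

    input_preds[i]  : class predicted for input i
    top_mem_preds[i]: class predicted when the top-weight memory sample of
                      input i is itself fed to the model (with a fresh memory set)
    """
    input_preds = np.asarray(input_preds)
    top_mem_preds = np.asarray(top_mem_preds)
    return float((input_preds == top_mem_preds).mean())
```

A high value indicates that the content weight ranking is a reliable proxy for the model's decision, which is the quantity reported in Table 5.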
Since the memory is actively used during the inference phase, we can use attribution methods to extract further insights about the decision process (see App. A.3 for a discussion of the choice of the attribution method). For example, Figure 4 shows heatmaps obtained by applying Integrated Gradients [26], a method that exploits gradients to highlight the most relevant pixels for the current prediction. For both Figure 4a and Figure 4d, the model predicts the wrong class. In the 4d case, the heatmap of the explanation by example tells us that the model focuses on the bird and sky colors, ignoring the unusual shape of the airplane, very different from previously seen airplane shapes, which are represented by the counterfactual with a very low weight and a heatmap that focuses only on the sky. Conversely, in the 4c case, the model ignores colors and focuses on the head shape, a feature that is highlighted both in the input image and in the explanations. Finally, sometimes (see Figure 4b) counterfactuals are missing, which means that the model is confident about its prediction and uses only examples of the same class.

Conclusion and future research
In this paper, we presented an extension for neural networks that allows a more efficient use of the training dataset in settings where few data are available. Moreover, we proposed an approach to extract explanations based on similar examples and counterfactuals. Future work could address current limitations, like the memory space needed to store the memory samples and their gradients (App. A.6). Another limitation is that the memory mechanism, being based on similarity, could amplify the bias learned by the encoder. As shown in Sect. 3.2, identifying such an event is straightforward, but currently there are no countermeasures against it. A new adaptive or algorithmic selection mechanism for memory samples, or a regularization method, could mitigate the bias and improve the fairness of Memory Wrap. Finally, the findings of this paper also open up possible extensions to different problems like semi-supervised learning, where the self-uncertainty detection of Memory Wrap could be useful, and domain adaptation.

Datasets. We test our approach on image classification tasks using the Street View House Number (SVHN) dataset [21] (GNU 3.0 license), CINIC10 [4] (MIT license), and CIFAR10 [13] (MIT license). SVHN is a dataset containing ∼73k images of house numbers in natural scenarios. The goal is to recognize the right digit in the image. Sometimes distracting digits are present next to the centered digits of interest. CIFAR10 is an extensively studied dataset containing ∼60k images, where each image represents one of the 10 classes of the dataset. Finally, CINIC10 is a relatively new dataset containing ∼90k images that tries to bridge the gap between CIFAR10 and ImageNet in terms of difficulty, using the same classes as CIFAR10 and a subset of merged images from both CIFAR10 and ImageNet.

A Appendix
At the beginning of our experiments, we randomly extract from the training sets a validation set of 6k images for each dataset. The images are normalized and, for CIFAR10 and CINIC10, we also apply an augmentation based on random horizontal flips. We do not use the random crop augmentation because, in some preliminary tests, it hurt the performance: a random crop can often isolate a portion of the image containing only the background. In this case, the memory retrieves similar examples based only on the background, pushing the network to learn useless shortcuts and degrading the performance.
The subsets of the training dataset used to train models with 1000, 2000, and 5000 samples are extracted randomly and change in every run. This means that we extract 15 different subsets of the dataset and then test all the configurations on these subsets. We fixed the seeds using the range (0, 15) to make the results reproducible.

A.1.2 Training details.
The implementation of the architectures for our encoders f(x) starts from the PyTorch implementations of Kuang Liu. To train the models, we follow the setup of Huang et al. [10], where models are trained for 40 epochs on SVHN and 300 epochs on CIFAR10. In both cases, we apply the Stochastic Gradient Descent (SGD) algorithm with a learning rate that starts from 1e-1 and decreases by a factor of 10 after 50% and 75% of the epochs. Note that this configuration is optimal neither for the baselines nor for Memory Wrap, and one can reach higher performance in both cases by choosing another set of hyperparameters tuned for each setting. However, it makes the comparison across different models and datasets fairly even. We ran our experiments using a cloud-hosted NVIDIA A100 and an RTX 3090.
Memory Set. Regarding memory samples, in an ideal setting one should provide a new memory set for each input during the training process; however, this makes both the training and the inference process slower due to computational limits. We simplified the process by providing a single memory set for each new batch. The consequence is that performance at testing/validation time can be influenced by the batch size used: a high batch size means a high dependency on the random selection. To limit the instability, we fix the batch size at test time to 500 and repeat the test phase 5 times, reporting the average accuracy across all repetitions.
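A minimal sketch of this batching scheme, assuming an index-based dataset (illustrative NumPy code with our own function name, not the authors' data pipeline):

```python
import numpy as np

def batches_with_memory(data, batch_size, memory_size=100, seed=0):
    """Yield (batch, memory_set) pairs with one shared memory set per batch.

    Sharing a single memory set across the whole batch, instead of drawing one
    per input, keeps the extra samples per step at m + n rather than m * n.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        batch = data[order[start:start + batch_size]]
        mem_idx = rng.choice(len(data), size=memory_size, replace=False)
        yield batch, data[mem_idx]
```

At test time, a larger fixed batch size (500 in the paper) reduces how much the accuracy depends on any single random memory draw.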
In this section, we describe and motivate the choice of the parameters of the last layer. In principle, we can use any function as the last layer. In some preliminary tests, we compared a linear layer against a multi-layer perceptron. We found that linear layers require lower learning rates (in the range [1e-2, 1e-4]) to work well in our settings. However, for the considered datasets and models, the standard configuration requires a decreasing learning rate that starts from high values. To make the comparison fair, we instead choose a multi-layer perceptron, which seems more stable and reliable at high learning rates. The choice of a linear layer remains appealing, because it makes it easier to inspect the contribution of each sample in the memory to the final prediction, and in principle one could obtain similar or higher results if the hyperparameters are suitably tuned.
We use a multi-layer perceptron containing only one hidden layer. The input dimension of this layer is dim(l_f) = 2·dim(e_{x_i}) for Memory Wrap, since dim(e_{x_i}) = dim(v_{S_i}), and dim(l_f) = dim(v_{S_i}) for the baseline that uses only the memory vector. The size of the hidden layer, dim(h_{l_f}), is a hyper-parameter that we fix by multiplying the input size by a factor of 2.

A.3 Attribution Methods.
As described in the paper, it is possible to use an attribution method to highlight the most important pixels of both the input image and the memory set with respect to the current prediction. The only requirement is that the attribution method must support multi-input settings. We use the implementation of Integrated Gradients [26] provided by the Captum library [12]. Note that one of the main problems of these attribution methods is the choice of the baseline [26]: it should represent the absence of information. In the image domain, it is difficult to choose the right baseline, because of the high variability of shapes and colors. We selected a white image as the baseline, because it is a common background in the SVHN dataset, but this choice has two effects: 1) it makes the heatmaps blind to the white color, which means, for example, that heatmaps for white numbers on a black background focus on the edges of the numbers instead of their inner parts; 2) it is possible to obtain a different heatmap by changing the baseline.
Table 7 shows the complete set of experiments for the computation of the explanation accuracy. Table 8 and Table 9 show the performance of GoogLeNet [27], DenseNet [10], and ShuffleNet [31] on both datasets. We can observe that the performance trend follows that of the other architectures.

In the following, we briefly describe the changes in computational cost introduced by Memory Wrap.
The increase in network size depends mainly on the output dimensions of the encoder and on the choice of the final layer. In Table 11 we examine the case of an MLP as the final layer and MobileNet, ResNet18, or EfficientNet as the encoder. We replace a linear layer of dimension (a, b) with an MLP with two layers of dimensions (a, 2a) and (2a, b), passing from a × b parameters to a × 2a + 2a × b. So the increment is mainly driven by the parameter a. A possible way to reduce the number of parameters would be to add a linear layer between the encoder and the Memory Wrap that projects the data into a lower-dimensional space, preserving the performance as much as possible. Regarding the space required for the memory, in principle we should provide a new memory set for each input during the training process. Let m be the size of the memory and n the batch size; the new input would then contain m × n samples in place of n. For large batch sizes and a large number of samples in memory, this cost can be too high. To reduce the memory footprint, we simplified the process by providing a single memory set for each new batch, keeping the space required to a more manageable m + n.
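The parameter counting above can be checked with a small helper. This is our own illustration; the example values (a = 2560, i.e., a MobileNet-v2 encoding of size 1280 concatenated with the memory vector, and b = 10 classes) are assumptions, and biases are ignored for simplicity.

```python
def memory_wrap_param_increase(a, b):
    """Weight counts of the final layer: linear (a, b) vs. 2-layer MLP with hidden size 2a.

    Replacing a linear layer of dimension (a, b) with an MLP of dimensions
    (a, 2a) and (2a, b) goes from a*b weights to a*2a + 2a*b.
    """
    linear = a * b
    mlp = a * (2 * a) + (2 * a) * b
    return linear, mlp
```

The quadratic a·2a term explains why the increment is driven mainly by a, and why projecting to a lower-dimensional space before Memory Wrap would shrink it.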

A.6.3 Time Complexity
Time complexity depends on the number of training samples included in the memory set. In our experiments, we used 100 training samples for each step as a trade-off between performance and training time; this doubles the training time due to the added gradients and the additional encoding of the memory set. However, in the inference phase, we can obtain nearly the same time complexity as the base model by fixing the memory set a priori and computing its encodings only once.
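Caching the memory encodings at inference can be sketched as follows (a hypothetical wrapper of our own, not part of the paper's code):

```python
import numpy as np

class FixedMemoryEncoder:
    """Encode a fixed memory set once and reuse the encodings at inference."""

    def __init__(self, encoder, memory_samples):
        self.encoder = encoder              # callable mapping a sample to its encoding
        self.memory_samples = memory_samples
        self._cache = None

    def memory_encodings(self):
        if self._cache is None:             # encode the memory set only the first time
            self._cache = np.stack([self.encoder(x) for x in self.memory_samples])
        return self._cache
```

After the first call, each inference step pays only for encoding the input, recovering roughly the base model's inference cost.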

A.7 Impact of Memory Size
The memory size is one of the hyper-parameters of Memory Wrap. We empirically chose a value (100) that is a trade-off between the number of samples per class (10), the minimum number of samples considered in the training set (1000), the training time, and the performance. The value is motivated by the fact that we want enough samples for each class to get representative samples of that class but, at the same time, we do not want the current sample to be frequently included in the memory set, a fact the architecture could exploit.
Increasing the number of samples can also increase performance (Table 12), but it comes at the cost of training and inference time. For example, an epoch of EfficientNetB0 trained on 5000 samples lasts ∼9 seconds when the memory contains 20 samples, ∼16 seconds when it contains 300 samples, and ∼22 seconds when it contains 500 samples. Table 13 shows the accuracy reached by the models on inputs where the sample in memory associated with the highest weight is a counterfactual. In these cases, the models seem unsure about their predictions, making many more mistakes than in standard settings. This behavior can be observed on ∼10% of the testing dataset.
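The per-batch sampling of a memory set of size m, described above as the m + n simplification, can be sketched as follows. The function name and sizes are illustrative assumptions; in the paper the samples are images drawn at random from the training set.

```python
# Hypothetical sketch: draw one random memory set of m training samples per
# batch, shared by all n inputs in the batch, so each step encodes m + n
# images rather than m * n.
import random

def sample_memory_set(train_set, m=100, seed=None):
    rng = random.Random(seed)
    return rng.sample(range(len(train_set)), m)  # indices into the train set

train_set = list(range(1000))   # stand-in for 1000 training images
batch = train_set[:128]         # n = 128 inputs in the current batch
memory_idx = sample_memory_set(train_set, m=100, seed=0)
print(len(memory_idx))          # 100 -> m + n = 228 images encoded this step
```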

Figure 1 :
Figure 1: Overview of Memory Wrap. The encoder takes as input an image and a memory set containing random samples extracted from the training set. The encoder sends their latent representations to Memory Wrap, which outputs the prediction, an explanation by example, and a counterfactual, exploiting sparse content attention between the input encodings.

Figure 3 :
Figure 3: Inputs (first rows), their associated predictions, and an overview of the samples in the memory set that have an active influence on the decision process, i.e. the samples on which the memory vector is built (second row).

Figure 5 :
Figure 5: Inputs from the CIFAR10 dataset (first rows), their associated predictions, and an overview of the samples in memory that have an active influence on the decision process, i.e. the samples from which the memory vector is built (second row).

Figure 6 :
Figure 6: Inputs from the SVHN dataset (first rows), their associated predictions, and an overview of the samples in memory that have an active influence on the decision process, i.e. the samples from which the memory vector is built (second row).

Figure 7 :
Figure 7: Inputs from the CINIC10 dataset (first rows), their associated predictions, and an overview of the samples in memory that have an active influence on the decision process, i.e. the samples from which the memory vector is built (second row).

Figure 8 :

Figure 9 :
Figure 9: Heatmaps computed by the Integrated Gradients method for both the current input and the most relevant samples in memory on the SVHN dataset.

Figure 10 :
Figure 10: Heatmaps computed by the Integrated Gradients method for both the current input and the most relevant samples in memory on the CINIC10 dataset.

Table 1 :
Avg. accuracy and standard deviation over 15 runs of the baselines and Memory Wrap, when the training dataset is a subset of SVHN. For each configuration, we highlight in bold the best result and results that are within its margin.

Table 2 :
Avg. accuracy and standard deviation over 15 runs of the baselines and Memory Wrap, when the training dataset is a subset of CIFAR10. For each configuration, we highlight in bold the best result and results that are within its margin.
ResNet18 Standard 40.03 ± 1.36 48.86 ± 1.57 65.95 ± 1.77
Only Memory 40.35 ± 0.89 51.11 ± 1.22 70.28 ± 0.80
Memory Wrap 40.91 ± 1.25 51.11 ± 1.13 69.87 ± 0.72

Table 4 :
Avg. accuracy and standard deviation over 15 runs of the baselines and Memory Wrap, when the training datasets are the whole SVHN and CIFAR10 datasets. For each configuration, we highlight in bold the best result and results that are within its margin.
Only Memory 95.82 ± 0.10 91.36 ± 0.24 81.65 ± 0.19
Memory Wrap 95.58 ± 0.06 91.49 ± 0.17 82.04 ± 0.16
ResNet18 Standard 95.79 ± 0.18 91.94 ± 0.19 82.05 ± 0.25

Table 5 :
Mean Explanation accuracy and standard deviation over 15 runs of the sample in the memory set with the highest sparse content attention weight.

Table 6 :
Accuracy reached by the model on images where the sample with the highest weight in the memory set is a counterfactual. The accuracy is computed as the mean over 15 runs using MobileNet-v2 as the encoder.

Table 7 :
Mean Explanation accuracy and standard deviation over 15 runs of the sample in the memory set with the highest sparse content attention weight.

Table 8 :
Avg. accuracy and standard deviation over 15 runs of the baselines and Memory Wrap, when the training dataset is a subset of SVHN. For each configuration, we highlight in bold the best result and results that are within its margin.

Table 11 :
Number of parameters for the models with and without Memory Wrap. The "dimension" column indicates the number of output units of the encoder.

Table 13 :
Accuracy reached by the model on images where the sample with the highest weight in the memory set is a counterfactual. The accuracy is computed as the mean over 15 runs.