Match Them Up: Visually Explainable Few-shot Image Classification

Few-shot learning (FSL) approaches are usually based on the assumption that pre-trained knowledge can be obtained from base (seen) categories and transferred well to novel (unseen) categories. However, there is no guarantee of this, especially for the transfer. This issue makes the inference process of most FSL methods opaque, which hampers their application in risk-sensitive areas. In this paper, we present a new way to perform FSL for image classification, using visual representations from the backbone model together with weights generated by a newly emerged explainable classifier. The weighted representations only include a minimum number of distinguishable features, and the visualized weights can serve as an informative hint for the FSL process. Finally, a discriminator compares the representations of each pair of images from the support set and the query set. The pairs with the highest scores determine the classification results. Experimental results show that the proposed method achieves both good accuracy and satisfactory explainability on three mainstream datasets.


Introduction
Few-shot learning (FSL) is of great significance for at least the following two scenarios [43]: Firstly, FSL can ease the heavy demands of data gathering and labeling, which can boost the ubiquitous use of deep learning techniques, especially for users without sufficient resources. Secondly, FSL is an important solution for applications in which rare cases matter or image acquisition is costly because of high operational difficulty or ethical issues. Typical examples of such applications include computer-assisted diagnosis with medical imaging, classification of endangered species, etc.
There have been many FSL methods, most of which are based on the assumption that knowledge can be well extracted from base (seen) classes and transferred to novel (unseen) classes. However, this is not always the case. The knowledge in a pre-trained backbone convolutional neural network (CNN), which computes features of an input image, may sometimes be useless when the novel categories have significant visual differences from the images of the base categories [46]. What makes matters worse is that we have no way to see whether the visual differences between the base and novel categories are significant for an FSL model. This raises an essential question: Is there any way to see what is actually transferred? In fact, most FSL works only treat the convolutional layers as an image embedding tool, and do not pay attention to the reasons behind the extracted features. In this paper, we redesign the mechanism of knowledge transfer for FSL tasks, which offers an answer to the above question. Our approach is inspired by what human beings do when trying to recognize a rarely seen object: we usually try to find some patterns in the object and match them against a small number of seen examples in our memory. We adopt a recently emerged explainable classifier, called SCOUTER [20], and propose a new FSL method, named the match-them-up network (MTUNet), consisting of a pattern extractor (PE) and pairwise matching (PM).
PE is designed to find discriminative and consistent patterns for image representation. The knowledge transferred from the base categories to the novel categories is the learned patterns. Owing to the explainability of SCOUTER, the extracted patterns themselves can be easily visualized by exemplifying them in images, as shown in Figure 1(a). This directly means that we have a way to see what is actually transferred in our FSL pipeline. The patterns extracted in each of the support and query images are aggregated to form a discriminative image representation (the overall attention), which is used for matching. As shown in Figure 1(b), the visualization of the aggregated patterns collectively shows a consistent and meaningful clue for images of the same category. For example, PE shows strong attention on the neck of the goose in the second column, which is consistent in both the support and query images. An image representation based on the patterns from the base categories makes matching between a pair of images much easier by incorporating only a small number of regions to pay attention to.
On top of PE, PM is adopted to judge whether an image pair belongs to the same category or not. Each pair consists of one image from the support set and one image from the query set. The category of the support image that gives the highest score is regarded as the query image's category. Together with PE, MTUNet can provide a matching score matrix to further relate the visualization to the model's decision.
The main contributions of our work include:

Few-shot Learning
Recently, deep neural networks have achieved outstanding performance in various classification tasks, thanks to the availability of a sufficient number of images for each category. Such large datasets usually require a lot of effort to create, and some tasks, such as medical ones [27,7], may not inherently have enough supervisory signals. For these tasks, we need a new paradigm that allows learning a model with a small number of labeled images. The popular FSL tasks [39,35,23] serve as a testbed for certain aspects of such small tasks. Recent efforts toward FSL are summarized as follows.
Image Embedding and Metric Learning. Many works focus on how to transform images into vectors in an embedding space, in which the distance between a pair of vectors represents their conceptual dissimilarity. An early approach uses Siamese networks [18] as a shared feature extractor to produce image embeddings for both support and query images; the weighted $\ell_1$ distance is used as the classification criterion. Some use a multi-layer perceptron (MLP) to parameterize and learn a classifier [17,10,37]. Metric learning can offer a better way to train the mapping into the embedding space [39,35]. Some works try to improve the discriminatory power of the image embeddings; SimpleShot [41] applies $\ell_2$ normalization and centering to make the distance calculation easier.
Meta-learning. Another major approach to FSL is to optimize models so that they can rapidly adapt to new tasks; adapting the feature extractor to new tasks at novel test time is beneficial. Fine-tuning transfer-learned networks [45] fine-tunes the feature extractor using the task-specific support images. MAML [8] and its extensions [29,25] train a set of initialization parameters such that, through one or more steps of gradient adjustment from the initial parameters, the model can be easily adapted to a new task with only a small amount of data. Besides training a good parameter initialization, Meta-SGD [21] also trains the parameter update direction and step size.
Data Augmentation. Data augmentation aims at introducing invariance into models at both the image and feature levels [30,4]. Some works use samples that are weakly labeled or unlabeled [6,26]. ICI [42] introduces a judgment mechanism to increase the number of training samples; it is always worth increasing the training set by utilizing unlabeled data with confidently predicted labels. In general, solving an FSL problem by augmenting $D_\text{train}$ is straightforward and easy to understand.
Transductive or Semi-supervised Paradigm. Transductive or semi-supervised approaches [15,5] have gained popularity and made great progress in the past few years. They use the statistics of query examples or statistics across the few-shot tasks, which assumes that all novel images are available beforehand. We only employ the original inductive paradigm to explore explainable feature extraction, but our idea can easily be adapted to the transductive paradigm.

Explainable AI
Deep neural networks are considered a black-box technology, and explainable artificial intelligence (XAI) is a series of attempts to unveil them. Most XAI methods for classification tasks are based on back-propagation [40,34,33,3] or perturbation [32]. All these methods are post-hoc and cannot be built into the model structure during training.
A few works [36,11,16] have tried XAI for FSL tasks. Geng et al. [11] use a knowledge graph to make explanations for zero-shot tasks. Sun et al. [36] adopt layer-wise relevance propagation (LRP) [1] to explain the output of a classifier. StarNet [16] realizes visualization through heat maps derived from back-projection. Recently, a new type of XAI, coined SCOUTER [20], has been proposed, which applies the self-attention mechanism [38] to the classifier. This method extracts the discriminative attention for each category during training, which makes the classification results explainable. We apply this technique to FSL tasks in order to explore a new explainable FSL paradigm; PE provides insights into why an input image is classified into a certain novel category.

Problem Definition
This paper addresses an inductive FSL task (cf. the transductive one [5,15]), in which we are given two disjoint sets $D_\text{base}$ and $D_\text{novel}$ of samples. The former is the base set, which includes categories ($C_\text{base}$) with many labeled images. The latter is the novel set, which includes categories ($C_\text{novel}$) with only a few labeled images. $C_\text{base}$ and $C_\text{novel}$ are disjoint. The FSL task is to find a mapping from a novel image $x$ to its corresponding category $y$.
The literature typically uses the $K$-way $N$-shot episodic paradigm for training/evaluating FSL models. For each episode, we sample two subsets of $D_\text{base}$ for training, namely, a support set $S = \{(x_i, y_i) \mid i = 1, \dots, K \times N\}$ and a query set $Q = \{(x^q_i, y^q_i) \mid i = 1, \dots, K \times M\}$. These images are of the same $K$ categories in $C_\text{base}$, and we sample the same number of images per category ($N$ for the support set and $M$ for the query set). An FSL model is trained so that it can find a match between images in $Q$ and $S$ (with abuse of notation). An image in $Q$ is classified as the category of the matched image in $S$.
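For concreteness, the episode construction can be sketched in a few lines of Python. The flat list-of-(image, label) dataset layout and the function name are illustrative assumptions, not the authors' data pipeline.

```python
import random
from collections import defaultdict

def sample_episode(dataset, K=5, N=1, M=15):
    """Sample one K-way N-shot episode from a list of (image, label) pairs."""
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)
    classes = random.sample(sorted(by_class), K)      # K categories
    support, query = [], []
    for episode_label, c in enumerate(classes):
        chosen = random.sample(by_class[c], N + M)    # N support + M query images
        support += [(x, episode_label) for x in chosen[:N]]
        query += [(x, episode_label) for x in chosen[N:]]
    return support, query                             # |S| = K*N, |Q| = K*M
```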

Overview
The overall process is illustrated in Figure 2. In each episode, we extract feature maps $F = f_\theta(x) \in \mathbb{R}^{c \times h \times w}$ from each image $x$ in both $S$ and $Q$ using a backbone convolutional neural network $f_\theta$, where $\theta$ is the set of learnable parameters. $F$ is then fed into the pattern extractor (PE) module $f_\phi$ with learnable parameter set $\phi$. This module gives attention $A = f_\phi(F) \in \mathbb{R}^{z \times l}$ over $F$. Our pairwise matching (PM) module uses an MLP to compute the score of a query image $x^q \in Q$ belonging to the category of each support image in $S$.
PE plays a major role in the FSL task. PE is designed to learn a transferable attention mechanism, which amounts to finding common patterns that are shared among different episodes sampled from $D_\text{base}$. Consequently, the patterns are also shared among $D_\text{novel}$, given that $D_\text{base}$ and $D_\text{novel}$ are from similar domains.

Pattern Extractor
SCOUTER is originally designed as an explainable classifier whose decision is directly based on the existence of certain learned patterns in an image [20]. It is built upon the self-attention mechanism [38] to efficiently find common patterns in images of a certain category. In the context of FSL, we extend this idea to find common patterns that efficiently differentiate given sets of categories even across different episodes. The presence of a certain combination of these learned patterns gives a strong clue about the category of the image, facilitating classification even of novel categories. We implement our PE on top of SCOUTER.
The basic idea of PE is to find common patterns through the self-attention mechanism. The input feature maps $F$ are first fed into a $1 \times 1$ convolution layer followed by a ReLU nonlinearity to squeeze the dimensionality of $F$ from $c$ to $d$. We flatten the spatial dimensions of the squeezed features to form $F' \in \mathbb{R}^{d \times l}$, where $l = hw$. To maintain the spatial information, a position embedding $P$ [44,22,20] is added to the features, i.e., $\tilde{F} = F' + P$.
The self-attention mechanism gives the attention over $\tilde{F}$ along the spatial dimension by the dot-product similarity between a set of $z$ learned patterns $W \in \mathbb{R}^{z \times d}$ ($z$ is the number of patterns) and $\tilde{F}$, after nonlinear transformations $g_Q$ and $g_K$. PE repeats this process, updating the patterns with a gated recurrent unit (GRU) to refine the attention. That is, given $W^{(t)}$ for the $t$-th repetition, the attention is computed with a certain normalization function $\xi$ as

$$A^{(t)} = \xi\big(g_Q(W^{(t)})\, g_K(\tilde{F})\big) \in \mathbb{R}^{z \times l}, \qquad (3)$$

and the patterns $W^{(t)}$ are updated $T$ times (i.e., $t = 1, \dots, T$) by

$$U^{(t)} = A^{(t)} \tilde{F}^\top, \qquad (4)$$

$$W^{(t+1)} = \mathrm{GRU}\big(U^{(t)}, W^{(t)}\big). \qquad (5)$$

PE adopts a different normalization strategy from the original SCOUTER. Let $\mathrm{Softmax}_\mathrm{R}(X)$ and $\sigma(X)$ be the softmax over the respective row vectors of matrix $X$ and the sigmoid, respectively. SCOUTER normalizes the attention map only over the flattened spatial dimensions, i.e.,

$$\xi'(X) = \mathrm{Softmax}_\mathrm{R}\big(\sigma(X)\big). \qquad (6)$$

This allows finding multiple patterns in a single image. MTUNet further modulates this map by

$$\xi(X) = \xi'(X) \odot \mathrm{Softmax}_\mathrm{R}\big(\sigma(X^\top)\big)^\top, \qquad (7)$$

which suppresses weak attention over different patterns at the same spatial location, where $\odot$ is the Hadamard product. This enforces the network to find more specific yet discriminative patterns with fewer correlations among them, and thus ends up with more pinpoint attention. With this modulation, the learned patterns can be more responsive across different images, as an attention map only responds to a single pattern that does not include its peripheral region. The input image is finally described by the overall attention corresponding to the extracted patterns, given by

$$V = \big(\mathbf{1}_z A^{(T)}\big) \odot F, \qquad (8)$$

where $\mathbf{1}_z$ is the row vector with all $z$ elements being 1, and $A^{(T)}$ is reshaped so that the dimension $l$ recovers the same spatial structure as $F$. $V$ then undergoes average pooling over the spatial dimensions, keeping only the channel dimension $c$.
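The recurrence above can be sketched in PyTorch as follows. This is a minimal illustration under our own reading of Eqs. (3)-(8): the layer sizes, the forms of $g_Q$ and $g_K$, and the omission of the position embedding are assumptions, and the official SCOUTER/MTUNet code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatternExtractor(nn.Module):
    """Minimal sketch of PE, assuming single-linear g_Q/g_K and no position embedding."""
    def __init__(self, c, d=64, z=7, T=3):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(c, d, kernel_size=1), nn.ReLU())
        self.patterns = nn.Parameter(torch.randn(z, d))   # z learned patterns W
        self.g_q = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.g_k = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.gru = nn.GRUCell(d, d)
        self.T = T

    def forward(self, feat):                              # feat: (B, c, h, w)
        B, _, h, w = feat.shape
        f = self.squeeze(feat).flatten(2).transpose(1, 2)     # (B, l, d), l = h*w
        W = self.patterns.unsqueeze(0).expand(B, -1, -1)      # (B, z, d)
        for _ in range(self.T):
            X = self.g_q(W) @ self.g_k(f).transpose(1, 2)     # (B, z, l)
            A = F.softmax(torch.sigmoid(X), dim=-1) \
                * F.softmax(torch.sigmoid(X), dim=1)          # Eqs. (6)-(7)
            U = A @ f                                         # Eq. (4)
            W = self.gru(U.reshape(-1, U.size(-1)),           # Eq. (5)
                         W.reshape(-1, W.size(-1))).view(B, -1, U.size(-1))
        overall = A.sum(dim=1).view(B, 1, h, w)               # 1_z A^(T), reshaped
        V = (overall * feat).mean(dim=(2, 3))                 # Eq. (8) + spatial pooling
        return V, A
```

Note how the Hadamard product of the two softmaxes keeps a location only where a pattern responds strongly both among locations and among patterns, which yields the pinpoint attention described above.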

Pairwise Matching
An FSL classification can be solved by finding the membership of the query to one of the given support images. Some FSL methods use metric learning [35,39] to find matches between the query and the supports; the cosine similarity or the $\ell_2$ distance are typical choices [12,41]. Learnable distances are another popular choice for metric learning-based FSL methods [17,10,37]. We use a learnable distance with an MLP. Let $V^q$ be the features of query image $x^q \in Q$ and $V_{kn}$ those of support image $x_{kn} \in S$, where the subscripts $k$ and $n$ stand for the $n$-th image of category $k$. For $N > 1$, the average over the $N$ images is taken to generate the representative feature $\bar{V}_k$; otherwise (i.e., $N = 1$), $\bar{V}_k = V_{k1}$. For computing the membership score $s_k$ of query image $x^q$ to category $k$, we use an MLP $f_\gamma$ with learnable parameters $\gamma$:

$$s_k(x^q, S_k) = f_\gamma\big([V^q, \bar{V}_k]\big), \qquad (9)$$

where $[\cdot,\cdot]$ is the concatenation of two vectors for the one-to-one pair and $S_k \subset S$ contains the images of category $k$. $x^q$ is classified into the category with the maximum $s_k$ over $S_k$ for $k = 1, 2, \dots, K$.
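A sketch of the PM head under these definitions is below; the hidden width and the final sigmoid (so that $s_k$ can act as a probability in the loss of Eq. (10)) are assumptions.

```python
import torch
import torch.nn as nn

class PairwiseMatching(nn.Module):
    """Learnable distance f_gamma: scores a concatenated (query, support) pair."""
    def __init__(self, c, hidden=256):                 # hidden width is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * c, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())        # membership score in (0, 1)

    def forward(self, v_query, v_support):
        # v_query: (c,); v_support: (K, N, c) PE features of the support set
        v_bar = v_support.mean(dim=1)                  # representative feature per class
        pairs = torch.cat([v_query.expand_as(v_bar), v_bar], dim=-1)  # (K, 2c)
        return self.mlp(pairs).squeeze(-1)             # s_k for k = 1, ..., K
```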

Training
After pre-training the backbone CNN $f_\theta$, we first train the PE module in the same way as [20]; the area loss facilitates finding compact patterns. For this training, we further split $D_\text{base}$ into two subsets, $D_\text{base,T}$ and $D_\text{base,V}$. The former contains 90% of the images of each category and the latter contains the rest. We sample $z$ categories from $D_\text{base}$ and use the images of these categories in $D_\text{base,T}$ for training. The images of the same sampled categories in $D_\text{base,V}$ are used for validation. With these sampled categories, the training finds discriminative patterns together with our attention map modulation in Eq. (7).
We then train MTUNet with the backbone and the PE module fixed. For $Q$ and $S$ sampled from $D_\text{base}$ for each episode, we train $f_\gamma$ with the binary cross-entropy loss:

$$L = -\sum_{(x^q, y^q) \in Q} y^{q\top} \log s(x^q, S), \qquad (10)$$

where $s(x^q, S) = \big(s_1(x^q, S_1), \dots, s_K(x^q, S_K)\big)^\top$.
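In code, this loss can be sketched as below; treating all $K$ scores of a query as independent binary decisions (i.e., including the negative term of the BCE) is our assumption for a runnable example.

```python
import torch.nn.functional as F

def episode_loss(scores, query_labels):
    """scores: (Q, K) sigmoid outputs s_k(x^q, S_k); query_labels: (Q,) in [0, K)."""
    targets = F.one_hot(query_labels, scores.size(1)).float()  # one-hot y^q
    return F.binary_cross_entropy(scores, targets)
```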

Experimental Setup
Following the majority of the literature, we evaluate MTUNet on 10,000 episodes of 5-way classification, created by first randomly sampling 5 categories from $D_\text{novel}$ and then sampling support and query images of these categories with $N = 1$ or $5$ and $M = 15$ per category. We report the average accuracy over the 10,000 episodes (each with $K \times M = 75$ queries) and the 95% confidence interval.
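The reported interval is the usual normal-approximation one over per-episode accuracies; a small helper (ours, for illustration) makes the computation explicit.

```python
import numpy as np

def mean_and_ci95(episode_accuracies):
    """Average accuracy and 95% confidence interval over episodes."""
    acc = np.asarray(episode_accuracies, dtype=float)
    mean = acc.mean()
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))  # normal approximation
    return mean, ci95
```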
We employ two CNN architectures often used for FSL tasks as our backbone $f_\theta$, namely WRN-28-10 [47] and ResNet-18 [14]. For ResNet-18, we remove the first two down-sampling layers and change the kernel of the first $7 \times 7$ convolutional layer to $3 \times 3$. We use the hidden vector of the last convolutional layer after ReLU as the feature maps $F$, where the numbers of channels are 512 and 640 for ResNet-18 and WRN-28-10, respectively.
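A sketch of this modification on top of torchvision's ResNet-18 is below; the exact layers the authors removed may differ slightly, but dropping the stem stride and the max-pool reproduces the $10 \times 10$ feature maps mentioned later for $80 \times 80$ inputs.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Backbone(nn.Module):
    """ResNet-18 trunk with the 7x7 stem replaced by 3x3 and the first two
    down-sampling steps (stem stride, max-pool) removed (a sketch)."""
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        net.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)   # no max-pool
        self.body = nn.Sequential(net.layer1, net.layer2, net.layer3, net.layer4)

    def forward(self, x):
        return self.body(self.stem(x))    # feature maps F: (B, 512, h, w)

f_theta = Backbone()
print(f_theta(torch.randn(2, 3, 80, 80)).shape)   # torch.Size([2, 512, 10, 10])
```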
As noted in [20], pre-training of the backbone CNNs is important for our PE module. We adopt a distance-based strategy similar to SimpleShot [41]: we train the backbone CNNs with all images in $D_\text{base}$. The performance of a simple nearest-neighbor-based method is then evaluated over $D_\text{val}$ with 2,000 episodes of the 5-way FSL task, and the best model is adopted. The learning rate for this pre-training starts at $10^{-3}$ and is divided by 10 every 20 epochs.

Table 1. Average accuracy of 10,000 sampled 5-way tasks on the test set of mini-ImageNet. † Results reported in [41]. ‡ Results reported in [2].
For training the whole MTUNet, the learnable parameters in the backbone CNN and PE are frozen. In a single epoch of training, we sample 1,000 episodes of 5-way tasks. The model is trained for 20 epochs with an initial learning rate of $10^{-3}$, which is divided by 10 at the 10th epoch. We use the model with the best performance over 2,000 episodes sampled from $D_\text{val}$.
Our model is implemented with PyTorch. AdaBelief [48] is adopted as the optimizer. Input images are resized to $80 \times 80$ and go through data augmentation including random flips and affine transformations, following [41]. A GPU workstation with two NVIDIA Quadro GV100 (32 GB memory) GPUs is used for all experiments.
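A sketch of this setup is below. AdaBelief is assumed to come from the third-party adabelief-pytorch package, the affine-transformation ranges are illustrative, and the stand-in module only exists to make the snippet runnable.

```python
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR
from torchvision import transforms
from adabelief_pytorch import AdaBelief  # third-party package (assumed)

pm_head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))  # stand-in
optimizer = AdaBelief(pm_head.parameters(), lr=1e-3)              # initial LR 10^-3
scheduler = MultiStepLR(optimizer, milestones=[10], gamma=0.1)    # /10 at epoch 10

train_transform = transforms.Compose([
    transforms.Resize((80, 80)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),    # assumed ranges
    transforms.ToTensor(),
])
```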

Results
MTUNet is compared with state-of-the-art (SoTA) FSL methods. We exclude those in the semi-supervised and transductive paradigms, which use the statistics of the novel set across different FSL tasks. We also do not adopt any post-processing methods such as the $\ell_2$ normalization in [41].
We report our best model by randomly sampling 10,000 1-shot and 5-shot tasks over $D_\text{test}$ in Tables 1-3 for the three datasets. We select the categories for PE pre-training by sampling every 10th category from the base set category list. The results demonstrate that MTUNet outperforms or is comparable with SoTA methods.
The architecture of the backbone CNN affects the performance: the variants with WRN always give better performance than those with ResNet-18. Aside from the difference in network architecture, the size of the feature maps may be one of the factors. For mini-ImageNet, the WRN variants have $20 \times 20$ feature maps, while the ResNet-18 variants have $10 \times 10$ (an $80 \times 80$ input is down-sampled by a factor of 4 in WRN-28-10 but by 8 in our modified ResNet-18). Such larger feature maps not only provide more information to the PM module but also give a better basis for patterns, as higher resolutions may help find more specific patterns.

Explainability
In addition to its classification performance, MTUNet is designed to be explainable in two different aspects. Firstly, MTUNet's decision is based on certain combinations of learned patterns. These patterns are localized in both query and support images through $A^{(T)}$, which can be easily visualized. This visualization offers intuition on the learned patterns and how much these patterns are shared between the query and support images. Secondly, thanks to the one-to-one matching strategy formulated as a binary classification problem in Eq. (9), the distributions (or appearances) of the learned patterns in query and support images give a strong clue about MTUNet's matching score $s$.
Pattern-based visual explanation. MTUNet's decision is based on the learned patterns; i.e., it is solely based on how much the shared patterns (or features) appear in both a query and a support. This design in turn means that, by pinpointing each pattern in the images, we can obtain an intuition behind the decision made by the model. This can be done by merely visualizing $A^{(T)}$.
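Visualization reduces to upsampling a row of $A^{(T)}$ to the image resolution and overlaying it; the following is a minimal sketch, where the normalization scheme and blending weight are our choices.

```python
import torch
import torch.nn.functional as F

def overlay_attention(image, attn_map, alpha=0.5):
    """image: (3, H, W) tensor in [0, 1]; attn_map: (h, w) row of A^(T)."""
    H, W = image.shape[1:]
    heat = F.interpolate(attn_map[None, None], size=(H, W),
                         mode='bilinear', align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # rescale to [0, 1]
    overlay = (1 - alpha) * image + alpha * heat[None]             # broadcast to channels
    return overlay.clamp(0, 1)
```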
Figures 3(a) and (b) respectively show a pair of support and query images in a 5-way task in mini-ImageNet. The pairs (a) and (b) are of the categories lock and horizontal bar, respectively. The second column shows the visualization of the averaged attention, given by the overall attention in Eq. (8). The third to ninth columns are the visualizations of the regions corresponding to the learned patterns in $A^{(T)}$ (i.e., the $i$-th row vector of $A^{(T)}$ represents the appearance of the $i$-th learned pattern at the respective spatial locations). For (a), with category lock, the support image is a small golden combination lock used for storage cabinets or post boxes. Among all 7 patterns, only pattern 5 shows a strong response, whereas the others are not observed. We can see that pattern 5 pays attention to the discs of the lock. It also gives a strong response to the words on the left, which show similar morphological characteristics. The query image of (a) is a black combination lock often used for bicycles. The attention maps show almost the same distribution as for the support: only pattern 5 responds, on the discs. From these visualizations, we can infer that pattern 5 represents periodic changes in colors. Although these two locks have different functions, MTUNet finds a shared pattern between them.
For (b), the support image is the gymnast wearing red. Multiple patterns are observed in the image. We can see that the visualization of pattern 1 identifies a part of the human body (the head), and pattern 3 appears around the hands grabbing the horizontal bar. The query image is the gymnast in blue. Patterns 1 and 3 respond in a similar way to the support image. Patterns 4 and 5 appear in the background and around other parts of the body, but their responses are relatively weak compared to patterns 1 and 3. Patterns 1 and 3 may be responsible for human heads and hands grabbing the horizontal bar, leading to the successful classification of novel categories.

Matching-score-based explanation. Among all pairwise combinations in the matching score matrix (Figure 4), the combination of the support and query images of catamaran gives the full score (100%). The visualization of the overall attention covers the hulls, and especially the masts, in both images, which are the main characteristics of this category, explaining the high score. The category goose gets a low matching score. The query is a close-up of a goose on the ground from its front side, which captures the goose's black head and beak. The support image is an overall view of a goose about to fly, seen from behind; the visualization of the overall attention captures the leg. With this combination, finding a shared pattern may not be easy, although the two extracted patterns are both representative parts of a bird. This problem stems from the difference in viewing angles, which can be relieved in 5-shot tasks, as they give more supports from different viewing angles. Surprisingly, the query image for goose gets 81% for the support image for beetle. This may suggest that one of the patterns responds to black regions and that this pattern is solely used as the clue for goose. This is a negative result for FSL tasks but clearly demonstrates MTUNet's explainability regarding the relationship between visual patterns and the pairwise matching scores.

Discussion
The number $z$ of patterns. The number of patterns can be another crucial factor for MTUNet. Intuitively, a larger $z$ makes the model more discriminative. To show the impact of $z$, we uniformly sample categories in $C_\text{base}$ (i.e., sampling every $I$-th category from the category list, where $I = 10, 8, 7, 5, 4, 3, 2$, and $1$); thus, $I = 1$ ends up using all categories in $C_\text{base}$.
The test accuracies are shown in Figure 5 for 5-way 1-shot and 5-way 5-shot tasks in 10,000 sampled episodes over $D_\text{test}$ of the three datasets. The horizontal axis represents the number of patterns and the vertical axis the average accuracy. Interestingly, the results show no clear tendency with respect to $z$. The performance slightly decreases on mini-ImageNet with a larger $z$, whereas it slightly increases on CIFAR-FS. For tiered-ImageNet, when setting $I$ to 1 and using all 351 base categories for patterns, the PE module cannot be trained successfully; this situation is also reported in [20]. The other settings show no obvious differences. In general, tuning $z$ may help gain performance, but its impact is not significant.
Selection of categories for training PE. Our PE module is supposed to learn common visual patterns. We use the images of a certain subset of categories in $C_\text{base}$ to learn such patterns in our experiments, so the selection of this subset affects the performance of downstream FSL tasks. To clarify the impact of this choice, we randomly sample seven categories in $C_\text{base}$ of mini-ImageNet 50 times and use the corresponding images for training PE on top of ResNet-18. Each trained PE is used for training MTUNet, which is evaluated over 2,000 episodes of FSL tasks with both the validation and test sets.
The mean and the 95% confidence interval over the 50 test accuracies are 54.63% and 0.16%, respectively. This implies that our model can benefit from a better choice of categories for training PE. For this choice, we only have access to the validation set; however, since the validation set and the test set have disjoint categories, the best choice for the validation set is not necessarily the best choice for the test set. Figure 6 shows the scatter plot of the validation accuracies and the corresponding test accuracies over the 50 different random samples of seven categories. The plot empirically shows that the validation and test accuracies are highly correlated, with a Pearson's correlation coefficient of 0.64. This leads to the conclusion that, at least for mini-ImageNet, we can use the validation set to find a better choice. The green square in the plot is the choice adopted in our experiments, which is a better choice but not the best.

Conclusion
In this paper, we proposed MTUNet, designed for explainable FSL. We achieved comparable performance on three benchmark datasets and qualitatively demonstrated the model's strong explainability through patterns in images. The approach taken in our model may be analogous to what human beings do, as we usually try to find shared patterns when matching images of an object we have never seen before. This is advantageous, as the explanation given by MTUNet can provide an intuitive interpretation of what the model actually does. Our future work includes testing our model in practical application scenarios of FSL, such as computer-assisted diagnosis.
Match Them Up: Visually Explainable Few-shot Image Classification (Supplementary Material)

1. Selection of Categories for Training PE
We further provide results for seven randomly sampled categories in $C_\text{base}$ of CIFAR-FS and tiered-ImageNet, sampled 50 times and 20 times, respectively. ResNet-18 is used as the backbone. MTUNet with each trained PE is evaluated over 2,000 episodes of FSL tasks with both the validation and test sets.
The results are shown in Figures 1 and 2. For CIFAR-FS, the mean and the 95% confidence interval over the 50 test accuracies are 69.29% and 0.14%, respectively, and Pearson's correlation coefficient is 0.53. For tiered-ImageNet, the mean and the 95% confidence interval over the 20 test accuracies are 61.58% and 0.20%, respectively, and Pearson's correlation coefficient is 0.81. These results show that the performances over the validation and test sets are strongly correlated, implying that we can use the validation set to find a better choice not only for mini-ImageNet but also for the other datasets.

2. Explainability
We provide visualizations of patterns for 4 randomly sampled 5-way 1-shot tasks with a single query image over mini-ImageNet. The pattern-based visualizations (Figures 3-6) and the pairwise matching scores (Figures 7-10) are shown for samples 1-4, respectively. We provide some discussion on the respective samples below.

Sample 1

By observing the matching matrix in Figure 7, we find two confusing categories, lock and carton; each gets a high score for the other's category. The visualization in Figure 3 shows that pattern 5 is responsible for both the letters (or the face of the character) on the carton and the discs of the lock. We would say that the letters and the discs share similar structures, which causes the confusion.

Sample 2
The pairwise matching scores in this sample find proper matches except for poncho. In Figure 8, the poncho image in the support is a baby girl wearing a poncho, while the image in the query is just a black poncho against a white background. The query image for poncho yields high scores for the support images of poncho, skirt, and beetle. The highest score, for beetle, may be due to the black color. Interestingly, the support and query images for skirt show attention over the door behind the person but not over the skirt itself. This is a good example of the importance of explanation for FSL.

Sample 3
In Figure 5, we find that both the query and the support give attention to the body of the goose, but the differences in perspective and in the number of objects may make matching difficult. As a result, the query goose gets low scores for all support images. This also happens to carton in this sample.

Sample 4
In Figure 10, the query hound shows high matching scores to hound, goose, catamaran, and skirt. Through Figure 6, we find that the query hound contains a hound as well as people, who also appear in the supports for skirt and catamaran. This means that some learned patterns cover people, and the people in the query hound lead to high matching scores against skirt and catamaran. The inclusion of different objects often causes prediction failure.

Figure 2. Overall structure of MTUNet. One query is processed by the CNN backbone and the pattern extractor (PE) to provide exclusive patterns, which are then turned into an overall attention. The query is concatenated with each support to make a pair for final discrimination through pairwise matching (PM). The dotted lines represent that each support image undergoes the same computation as the query.

Figure 3. Visualization of each pattern and the averaged features for a sampled task in mini-ImageNet. (a) is the category lock and (b) is horizontal bar. Overall is the overall attention among all patterns. The third to ninth columns are the visualizations of the regions corresponding to the learned patterns.

Figure 4. Matching score matrix of one sampled task in mini-ImageNet. Rows and columns are accompanied by the overall attention visualizations for the support and query images of each category.

Figure 5. Results of pattern number settings for mini-ImageNet, tiered-ImageNet, and CIFAR-FS. The horizontal axis represents the number of patterns and the vertical axis the average accuracy. We report all results with 10,000 sampled 5-way episodes on the novel test set.

Figure 6. Scatter plot of the validation and test accuracies over 50 random samples of seven categories for training PE on mini-ImageNet.
Mini-ImageNet consists of 100 categories sampled from ImageNet with 600 images per category. These images are divided into the base ($D_\text{base}$), novel validation ($D_\text{val}$), and novel test ($D_\text{test}$) sets with 64, 16, and 20 categories, respectively, where both $D_\text{val}$ and $D_\text{test}$ correspond to $D_\text{novel}$ in Section 3.1. The images in mini-ImageNet are of size $84 \times 84$. Tiered-ImageNet consists of 608 ImageNet classes divided into 351 base classes, 97 novel validation classes, and 160 novel test classes; it contains 779,165 images of size $84 \times 84$. CIFAR-FS is a dataset with images from CIFAR-100, whose 100 categories are divided into 64 base, 16 novel validation, and 20 novel test categories, with images of size $32 \times 32$.

Table 3. Average accuracy of 10,000 sampled 5-way tasks on the test set of CIFAR-FS.