Weakly Correlated Knowledge Integration for Few-shot Image Classification

Various few-shot image classification methods indicate that transferring knowledge from other sources can improve the accuracy of the classification. However, most of these methods work with one single source or use only closely correlated knowledge sources. In this paper, we propose a novel weakly correlated knowledge integration (WCKI) framework to address these issues. More specifically, we propose a unified knowledge graph (UKG) to integrate knowledge transferred from different sources (i.e., visual domain and textual domain). Moreover, a graph attention module is proposed to sample the subgraph from the UKG with low complexity. To avoid explicitly aligning the visual features to the potentially biased and weakly correlated knowledge space, we sample a task-specific subgraph from UKG and append it as latent variables. Our framework demonstrates significant improvements on multiple few-shot image classification datasets.


Introduction
Deep learning approaches have recently achieved impressive performance on image classification tasks. However, most of these approaches need large amounts of training data. Furthermore, they are hard to adapt to classifying samples from unseen classes when only a limited number of examples is available. The challenge of learning with limited labeled data is known as the few-shot learning problem. Because annotated data can be expensive to obtain, this challenge is gaining more attention from the automation community [1−4]. In this paper, we study the popular N-way K-shot image classification task among the few-shot learning problems. Many methods introduce external knowledge to address the problem of insufficient samples, most of which adopt textual domain knowledge from label descriptions [5−10]. In particular, some works (e.g., CADA-VAE [6], Soravit's method [7], and ReViSE [8]) align the features from the visual feature domain to the textual feature domain. Many of these methods are intended for datasets (e.g., animal with annotation [11] and CUB [12]) that provide highly correlated and structural textual descriptions.
However, few such methods apply to datasets that only provide weakly correlated descriptions, e.g., the Mini-ImageNet and Tiered-ImageNet datasets. In these datasets, the label descriptions are not strongly correlated with the visual properties of the corresponding classes, as shown in Fig. 1.
Other methods like LSFS [9] and MNE [13] exploit other information from different perspectives. For example, MNE [13] exploits information on the training set by keeping an episodic memory and fetches K nearest neighbors (KNN) to extend each sample in a task. LSFS [9] uses a hierarchical structure provided by the datasets. Here, LSFS still requires the dataset to provide an extra hierarchical annotation of different classes, while MNE does not utilize information in the label description. Integrating weakly correlated knowledge from different domain sources is still an open problem in the literature.
In this paper, we propose a weakly correlated knowledge integration (WCKI) framework which can leverage nonstructural and weakly correlated knowledge extracted from different sources (i.e., visual domain and textual domain) to improve the few-shot classification performance. An overview of our framework is shown in Fig. 2.
First, we propose a unified knowledge graph, which allows the integration of knowledge transferred from different domains. Unlike MNE [13], which models knowledge on the training set with a memory module with a hard-wired updating policy, the unified knowledge graph allows end-to-end optimization. Also different from [14,15], our method can integrate knowledge transferred from multiple domains. In this work, we adopt two commonly used knowledge domains: textual domain knowledge [16] and visual domain knowledge collected from historical training episodes [13,14]. Since the training set mainly consists of images from the visual domain and the model is trained to align the visual features of samples from the same classes, such knowledge is considered as visual domain knowledge. Second, our model utilizes a differentiable graph attention module to sample the more "relevant" knowledge for each task, which is shown to improve both accuracy and efficiency. This module helps to reduce the computational complexity and improve the task-relevancy of the transferred knowledge. Different from the rule-based KNN approach in MNE, our graph attention module is differentiable and thus trainable, leading to a fully end-to-end trainable framework.
Finally, we take the transferred knowledge as latent variables in our framework, like MNE [13], ARML [14], and [17], to avoid explicitly aligning sample features with weakly correlated transferred knowledge.
The contributions of this work are summarized as follows:
1) Proposing a weakly correlated knowledge integration framework which can transfer knowledge from multiple potentially biased sources to improve the few-shot image classification task.
2) Proposing a unified knowledge graph to represent and index transferred knowledge adaptively for each specific task.
3) Proposing a graph attention module for adaptively sampling transferred knowledge for each specific task to reduce computing complexity and improve the task-relevancy of knowledge.

Related works
The N-way K-shot problem is a commonly researched problem in the few-shot learning field. In this problem, each task consists of a support set S and a query set Q, and the model is supposed to produce a label prediction for each sample in Q as output. More specifically, S contains N×K labeled samples from N classes (K per class). Q contains N×K q samples drawn from the training set, with K q samples for each class in S. K q indicates the number of queries sampled for testing for each class. During the evaluation stage, a number of evaluation tasks are sampled from the testing set, and the average accuracy is used to measure the model performance.
Fig. 2 Overview of our framework. In our framework, knowledge extracted from different domains is modeled with the unified knowledge graph. For each specific task, we sample an "optimal" subgraph. We then "merge" it with the encoded sample features and use it as latent variables for the classifier to improve accuracy.
Graph-based methods such as [18−20] are widely applied in few-shot learning to better model inter-class relations and relieve the data insufficiency problem. These methods put support samples and query samples into one graph and infer the relation between query samples and support samples by processing the graph with a graph neural network. More specifically, TPN [20] propagates labels of each support sample to each query sample with a graph network. FGNN [18] and EGNN [19] model the input samples as graph nodes and the pairwise similarities as edges. The graph is updated by a graph network, and classification results are derived according to query-to-support edges. In this work, we adopt the second approach following EGNN [19].
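The task construction described at the start of this section can be illustrated with a minimal sketch. The function `sample_task`, its arguments, and the toy image identifiers are hypothetical and only assume that each class provides at least K + K q items; this is not the authors' data loader.

```python
import random


def sample_task(labels_to_items, n_way, k_shot, k_query, rng):
    """Sample one N-way K-shot task: a support set S and a query set Q."""
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for cls in classes:
        # Draw K + K_q distinct items so that S and Q never overlap.
        items = rng.sample(labels_to_items[cls], k_shot + k_query)
        support += [(x, cls) for x in items[:k_shot]]
        query += [(x, cls) for x in items[k_shot:]]
    return support, query


# Toy placeholder data: 20 classes with 30 item ids each.
toy = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
S, Q = sample_task(toy, n_way=5, k_shot=1, k_query=15, rng=random.Random(0))
```

With N = 5, K = 1, and K q = 15, the support set holds N×K = 5 samples and the query set holds N×K q = 75 samples.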
However, these approaches still face the challenge of insufficient information on novel classes. Different methods are proposed to transfer and utilize external knowledge to provide more information on unseen classes. Methods like [21] use Siamese networks that transfer knowledge from another potentially biased data source. Alternatively, methods like [8, 16] exploit semantic information of class labels. However, most of such methods explicitly align the semantic embedding of the text description of the label with visual features. The performance of these approaches highly depends on the quality of the label description [8] and the language model used to generate the semantic embedding. To address the problem, AM3 [15] learns a "convex combination" that acts as a gate to filter out potentially biased textual domain knowledge.
Other methods, such as MNE [13], Castle [22], ARML [14], and [17], adopt knowledge extracted from the extra training set instead of the textual domain to enhance the classification performance. These methods use transferred knowledge as latent variables instead of aligning it with visual features of input samples, which provides some extent of robustness against bias and noise. This latent variable approach also requires no explicit correspondence between transferred knowledge data and novel classes. However, they utilize only one knowledge source and do not exploit the information of the labels. In this work, we propose a method that combines two commonly used sources and adaptively samples the most relevant knowledge for each specific task.

Framework
In this paper, we propose a weakly correlated knowledge integration framework, as shown in Fig. 3. The proposed framework aims to alleviate the sample insufficiency by utilizing knowledge from weakly correlated sources. In the framework, the unified knowledge graph G k is proposed to adaptively integrate knowledge from different sources. To avoid introducing bias from the transferred knowledge, the transferred knowledge is used as latent variables instead of alignment references [16]. Further, a graph attention module is proposed, which adaptively samples a task-specific subgraph from G k to improve the relevance of the latent variable. More specifically, for each few-shot classification task, the encoder first encodes each image sample into a feature vector with a standard four-layer CNN [19,23,24]. Next, the observation graph G obs is constructed based on the embedding of each sample. The graph attention module then samples a task-relevant latent subgraph G lat from the unified knowledge graph G k according to the labels and embeddings of the samples in the support set. We then merge G lat and the observation graph G obs into the initial prediction graph G 0 pre. Like EGNN [19], G 0 pre is iteratively updated with a multi-layer edge-feature GNN. The prediction result is obtained according to the updated edges of G L pre. Like existing methods [18,19], we model each sample as a graph node. The similarities between samples are modeled as the edges of the graph. Query samples are classified according to their overall similarity to each class, which is computed by averaging the similarity to all support samples in the class. Key notations in this section and algorithm block diagrams of our framework are summarized in Appendix A.
Fig. 3 Pipeline of our method. Our framework integrates knowledge extracted from weakly correlated domains with a unified knowledge graph G k and adaptively uses a more relevant "subgraph" G lat as latent variables according to one specific classification task. More specifically, the input data is first encoded with a CNN encoder. Then, a latent subgraph G lat is sampled from G k for each specific task for better task relevancy. G lat is merged into the observation graph G obs, which is constructed according to the encoded labels and features. The combined graph G pre is updated with a multi-layer GNN.
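The classification rule described in this section (averaging the predicted similarity between a query and all support samples of each class) can be sketched as follows. The function name `classify_queries` and the toy similarity values are illustrative assumptions; in the framework the similarities would come from the updated edges of G L pre.

```python
def classify_queries(sim, support_labels, n_classes):
    """Assign each query to the class with the highest mean edge similarity.

    sim[q][s] is the predicted similarity (in [0, 1]) between query q
    and support sample s; support_labels[s] is the class of support s.
    """
    predictions = []
    for row in sim:
        scores = []
        for c in range(n_classes):
            vals = [v for v, lab in zip(row, support_labels) if lab == c]
            scores.append(sum(vals) / len(vals))
        predictions.append(max(range(n_classes), key=scores.__getitem__))
    return predictions


# Toy 2-way 2-shot example with two queries.
support_labels = [0, 0, 1, 1]
sim = [[0.9, 0.7, 0.2, 0.1],
       [0.3, 0.4, 0.8, 0.6]]
pred = classify_queries(sim, support_labels, n_classes=2)
```

In this toy example the first query has mean similarity 0.8 to class 0 and 0.15 to class 1, so it is assigned to class 0; the second query is assigned to class 1.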

Unified knowledge graph
The unified knowledge graph is proposed to integrate and model the knowledge transferred from potentially biased sources. In this paper, we adopt and integrate the knowledge extracted from the training set [13] and the glossary of all nouns in WordNet [16] . These two domains form two disjoint subgraphs of G k . To keep it concise, we denote these two sources as the visual domain knowledge and textual domain knowledge, respectively.
First, for the visual domain knowledge, slightly different from the memory module in MNE [13] , we use a trainable graph of N vis nodes shared among all training tasks, avoiding the non-differentiable procedure of updating the memory entries. Since the graph is shared and optimized for all training tasks, it can be interpreted as knowledge that provides information for classification tasks on training classes. As the training tasks mainly consist of visual information, this part is considered visual domain knowledge.
Second, we use another graph of N nlp nodes to model the textual domain knowledge extracted from WordNet glossaries. In more detail, we first use the GPT2 model to encode the label description into word vectors, which are then averaged into the corresponding label embedding.
Since the semantic embedding is extracted with a model trained on more data, we use principal component analysis (PCA) to project the semantic embedding to node features. Finally, we adopt a k-means clustering algorithm to reduce the number of nodes, where each cluster center corresponds to a node in the graph.
Since the textual domain model is trained on a much larger dataset, corresponding node features are locked during the training process.
Formally, each node i of G k has a feature $F_k(i) \in \mathbb{R}^{C_f}$ and is indexed by a key $K_k(i) \in \mathbb{R}^{C_k}$, where C f and C k are the channel numbers of the node features F k and the keys K k, respectively. The keys K k of all nodes are randomly initialized trainable tensors. The edges E k are initialized according to the cosine distance between F k (i) and F k (j).
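As a rough illustration of the edge initialization, the sketch below fills an edge matrix with the pairwise cosine similarity of node features. Whether the framework stores the similarity itself, the distance (one minus the similarity), or a rescaled value is not stated in the text, so using the raw similarity here is an assumption, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def init_edges(features):
    """Build a symmetric edge matrix from pairwise cosine similarity."""
    n = len(features)
    return [[cosine_similarity(features[i], features[j]) for j in range(n)]
            for i in range(n)]


# Three toy node features: nodes 0 and 2 point in the same direction.
F_k = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
E_k = init_edges(F_k)
```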

Graph attention module
The graph attention module shown in Fig. 4 is proposed to improve the relevance of the transferred knowledge by sampling a more relevant part for each specific task. This process also reduces the computational complexity by reducing the total number of nodes in the graph. More specifically, the projector module first encodes the support samples and labels into a task representation and decodes it into queries. Then, the graph sampler samples a task-specific latent subgraph G lat from G k according to the queries.
The projector module first summarizes textual features from support sample labels L t and support image embedding S t into the task feature T t. It then decodes T t into queries Q t for nodes in G lat, where each query Q t (i) corresponds to a sampled node in G lat, i.e., Q t (i) is the query for the i-th node in G lat. To generate the task feature T t, we first use the visual feature encoder T s to encode the support image embeddings and the textual encoder T l to encode the corresponding label information. Then, we sum these two types of extracted information into a feature T t, which represents the feature of a specific task:
$$T_t = T_s(S_t) + T_l(L_t) \tag{1}$$
The function T s encodes the visual features of all support samples in the task. Here, function T s first projects the feature vector of each sample from the support set into a latent space with a multi-layer perceptron (MLP), then averages all projected features into the visual "summary" of the task. Function T l encodes the label information of tasks: we first look up the corresponding nodes of the labels, then project the associated features into the summary space with another MLP and average them as the textual "summary".
Fig. 4 Illustration of graph attention module (input: support set features and labels). We first build the representation for each task by encoding features from support samples and their labels into the task feature T t using encoders T s and T l, respectively. T t is then decoded into queries Q t for nodes in the subgraph. Then, a task-specific latent subgraph G lat is obtained by sampling nodes and corresponding edges from the unified knowledge graph according to Q t.
C. Yang et al. / Weakly Correlated Knowledge Integration for Few-shot Image Classification
To sample a task-specific subgraph from G k, we use N lat different decoders to decode the task feature T t into N lat queries, each corresponding to a node in the latent subgraph G lat. G lat is sampled by applying attention on G k with Q t as the query, K k as the key, and F k and E k as the values. In this attention, c is a constant factor that controls the tendency towards one-hot, and δ is the sigmoid function that restricts the range of elements in E k. We merge the latent subgraph G lat and the observed graph G obs into the initial prediction graph G 0 pre via the following process. We first concatenate the node features of G obs and G lat with the concatenation operator ||. For edges, we keep the edges within G obs and within G lat, and fill the missing edges between the two graphs with 0.5, the value that indicates unknown similarity in EGNN [19]. In the edge formula, I is the identity matrix and 1 denotes a matrix whose elements are all ones.
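The sampling and merging steps above can be sketched in a few lines. The exact form of the attention is not recoverable from the text, so the sketch assumes M t = softmax(c · Q t K kᵀ) applied row-wise, latent features M t F k, and latent edges M t δ(E k) M tᵀ; these formulas, the value c = 10, and all function names are assumptions made for illustration rather than the authors' implementation.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_subgraph(queries, keys, feats, edges, c=10.0):
    """Soft-sample a latent subgraph; a larger c pushes rows of M towards one-hot."""
    logits = matmul(queries, transpose(keys))
    M = [softmax([c * v for v in row]) for row in logits]
    F_lat = matmul(M, feats)
    squashed = [[sigmoid(e) for e in row] for row in edges]
    E_lat = matmul(matmul(M, squashed), transpose(M))
    return M, F_lat, E_lat


def merge_graphs(F_obs, E_obs, F_lat, E_lat, unknown=0.5):
    """Concatenate nodes; keep edges inside each graph, fill cross edges with 0.5."""
    n_obs, n_lat = len(F_obs), len(F_lat)
    nodes = F_obs + F_lat
    top = [E_obs[i] + [unknown] * n_lat for i in range(n_obs)]
    bottom = [[unknown] * n_obs + E_lat[i] for i in range(n_lat)]
    return nodes, top + bottom


# Toy knowledge graph with 3 nodes and 2 queries (all values are placeholders).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
edges = [[0.0] * 3 for _ in range(3)]
queries = [[1.0, 0.0], [0.0, 1.0]]
M, F_lat, E_lat = sample_subgraph(queries, keys, feats, edges)
nodes, E_pre = merge_graphs([[0.0, 0.0]], [[1.0]], F_lat, E_lat)
```

Because every operation is a matrix product or an element-wise function, the sampler stays differentiable when written in an automatic differentiation framework, which is the property the module relies on.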
In our graph attention module, G lat is a "relaxed subgraph" of G k, and it is mathematically a subgraph of G k when the following three conditions on the attention matrix M t are met. First, M t is a binary matrix, i.e.,
$$M_t(i,j) \in \{0, 1\} \tag{5}$$
Second, a node in the latent subgraph should correspond to one and only one node in the original graph, i.e.,
$$\sum_{j=1}^{N_k} M_t(i,j) = 1 \tag{6}$$
where N k is the size of the unified knowledge graph G k.
The third condition is that each node of the unified knowledge graph should be either sampled only once or not sampled. When (5) holds, this condition reduces to requiring that nodes in G k are not sampled more than once, i.e.,
$$\sum_{i=1}^{N_{lat}} M_t(i,j) \le 1 \tag{7}$$
To make the subgraph sampling process differentiable, our method (2) relaxes the above conditions, making our "subgraph" a generalized case of the subgraph in the sense of discrete mathematics. Moreover, since we use softmax activation on the rows of M t, condition two, i.e., (6), always holds.
Condition one, i.e., (5), will not be strictly met, but can be considered approximately met because the rows of M t tend to be one-hot. This tendency is due to the gradient property of the softmax function σ. To simplify the representation, we discuss the cases i = j and i ≠ j separately, i.e.,
$$\frac{\partial \sigma(x)(i)}{\partial x(j)} = \begin{cases} \sigma(x)(i)\,\bigl(1 - \sigma(x)(i)\bigr), & i = j \\ -\sigma(x)(i)\,\sigma(x)(j), & i \ne j \end{cases} \tag{8}$$
The gradient of the softmax function is close to 0 when its output is close to one-hot, i.e., when the largest element of σ(x) is close to 1 and the other terms are consequently close to 0.
For condition three, we use a regularization term that reduces the pairwise node similarity of the subgraph to enforce a close approximation. Having repeated nodes in G lat is the only case where condition three is violated while conditions one and two are satisfied. To avoid this situation, the regularization term enlarges the pairwise distance among node features in G lat. Thus, we consider G lat a generalized subgraph of G k obtained by slightly relaxing conditions one and three.
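The three conditions can be checked numerically on a given attention matrix. The sketch below is only an illustration: it assumes that row i of M t holds the attention of the i-th latent node over all nodes of G k, and the function name and tolerance are hypothetical.

```python
def subgraph_conditions(M, tol=1e-6):
    """Return (binary, one_source_per_latent_node, no_node_sampled_twice)."""
    # Condition one: every entry is (numerically) 0 or 1.
    binary = all(min(abs(v), abs(v - 1.0)) <= tol for row in M for v in row)
    # Condition two: every row sums to one.
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in M)
    # Condition three: every column sums to at most one.
    cols_ok = all(sum(col) <= 1.0 + tol for col in zip(*M))
    return binary, rows_ok, cols_ok


strict = subgraph_conditions([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
repeated = subgraph_conditions([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
relaxed = subgraph_conditions([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])
```

The second matrix samples the same node twice, so only condition three fails; the third matrix is a softmax-like relaxation in which condition two holds exactly and condition one holds only approximately.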

Optimization
To improve the efficiency of the model, we add a regularization term L sub to the module. L sub is applied to increase the diversity of nodes in G lat. We enforce the diversity by increasing the pairwise cosine distance of node features. This regularization term also helps to enforce condition three, so that the latent subgraph G lat is close to a strict subgraph of G k. This is because repeated nodes lead to larger losses, as identical vectors always have the largest cosine similarity. We adopt the edge classification loss L cls in EGNN [19] for classification and apply it to each layer of the graph neural network. Note that edges connected to latent nodes do not contribute to the classification loss, and w is set to [0.5, 0.5, 1] following [19].
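One plausible form of the diversity regularizer is the mean pairwise cosine similarity of the latent node features, which is minimized when the nodes are spread apart. The exact definition of L sub is not recoverable from the text, so the averaging and the function names below are assumptions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def subgraph_loss(feats):
    """Mean cosine similarity over all distinct pairs of latent nodes (needs >= 2 nodes)."""
    n = len(feats)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(feats[i], feats[j])
            pairs += 1
    return total / pairs


loss_repeated = subgraph_loss([[1.0, 2.0], [1.0, 2.0]])   # identical nodes
loss_diverse = subgraph_loss([[1.0, 0.0], [0.0, 1.0]])    # orthogonal nodes
```

Repeated nodes give the largest possible value, so minimizing this term discourages sampling the same node of G k more than once.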
We also adopt a semantic branch [25] to further improve the performance. Slightly different from the original work, our implementation performs classification on all labels in a batch instead of the whole training set to reduce computation. The module takes the label and visual features as input and produces a classification loss L sem .
The final objective function is the weighted sum of the classification loss L cls, the semantic branch loss L sem, and the subgraph loss L sub, i.e.,
$$L = L_{cls} + \lambda_{sem} L_{sem} + \lambda_{sub} L_{sub} \tag{12}$$
In our experiment, λ sub and λ sem are empirically set to 0.1, as these two are regularization terms and hence less important than L cls.

Datasets
In this section, we use three datasets to validate the proposed framework. More specifically, we use Mini-ImageNet and Tiered-ImageNet, which provide label annotations that are less correlated with visual traits. Mini-ImageNet is a subset of ImageNet with 100 classes, consisting of 600 images per class. The dataset is split into the training set (64 classes), the validation set (16 classes), and the testing set (20 classes) [26]. Tiered-ImageNet is also a subset of ImageNet with 608 classes, and each class contains 600 images. Different from Mini-ImageNet, the Tiered-ImageNet dataset has structural information in label annotation.
The classes are categorized into 34 more general classes. The splitting of this dataset is also based on the general classes. The training set has 20 general classes, the validation set has 6 general classes, and the testing set has 8 general classes. We also use the CUB-2011 dataset, which provides detailed annotations closely related to visual traits. CUB-2011 contains images of 200 different bird species. The dataset is split into the training set (100 classes), the validation set (50 classes), and the testing set (50 classes) [12].

Implementation details
Our implementation is based on the EGNN [19] code base, which uses the PyTorch framework. For comparison with the latest methods, we also trained a heavier model on Mini-ImageNet using the pre-trained ResNet12 backbone from FEAT [27]. We locked the weights of the pre-trained ResNet12 backbone to prevent overfitting.
We adopt two popular protocols used in the evaluation. For the first protocol, 600 random tasks are sampled from the testing set, where each task contains 15 query samples per class. We also adopt the one query protocol [28] , where only one query image is used for each class in a task. In this protocol, we sample 50 000 queries in 10 000 tasks to evaluate the performance of our model. For both protocols, the average accuracy of all evaluation tasks is used as a performance metric.
During training, we train our framework for 100 000 iterations on the training set for all three datasets. We use the Adam solver [29], and the learning rate is initially set to $10^{-3}$. The learning rate is halved every 15 000 iterations for Mini-ImageNet and every 30 000 iterations for Tiered-ImageNet. We validate the model on the validation set every 5 000 iterations and select the best model for testing. The batch size is set to 18 due to the limit of hardware resources. For the CUB dataset, we adopt the setup of Tiered-ImageNet with different batch sizes according to available hardware resources.
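The learning rate schedule described above can be written as a step decay. The sketch assumes the rate is halved exactly at each multiple of the decay interval; the function name and default arguments are illustrative.

```python
def learning_rate(iteration, base_lr=1e-3, decay_every=15_000):
    """Step decay: halve the base learning rate once per `decay_every` iterations."""
    return base_lr * 0.5 ** (iteration // decay_every)


# Mini-ImageNet setting (decay every 15 000 iterations).
mini_start = learning_rate(0)
mini_after_first_decay = learning_rate(15_000)
# Tiered-ImageNet setting (decay every 30 000 iterations) at the last iteration.
tiered_final = learning_rate(100_000, decay_every=30_000)
```

Under these assumptions, the Tiered-ImageNet run would be halved three times over 100 000 iterations, and the Mini-ImageNet run six times.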

Ablation study
For simplicity, we use the second training and evaluation protocol (using one query per class) in this section. We perform ablation studies on each module against the baseline method. The experimental results of the ablation study on the Mini-ImageNet dataset are shown in Table 1. In Table 1, GAM denotes the graph attention module. "Textual" indicates knowledge transferred from the label description, and "Visual" indicates knowledge transferred from the training set. We use base to indicate the baseline method, and the other experiments are named with three characters following the specified rule: the first character indicates the knowledge domains, the second indicates the graph sampler configuration, and the third indicates whether the semantic branch is used.
The baseline method, indicated with a gray background in Fig. 3, includes a few implementation changes. Comparing base and A00, we observe that using transferred knowledge as latent variables can benefit classification performance. The experiments VV0 vs. AV0 and VAD vs. AAD validate the effectiveness of the textual domain in the unified knowledge graph. Comparing A00 and AV0 shows the effectiveness of the graph attention module, and comparing AV0 and AA0 shows the effectiveness of introducing textual information into the graph attention module.
To sum up, each proposed module effectively increases the classification accuracy, and the whole method shows a significant improvement over the baseline method by improving the accuracy on most classes. The improvement can be seen in the intuitive comparison of tasks sampled from the testing set in Fig. 5. Statistical changes for each class are shown in Fig. 6. This result indicates that our model is generally effective for most classes. To break down the improvement shown in Fig. 6 into more details, we visualize the delta between the confusion matrices of the baseline method and our method in Fig. 7. The delta suggests that the improvement comes primarily from reduced misclassification among different clusters. The evidence is that most non-diagonal squares (indicating misclassification across clusters) are dominated by purple cells. This observation suggests that our method is able to utilize textual domain knowledge to improve discrimination between most "general" classes (k-means centers).
Another observation is that our model slightly increased the misclassification between cluster 4 (dogs) and cluster 5 (mostly large mammals). This is likely because these two clusters are semantically very close to each other. Note that the only two classes with a performance drop come from these two clusters. This shows that the "noise" in the textual domain may not be completely avoided, which is also a future topic for our research. It is also noticed that a large margin between validation and testing sets can be identified, mostly because the testing set has a more skewed data distribution in both textual and visual domains. We will also leave this as a future topic.
Fig. 5 In each task, the predicted similarity between samples (ranging from 0 to 1) is shown in color intensity, where zero similarity leads to a black square. Here, blue means a "correct" response, and red indicates a "wrong" response. Pictures on the top row are query samples, while those on the left-most column are support labels.

Few-shot classification
We perform experiments mainly on two different datasets: Mini-ImageNet and Tiered-ImageNet. Following the conventional way [13,24] , we train and evaluate our method with both 5-way 1-shot setup and 5-way 5-shot setup. We also train our model both with and without the transduction, which allows us to offer a fair comparison to popular methods. Transduction [20] indicates that the relations among testing samples are exploited. Methods known to be using such tricks will be marked with "(BN)" or "(T)", where "(BN)" means the model uses the task statistic [19] during evaluation, and "(T)" means more sophisticated approaches are applied. Methods involving transduction can be sensitive to the total number of queries in each task. Therefore, we also list this factor in corresponding tables. We also observe that some works may have different performance with different reimplementation and training/evaluation policies. In such cases, we will use the results from the original paper if not specified.
On the Mini-ImageNet dataset, our method shows competitive performance with and without transduction when compared with many popular few-shot classification methods. The results are shown in Table 2. As the evaluation protocol and training methods may vary between different works, we give more details on the comparisons of results. More specifically, we list the number of queries in each task, the backbone used, and whether or not the validation set is used for each method. Note that the number of queries per class only affects the transduction setup. Our method performs better than many state-of-the-art methods in transduction settings. It also produces promising results in non-transduction settings in both 5-way 1-shot and 5-way 5-shot problems. In particular, our method generally leads to an accuracy improvement of 2%−3% over EGNN [19] in all setups.
On the Tiered-ImageNet dataset, we perform experiments using the transductive setups (QPC = 1 and QPC = 15), and the results are listed in Table 3. Our method also demonstrates promising performance compared to many popular methods. In more detail, our method enjoys better performance with the 1-shot setup with both protocols, i.e., around 4% higher than the TPN [20] . On the 5-shot setup, our method′s accuracy is slightly lower than the TPN method [20] by 0.3%. The results indicate that our framework can effectively utilize the transferred knowledge.
We also conduct experiments on the CUB-2011 dataset to validate the generalization ability on fine-grained classification tasks. Our model demonstrates competitive performance (presented in Table 4) with a simple Conv4 backbone. Our method outperforms Antreas's method [42] by 10% in terms of accuracy on 1-shot tasks. In 5-shot tasks, our method again achieves a 6% improvement in accuracy. This is possibly due to the better correlation between the annotation and the visual features. These experiments demonstrate the effectiveness of our proposed framework in few-shot image classification tasks on different datasets and with different protocols. Our framework obtains reasonably good performance with the transductive evaluation, where information among queries can be exploited, and attains promising performance without transduction. These results validate that the proposed framework can utilize weakly correlated knowledge from different sources (e.g., the visual domain and the textual domain) to reach promising and robust performance on different datasets.

Computational complexity
Our proposed framework is not significantly more expensive than the baseline EGNN model in terms of computational complexity. The extra cost comes from two parts: the size increase of graph G pre caused by the auxiliary latent subgraph, and the newly introduced graph attention module. Intuitively, the first part is not large since the latent subgraph is generally small, as we control the size of the latent subgraph G lat with the proposed graph attention module to avoid huge graphs for the GNN. As for the graph attention module, the projector is a network much smaller than the encoder, and the graph sampler also has low complexity.
In more detail, the additional computational complexity is derived as follows. For the additional cost caused by G pre, the complexity changes from (13) to (14), and the delta is the difference between (14) and (13). Because |G lat| is always smaller than |G obs| in this paper, the extra complexity in this module is just a constant factor less than 3.
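The bound of 3 can be checked with a short calculation. The sketch below assumes that the cost of one GNN layer grows quadratically with the number of nodes, since every pair of nodes carries an edge, and it omits the feature dimensions; it is a sanity check under that assumption, not a reproduction of (13) and (14).

```latex
\begin{align*}
\Delta &= \bigl(|G_{obs}| + |G_{lat}|\bigr)^2 - |G_{obs}|^2
        = 2\,|G_{obs}|\,|G_{lat}| + |G_{lat}|^2 \\
       &< 2\,|G_{obs}|^2 + |G_{obs}|^2 = 3\,|G_{obs}|^2
        \qquad \text{whenever } |G_{lat}| < |G_{obs}| .
\end{align*}
```

For example, with 10 observed nodes and 5 latent nodes, the pair count grows from 100 to 225, an increase of 125, which is below 3 × 100.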
For the graph attention module, the projector is much smaller than the CNN encoder. Therefore, we focus on the graph sampler, whose cost has three parts: computing the attention, which depends on C k, the dimension of keys in G k; sampling nodes; and sampling edges (18). In this work, we have a small G k due to the concern of training variance and the magnitude of the gradient in the attention module. Hence, this part is also considerably lightweight. Moreover, the only quadratic term of |G k| in our entire framework, which appears in (18), does not include the feature dimension C f. This property offers our framework further potential to work with a large knowledge base G k without losing the relationship E k while maintaining a reasonable speed.

Textual domain
Quantitative results (Table 1) have shown that the textual domain provides useful information. However, intuitively, the descriptions are not always highly correlated with the visual traits of the classes, which can be observed in Fig. 1. This is also supported by the visualization of the k-means clustering results shown in Fig. 8. We can see that classes may or may not have intuitive common visual properties when the description vectors are semantically close, e.g., the samples in cluster 5 are mostly dogs, while cluster 2 contains many things that are visually different. We can also observe that not all clusters are interpretable, indicating that the description and the GPT2 encoder may introduce bias. For these two reasons, we decided to use the transferred knowledge as latent variables rather than applying direct distance constraints.
Table 2 Comparative results on Mini-ImageNet. The method * indicates results from code and model released by the authors of EGNN [19].

Conclusions
To address the insufficient data problem in few-shot image classification tasks, we propose a weakly correlated knowledge integration framework. In the proposed framework, we use a unified knowledge graph to integrate knowledge from different domains into one feature space where relations among different domains are modeled with corresponding edges. The proposed graph attention module adaptively improves both the effectiveness and efficiency of our framework. The ablation studies show that each module is effective on few-shot learning tasks. Our framework also demonstrates promising results on different datasets.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Appendix A
This paper contains many notations. To make them clearer, we list the important ones in this lookup table (Table A.1). For each notation, we list the section where it is defined. The text in the bracket may help locate it faster, e.g., S3 (1+) means the notation is defined in the latter half of the first paragraph in Section 3. We also provide the algorithm block diagrams shown in Fig. A.1 for the training and evaluation process to make the whole framework clearer.

Table A.1 Key notations (recoverable rows only)
Notation | Defined in | Description
G lat | S3.1 (2) | The latent subgraph, which is an optimal subgraph of G k with regard to task t. Used as a latent variable in the framework.
T t | S3.2 (1+) | "Summary" of task t, i.e., a combination of outputs from T s () and T l ().
[symbol missing] | S3.2 (1+) | The i-th decoder mapping task summary T t to query Q t (i) for the i-th node in G lat for task t.

Appendix B
In this appendix, we provide more details on model dynamics and hyper-parameter sensitivity. For model dynamics, we show the curve of each loss term of our method (AAD in Table 1) during the 5-way 1-shot training process in Fig. B.1. All loss terms drop as training proceeds. Oscillations in L sem and L cls are possibly due to the noisy ImageNet dataset. L sub indicates that the average pairwise distance of node features in the latent subgraph is properly controlled. The sensitivity to different loss weights in (12) is also analyzed and shown in Table B.1.