KGTN-ens: Few-Shot Image Classification with Knowledge Graph Ensembles

We propose KGTN-ens, a framework extending the recent Knowledge Graph Transfer Network (KGTN) in order to incorporate multiple knowledge graph embeddings at a small cost. We evaluate it with different combinations of embeddings in a few-shot image classification task. We also construct a new knowledge source - Wikidata embeddings - and evaluate it with KGTN and KGTN-ens. Our approach outperforms KGTN in terms of the top-5 accuracy on the ImageNet-FS dataset for the majority of tested settings.


Introduction
Deep learning has made a substantial impact on a number of industrial and research areas. This includes computer vision, where the rapid development of representation learning started with the seminal work of Krizhevsky, Sutskever and Hinton (2017) on the image classification task. However, numerous state-of-the-art models require large amounts of data to train, which can be costly to gather and label - especially for vision-related tasks. Therefore, an intense research effort can be observed in the area of data-efficient machine learning methods. Few-shot learning (often abbreviated as FSL) is a machine learning task in which the model is (partially) trained on a small amount of data - part of the labelled data is available in standard amounts, whereas the other part consists of only a few (typically fewer than 10) samples per class. Few-shot learning can also suffer from selection bias, since the decision boundaries need to be adjusted to a few new samples, which can contain irrelevant and misleading artefacts (such as a background colour). Hence the learning process is substantially more challenging.
One way to tackle the few-shot learning task is to use some prior knowledge of the labelled data. Knowledge Graph Transfer Network (KGTN), the recent work of Chen, Chen, Hui, Wu, Li and Lin (2020), solves this problem by learning class prototypes from external sources of knowledge and comparing them against features extracted from an input image. A similarity function scores the output of these two components and yields the class probability distribution. These external sources of knowledge are represented as class correlation matrices. A vital element of this architecture is the knowledge graph transfer module (KGTM), which tries to learn class prototypes from knowledge graph embeddings using gated graph neural networks (GGNN) (Li, Zemel, Brockschmidt and Tarlow, 2016).
In the KGTN approach, one has to select a single knowledge source. Inspired by ensemble learning approaches, this observation leads us to the following questions: is it possible to learn prototypes from multiple knowledge graph embeddings? If so, will it result in higher values of performance metrics, such as accuracy for classification problems? Therefore, we propose KGTN-ens, an extension of KGTN that uses multiple embeddings instead of a single one. Each of them generates different prototypes, which are later combined and compared against the output of the feature extractor. We test two ensemble learning techniques in this paper. We also evaluate different combinations of three knowledge graphs, one of which (based on Wikidata) is introduced by us and has not been used in the original paper. Our solution is knowledge graph agnostic, provided that the knowledge graph is embedded and linked to the classes used in the image classification task.
The contribution of this paper is two-fold: (1) we propose KGTN-ens, a new method based on KGTN, and evaluate it with different combinations of embeddings; (2) we construct a new knowledge source - Wikidata embeddings - and evaluate it with KGTN and KGTN-ens. Our approach outperforms KGTN in terms of the top-5 accuracy on the ImageNet-FS dataset for the majority of tested settings.
The remainder of this paper is organised as follows. A comprehensive literature survey on related work is presented in Section 2. Section 3 provides a description of the KGTN-ens architecture. Section 4 describes the results of the evaluation of Wikidata embeddings with KGTN and of KGTN-ens with different combinations of embeddings, along with a detailed analysis and ablation studies. Section 5 concludes the paper.

Related work
This section provides a comprehensive overview of the related work. We start with a brief review of the techniques used in graph neural networks, which are at the core of the KGTN-ens architecture. Then, we provide a short survey of recent advancements in few-shot learning, which is the main machine learning task solved by the architecture presented in this paper.
Graph neural networks. In general, graph neural networks (GNNs) represent a type of neural network that processes the specified attributes of graphs. Tasks tackled by GNNs can be either node-level (such as prediction of a property for each node), edge-level (prediction of a property for each edge), or graph-level (prediction of a property for a whole graph) (Sanchez-Lengeling, Reif, Pearce and Wiltschko, 2021). Following Keriven and Peyré (2019), a crucial feature of GNNs is being either invariant or equivariant to permutations. That is, for a graph $G$, a network $f$ and a permutation $\Pi$, we have $f(\Pi \star G) = f(G)$ for invariance and $f(\Pi \star G) = \Pi \star f(G)$ for equivariance. The general-purpose models from the state-of-the-art family of transformer architectures (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin, 2017) can be viewed as a special instance of a graph neural network. Graph neural networks have a wide area of applications, with notable examples in biology (e.g. protein interface prediction) or social networks (e.g. community detection or link prediction). A less obvious application of GNNs is in the field of image classification, where they are used to learn prototypes from knowledge graph embeddings in a few-shot learning setting.
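To make the invariance/equivariance distinction concrete, the following sketch (our own illustration, not taken from any cited work) checks both properties for a toy one-layer GNN: the node-level map is permutation equivariant, while its sum-pooled readout is permutation invariant. The weight matrix and graph are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # arbitrary fixed layer weights

def node_map(A, X):
    # one GNN layer: neighbour mixing followed by a nonlinearity
    return np.maximum(A @ X @ W, 0.0)          # permutation *equivariant*

def readout(A, X):
    # sum pooling over nodes discards node ordering
    return node_map(A, X).sum(axis=0)          # permutation *invariant*

A = rng.integers(0, 2, size=(5, 5)).astype(float)
A = (A + A.T) / 2                              # symmetric adjacency
X = rng.normal(size=(5, 4))                    # node features

P = np.eye(5)[rng.permutation(5)]              # permutation matrix Pi
A_p, X_p = P @ A @ P.T, P @ X                  # the permuted graph Pi * G

assert np.allclose(readout(A_p, X_p), readout(A, X))        # f(Pi*G) = f(G)
assert np.allclose(node_map(A_p, X_p), P @ node_map(A, X))  # f(Pi*G) = Pi*f(G)
```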
GNNs fall into the broader category of geometric deep learning, which is devoted to the application of deep neural networks on structured non-Euclidean domains, such as graphs, manifolds, meshes, or grids (Bronstein, Bruna, LeCun, Szlam and Vandergheynst, 2017). Gilmer, Schoenholz, Riley, Vinyals and Dahl (2017) proposed message passing, which is one of the most important concepts in GNNs. In this approach, nodes and/or edges rely on their neighbours in order to iteratively create meaningful embeddings. Wu, Pan, Chen, Long, Zhang and Philip (2020) classify GNNs into four broad categories: recurrent GNNs (RecGNNs), convolutional GNNs (ConvGNNs), graph autoencoders (GAEs), and spatial-temporal GNNs (STGNNs). In this article, Gated Graph Neural Networks (GGNNs) (Li et al., 2016), which belong to the category of RecGNNs, are of special interest. For a fair comparison with KGTN (our baseline), we used GGNN in our experiments. Following Li et al. (2016), the intuitive difference between the two is that GNNs rely more directly on the explicit graph structure, which gives them greater generalisation capabilities, making GGNNs the less general model.
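The message-passing idea can be sketched in a few lines: every node repeatedly aggregates its neighbours' states and blends them into its own, so information propagates along edges over iterations. The mean aggregation and the 0.5/0.5 update rule below are assumptions for illustration, not the scheme of any cited model.

```python
import numpy as np

def message_passing(A, H, T=3):
    # A: adjacency matrix, H: node states; run T rounds of propagation
    deg = A.sum(axis=1, keepdims=True).clip(min=1.0)
    for _ in range(T):
        messages = (A @ H) / deg        # mean over each node's neighbours
        H = 0.5 * H + 0.5 * messages    # simple blending update (assumed)
    return H

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])            # path graph: node 0 - node 1 - node 2
H0 = np.array([[1., 0.],
               [0., 0.],
               [0., 1.]])               # only the endpoints carry features
H = message_passing(A, H0)
# after a few rounds, the middle node has mixed features from both ends
```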
Few-shot learning. While being very effective for numerous vision tasks, one of the main problems with convolutional neural networks (and machine learning in general) is the amount of data they need to provide meaningful predictions. More recent architectures, such as self-attention models, require even more data to train. In contrast, humans typically require only a few samples to acquire knowledge of seen objects. One way to tackle this issue is few-shot learning, which is aimed at learning from scarce data. The complexity of the problem often stems from the required sudden shift of decision boundaries, which is hard to achieve using only a few samples. A special case of few-shot learning is one-shot learning, i.e. learning from one labelled sample per class.
Following Song, Wang, Mondal and Sahoo (2022), few-shot learning methods can be divided into data augmentation, transfer learning, meta-learning, and multimodal learning. Data augmentation techniques aim to artificially extend the amount of available data by transforming either input data (Chen, Fu, Wang, Ma, Liu and Hebert, 2019a) or the resulting features (Chen, Fu, Zhang, Jiang, Xue and Sigal, 2019b). Transfer learning focuses on reusing features from networks trained on different datasets with the required amount of data, using techniques such as pre-training and fine-tuning or domain adaptation. Meta-learning includes techniques devoted to learning from data and tasks in order to reuse this knowledge for future downstream tasks. Finn, Abbeel and Levine (2017) proposed MAML, a model-agnostic meta-learning algorithm. Specialised approaches to meta-learning include neural architecture search (Elsken, Metzen and Hutter, 2019) or metric learning (Ge, 2018; Chicco, 2021). Finally, multimodal learning focuses on the incorporation of external knowledge from heterogeneous domains, such as text, speech or knowledge graphs (Wang, Yue, Liu, Tian and Wang, 2020).
The concept of prototypes was introduced in the work of Snell, Swersky and Zemel (2017), who proposed prototypical networks focused on learning a metric space between class instances and their prototypes. Hariharan and Girshick (2017) used representation regularisation and introduced the concept of hallucinations in order to enlarge the number of available representations during training. Wang, Girshick, Hebert and Hariharan (2018) employed meta-learning techniques and combined them with the aforementioned hallucinations to improve few-shot classification metrics. A growing number of scholars incorporate structured knowledge into their computer vision research (Monka, Halilaj and Rettinger, 2022). For instance, Li, Luo, Lu, Xiang and Wang (2019) studied transferable features with a hierarchy which encodes the semantic relations. Their approach turned out to be applicable to the problem of zero-shot learning as well. Shen, Brbic, Monath, Zhai, Zaheer and Leskovec (2021) proposed a model-agnostic regularisation technique in order to leverage the relationship between graph labels to preserve category neighbourhoods.
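The prototype idea of Snell et al. (2017) can be sketched as follows: a class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. This is a minimal illustration with made-up 2-D "embeddings", not the cited implementation.

```python
import numpy as np

def prototypes(support, labels, n_classes):
    # one prototype per class: the mean of that class's support embeddings
    return np.stack([support[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(query, protos):
    # nearest prototype in Euclidean distance wins
    d = np.linalg.norm(protos - query, axis=1)
    return int(d.argmin())

support = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels, 2)
print(classify(np.array([0.2, 0.3]), protos))   # -> 0
print(classify(np.array([5.4, 5.1]), protos))   # -> 1
```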

Method
This section explains the details of KGTN-ens. The method extends the KGTN architecture proposed by Chen et al. (2020), which relies on graph-based knowledge transfer to yield state-of-the-art results in few-shot image classification. The most important difference lies in the usage of multiple graphs instead of a single one, which enables the usage of different knowledge sources. Each of these graphs generates different prototypes, which are later combined and compared against the output of the feature extractor. It might not be immediately obvious why the approach with multiple knowledge graphs is used, as they could be merged into one using owl:sameAs or a similar property. Notice, however, that this method does not require knowledge graphs in a strict sense - KGTM processes only distances between classes, which are later used for scoring prototypes. Therefore, integrating different sources of knowledge is fairly easy and requires a minimal amount of effort - the KGTN-ens architecture seamlessly handles different types of distances derived from embeddings.
Problem formulation. Following Chen et al. (2020), the classification task is formulated as learning the prototypes of the considered classes. In the typical approach to classification, the model prediction $\hat{y}$ based on the input $x$ is obtained in the following way:

$\hat{y} = \arg\max_{k} p_k(x)$,

where $p_k(x)$ is calculated using the standard softmax function:

$p_k(x) = \frac{\exp(f_k(x))}{\sum_{k'=1}^{K} \exp(f_{k'}(x))}$,

where $K$ is the number of considered classes and $f$ is the linear classifier. Since $f_k(x)$ can be formulated as

$f_k(x) = w_k^{\top} \phi(x)$

for each class $k$, the classifier can be perceived as a similarity measure between the extracted features $\phi(x)$ and the vectors $w_k$. As a result, $w_k$ can be interpreted as a prototype for class $k$, and these prototypes are learned during the training process.
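The formulation above can be sketched numerically: class scores are inner products between the extracted features and per-class prototype vectors, turned into probabilities by a softmax. The prototypes below are made up for illustration, not trained weights.

```python
import numpy as np

def predict(phi_x, W):
    # f_k(x) = w_k . phi(x): one score per class prototype
    scores = W @ phi_x
    # numerically stable softmax over class scores
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    return int(probs.argmax()), probs

W = np.array([[1.0, 0.0],      # illustrative prototype for class 0
              [0.0, 1.0]])     # illustrative prototype for class 1
pred, probs = predict(np.array([0.9, 0.1]), W)
assert pred == 0 and np.isclose(probs.sum(), 1.0)
```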
The overall architecture of KGTN-ens is presented in Figure 1 and consists of three main parts: the feature extractor, the KGTMs, and prediction with ensembling. The feature extractor is a convolutional neural network, such as ResNet (He, Zhang, Ren and Sun, 2016), that extracts features from the input image. The KGTMs are a list of knowledge graph transfer modules (each one handling a different knowledge graph) that are used to generate prototypes. Finally, prediction with ensembling is a module that scores extracted features against the obtained prototypes in order to make the final classification.
KGTMs. Since we use a plain ResNet-50 as the feature extractor, we start the description with the KGTMs. Consider a dataset of images, where each image is associated with either a base class or a novel class. There are $N_{\text{base}}$ base classes and $N_{\text{novel}}$ novel classes ($N = N_{\text{base}} + N_{\text{novel}}$). In the original KGTN approach, the correlations between categories are encoded in a graph $\mathcal{G} = \{V, A\}$, where $V = \{v_1, v_2, \ldots, v_{N_{\text{base}}}, \ldots, v_N\}$ represents the classes and $A$ denotes an adjacency matrix, in which $a_{i,j}$ is the correlation between classes $i$ and $j$. Our approach extends this concept so that there are multiple graphs $\mathcal{G}_1, \ldots, \mathcal{G}_M$. Specifically, each of them shares the same classes but has different correlation values stored in its matrix.
Just as KGTN, KGTN-ens is based on the Gated Graph Neural Network (Li et al., 2016), in which each class is represented by a node $v$ associated with a hidden state $h_v^t$ at time $t$. It is initialised with $h_v^0 = w_v^{\text{init}}$, where $w_v^{\text{init}}$ is chosen at random. The aggregated parameter vector for node $v$ at time $t \in \{1, \ldots, T\}$ is defined as:

$a_v^t = \sum_{v'} a_{v,v'} \, h_{v'}^{t-1}$,

where $a_{v,v'}$ denotes the correlation between nodes $v$ and $v'$. The hidden states at time $t$ are determined with a gating mechanism inspired by the GRU (gated recurrent unit), which was introduced by Cho, Van Merriënboer, Bahdanau and Bengio (2014):

$z_v^t = \sigma(W^z a_v^t + U^z h_v^{t-1})$
$r_v^t = \sigma(W^r a_v^t + U^r h_v^{t-1})$
$\tilde{h}_v^t = \tanh(W a_v^t + U (r_v^t \odot h_v^{t-1}))$
$h_v^t = (1 - z_v^t) \odot h_v^{t-1} + z_v^t \odot \tilde{h}_v^t$

Here, $W^z$ and $U^z$ are the weights of the update gate, and $W^r$ and $U^r$ are the weights of the reset gate. The hyperbolic tangent function is given by $\tanh$, whereas $\sigma$ is the sigmoid function. The final weight $w_k^*$ for class $k$ is defined as:

$w_k^* = o(h_k^T, w_k^{\text{init}})$,

where $o$ is a fully connected layer.
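One GRU-style gated propagation step of this kind can be sketched as follows. The weight matrices are random placeholders and the update follows the standard GRU gating pattern described above; it is an illustration, not the paper's trained module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
N, D = 4, 3  # N nodes (classes), D-dimensional hidden states
Wz, Uz, Wr, Ur, Wh, Uh = (rng.normal(scale=0.1, size=(D, D)) for _ in range(6))

def ggnn_step(A, H):
    Agg = A @ H                                  # a_v: correlation-weighted aggregation
    Z = sigmoid(Agg @ Wz + H @ Uz)               # update gate z_v
    R = sigmoid(Agg @ Wr + H @ Ur)               # reset gate r_v
    H_tilde = np.tanh(Agg @ Wh + (R * H) @ Uh)   # candidate state
    return (1 - Z) * H + Z * H_tilde             # gated interpolation

A = rng.random((N, N))       # placeholder class-correlation matrix
H = rng.normal(size=(N, D))  # hidden states h_v^{t-1}
H_next = ggnn_step(A, H)     # hidden states h_v^t
assert H_next.shape == (N, D)
```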
Prediction and ensembling. The classifier $f_k(x)$ is treated as a similarity metric between the output of the feature extractor and the most similar class prototypes learned by the knowledge graph transfer module. In the original KGTN approach, the relationship between these two was calculated using the inner product, cosine similarity or Pearson's correlation coefficient. For the inner product, which was the most effective, the classifier was defined as $f_k(x) = \phi(x) \cdot w_k^*$, where $\phi(x)$ is the feature vector of an image and $w_k^*$ denotes the learned weight for class $k$. Conventionally, $\hat{y}(x) = \arg\max_k f_k(x)$.
However, in our approach, we use an ensembling-inspired technique to improve the performance of the classifier. In KGTN-ens, we calculate the similarity for each of the $M$ available graphs. Using a similar inner product approach, this is done in the following way: $f_{k,m}(x) = \phi(x) \cdot w_{k,m}^*$, where $w_{k,m}^*$ is the learned weight for class $k$ in the $m$-th graph. Then, the final result for class $k$ has to be chosen. Such an approach is inspired by ensemble learning strategies, though we do not use weak learners in a strict sense. One of the main drawbacks of ensemble learning - linear memory complexity with a proportional computational burden - is partially avoided, as only part of the network is multiplied. Most importantly, the feature extractor, which is often the largest component of modern architectures, is used only once. This enables us to fit several knowledge sources on consumer-grade GPUs (we used a single NVIDIA RTX 2080 Ti in our experiments). We propose two simple approaches for selecting the final result: mean and maximum. For the former, the result for class $k$ is the mean of the products:

$f_k(x) = \frac{1}{M} \sum_{m=1}^{M} f_{k,m}(x)$.

In the ensemble learning literature, this would be called soft voting. The maximum approach is very similar:

$f_k(x) = \max_{m} f_{k,m}(x)$.

In other words, we take the maximum of the similarities over the available graphs.

Optimisation. To enable a fair comparison, we use a two-step training regime similar to Hariharan and Girshick (2017) and Chen et al. (2020) - the first step is devoted to the feature extractor, whereas the second one fine-tunes the graph-related part of the network. In the first stage, we train the feature extractor $\phi(\cdot)$ using the base classes. The loss $\mathcal{L}_1$ calculated in this step consists of the standard cross-entropy loss and the squared gradient magnitude loss (Hariharan and Girshick, 2017), which acts as a regularisation term:

$\mathcal{L}_1 = \mathcal{L}_{\text{CE}} + \lambda \mathcal{L}_{\text{SGM}}$,

where $\lambda$ is a loss balance parameter. In the second stage, the weights of the feature extractor are frozen. The other parts of the architecture are trained using base and novel samples with an analogous loss, where $\lambda$ again balances the loss components.
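The mean and max ensembling rules described in the prediction step can be sketched as follows; the array shapes and random weights are illustrative assumptions.

```python
import numpy as np

def ensemble_scores(phi_x, W_star, mode="max"):
    # W_star: (M, K, D) learned prototypes, one (K, D) block per graph m
    # phi_x:  (D,) extracted image features
    scores = W_star @ phi_x          # (M, K): f_{k,m}(x) = phi(x) . w*_{k,m}
    if mode == "mean":
        return scores.mean(axis=0)   # soft voting over graphs
    return scores.max(axis=0)        # winner-takes-all over graphs

M, K, D = 2, 3, 4                    # 2 graphs, 3 classes, 4-dim features
rng = np.random.default_rng(2)
W_star = rng.normal(size=(M, K, D))
phi_x = rng.normal(size=D)

mean_s = ensemble_scores(phi_x, W_star, "mean")
max_s = ensemble_scores(phi_x, W_star, "max")
assert mean_s.shape == (K,) and max_s.shape == (K,)
assert np.all(max_s >= mean_s - 1e-12)  # the max never falls below the mean
```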

Evaluation
This section contains the results of the conducted experiments. First, we introduce the knowledge sources used - the semantic similarity graph, WordNet and Wikidata. Then, we describe the evaluation of KGTN-ens with different combinations of embeddings and compare it with previous work. Finally, we provide a detailed analysis and ablation studies.

Knowledge sources
In our evaluation, we use three different sources of knowledge, which can serve as the backbone of the KGTMs: hierarchy, glove, and wiki. The first two have been proposed by Chen et al. (2020). The wiki graph is constructed on top of Wikidata, a collaborative knowledge graph connected to Wikipedia (Vrandečić and Krötzsch, 2014). In this subsection, we discuss the preparation of these knowledge sources in detail.
Semantic similarity graph (glove). The first source of knowledge is built from GloVe word embeddings (Pennington, Socher and Manning, 2014). For two words $i$ and $j$, their semantic distance $d_{i,j}$ is defined as the Euclidean distance between their GloVe embeddings. Following Chen et al. (2020), the final correlation coefficient $a_{i,j}$ is obtained by passing $d_{i,j}$ through a squashing function (Equation 15), whose two parameters are set to 0.4 and 1.

WordNet category distance (hierarchy). This source of knowledge is built from the WordNet hierarchy - a popular lexical database of English (Miller, 1995). Since ImageNet classes are based on WordNet, the WordNet hierarchy can be used to measure the distance between two classes. This time the distance $d_{i,j}$ is defined as the number of common ancestors of the two words (categories) $i$ and $j$. The output is processed similarly to Equation (15), except that the parameter is set to 0.5.
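The construction of such a correlation matrix can be sketched as below. The pairwise Euclidean distances follow the description above; the squashing function (a Gaussian kernel with a made-up scale) is only a stand-in, since the exact form and parameters of Equation (15) come from Chen et al. (2020).

```python
import numpy as np

def pairwise_distances(E):
    # Euclidean distances between all rows of the embedding matrix E
    sq = (E ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * E @ E.T
    return np.sqrt(np.clip(d2, 0.0, None))

def correlations(D, scale=0.4):
    # placeholder squashing function: larger distance -> lower correlation
    # (NOT the paper's Equation 15, whose exact form is given there)
    return np.exp(-(D / scale) ** 2)

E = np.random.default_rng(3).normal(size=(5, 8))  # 5 classes, 8-dim embeddings
A = correlations(pairwise_distances(E))
assert np.allclose(np.diag(A), 1.0)  # a class correlates maximally with itself
```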

Wikidata embeddings (wiki).
The last source of knowledge is built from Wikidata embeddings. The mapping between the ImageNet classes and Wikidata is provided by Filipiak, Fensel and Filipowska (2021). Given the mapping, the class-corresponding entities from Wikidata can be embedded and used as class prototypes. Although there exist some datasets of Wikidata embeddings, they are often incomplete. Most importantly, they do not contain the embeddings of all ImageNet classes. Wembedder (Nielsen, 2017) offers 100-dimensional Wikidata embeddings made using the word2vec algorithm (Mikolov, Chen, Corrado and Dean, 2013), but it is based on an incomplete dump of Wikidata and does not contain all the classes needed in the ImageNet-FS dataset. Zhu, Xu, Tang and Qu (2019) proposed GraphVite, a general graph embedding engine, which comes with embeddings of Wikidata5m, a large dataset of 5 million Wikidata entities, created using numerous popular algorithms, such as TransE, DistMult, ComplEx, SimplE, RotatE, and QuatE. However, only 891 out of the 1,000 entities used in ImageNet are embedded there, which was not enough for performing the experiment.
We used the pre-trained 200-dimensional embeddings of Wikidata entities from PyTorch-BigGraph (Lerer, Wu, Shen, Lacroix, Wehrstedt, Bose and Peysakhovich, 2019), which are publicly available. The embeddings were prepared using the full Wikidata dump from 2019-03-06. All but three entities were directly mapped to embeddings via their Wikidata ID. The three remaining entities (Q1295201, Q98957255, Q89579852) could not be instantly matched - they were manually matched to "grocery store"@en, "cricket"@en, and Q655301 respectively. Given the mapping, we create an embedding array ordered as the mappings in the original KGTN paper (that is, a 1000 × 200 array, where 200 denotes the dimensionality of a single embedding). The same function from Equation (15) was used to generate the final correlations between the embeddings, although this time the parameter value 0.32 was used (see Section 4.3).

Experiment results
In this subsection, we present the results of the conducted experiments. We describe the evaluation data - the experiment has been conducted on the ImageNet-FS dataset. The training hyperparameters and the setup are also described, as well as the evaluation protocol and the evaluation metrics. Finally, we present the results of the experiments and compare them with previous work.
Data. Similarly to Chen et al. (2020), our approach has been evaluated on ImageNet-FS, a popular benchmark for the few-shot learning task. ImageNet-FS contains 1,000 classes from the ImageNet Large Scale Visual Recognition Challenge 2012 (Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg and Fei-Fei, 2015), of which 389 belong to the base category and 611 to the novel category. 193 base classes and 300 novel classes are used for training and cross-validation, whereas the test phase is performed on the remaining 196 base categories and 311 novel classes. Base categories consist of around 1,280 train and 50 test images per class. The authors of KGTN also evaluated their solution against a larger dataset, ImageNet-6K, which contains 6,000 classes (of which 1,00 belong to the novel category). Unfortunately, we were unable to test KGTN-ens using this dataset, since it had not been made public nor available to us at the time of writing this paper.
Training. To enable a fair comparison, we used the same two-step training and evaluation procedures as in KGTN. Stochastic gradient descent (SGD) was used to train the model with a batch size of 256 (divided equally between base and novel classes), a momentum of 0.9, and a weight decay of 0.0005. The learning rate is initially set at 0.1 and divided by 30 every 30 epochs. In general, we used the same hyperparameters as in KGTN unless stated otherwise.
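The stepwise learning-rate schedule described above can be written as a one-liner; the function below simply mirrors the stated hyperparameters (initial rate 0.1, divided by 30 every 30 epochs) and is an illustration rather than the released training code.

```python
def learning_rate(epoch, base_lr=0.1, decay=30.0, step=30):
    # divide the base rate by `decay` once per `step` epochs
    return base_lr / (decay ** (epoch // step))

assert learning_rate(0) == 0.1        # epochs 0-29
assert learning_rate(30) == 0.1 / 30  # epochs 30-59
```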
Setup. All the experiments have been conducted on a single NVIDIA GeForce RTX 2080 Ti GPU. We used the code released by the authors of KGTN and modified it to support the KGTN-ens approach. PyTorch (Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga and Lerer, 2017) was used to conduct the experiments. The code will be released after the publication of this article.
Evaluation. Following previous work in few-shot learning, we report our evaluation results in terms of the top-5 accuracy on the novel and all (base + novel) classes in the $k$-shot learning task, where $k \in \{1, 2, 5, 10\}$ is the number of samples per novel class. Following Hariharan and Girshick (2017) and Chen et al. (2020), we repeat each experiment five times and report the averaged values of the top-5 accuracy. Table 1 shows the classification results.

Table 1
Top-5 accuracy on the novel and all subsets of ImageNet-FS. All the methods used ResNet-50 as a feature extractor. Partially based on the data provided by Chen et al. (2020).

Details and ablation studies
This subsection provides more details on the KGTN-ens model and its ablation studies. We analyse the impact of the following factors on the performance of the KGTN-ens model: the adjacency matrices, the embeddings used, the ensembling method, the similarity function, and the variance of the results.
Adjacency matrix analysis. Since the glove knowledge graph was the most effective for KGTN, we assume that wiki should roughly resemble it in terms of its distribution. In order to investigate the similarity between the distributions, adjacency matrices have been created using pairwise Euclidean distances. While glove and wiki are normal-like, the distribution for hierarchy is bimodal and most of the distances are the highest ones (Figure 3). To assess the correlation between adjacency matrices, Mantel tests have been performed (Table 3). The values marked as processed were run through Equation (15). Correlations of the processed matrices are visibly higher compared to the raw ones, especially for glove and wiki. The highest correlation has been observed between glove and wiki.

Importance of the knowledge graphs used. Firstly, we analyse the influence of the KGs used separately (without ensembling) - that is, with the original KGTN architecture. Table 4 shows the results of the ablation studies on the three knowledge graphs. The hierarchy and glove knowledge graphs are the ones examined by Chen et al. (2020), whereas the wiki knowledge graph is the one introduced in our experiments. In order to ensure that the advantage comes from the knowledge encoded in the KGs, Chen et al.
argue that glove and hierarchy embeddings perform better than uniform (all correlations set to $1/N$) and random (correlations drawn from the uniform distribution) distance matrices. Similarly, the usage of the wiki knowledge graph yielded generally better results (up to +3.44 pp for 1-shot in the novel category) compared to the random and uniform cases, which constitutes a noticeable improvement. However, compared to glove and hierarchy, the wiki knowledge graph yields worse results - notably for low-shot scenarios. We hypothesise that the weaker performance of the wiki knowledge graph is due to the low quality of the embeddings, as some issues regarding their accuracy have previously been reported.

Importance of the ensembling method. Table 6 presents the results for the different ensembling strategies compared to the KGTN baseline, which can be treated as a KGTN-ens model with no ensembling. Mean ensembling gave mixed results compared to the baseline (+0.34, −0.63, +0.36, −0.27 pp for novel classes and −1.45, −1.41, +0.14, −0.17 pp for all classes, both groups for $k \in \{1, 2, 5, 10\}$ respectively). However, the max ensembling strategy has been better in all the cases (+0.77, +0.40, +0.29, +0.07 pp for novel classes and +0.24, +0.18, +0.19, +0.06 pp for all classes). A possible explanation of this effect might stem from the winner-takes-all nature of the maximum function, which chooses the embedding most similar to the given prototype and rejects other, potentially improper, embeddings. At the same time, these improper embeddings still contribute to the overall formula in the mean ensembling function. However, research on a larger number of employed knowledge graphs has to be conducted to validate this hypothesis.
Variance of the results. Contrary to expectations, adding additional knowledge sources slightly increases the variance of the results in most cases (Table 5). A possible explanation of these results is the fact that KGTN-ens is not an ensembling technique in the typical sense of the word, but rather a way of choosing among the embeddings of the different knowledge sources. We report results for novel classes only, as the difference in variance is amplified among these (see also Figure 2). No significant differences between the variance of mean and max ensembling have been found. The variance of the results for the baseline KGTN has been obtained using five runs of the original KGTN with glove embeddings.
Importance of the similarity function. Table 4 includes data for ablative studies of KGTN with three different similarity functions: cosine similarity, inner product and Pearson correlation. Chen et al. (2020) analysed all of these for KGTN with glove embeddings. In general, the inner product showed the best performance. These conclusions can be extrapolated to the wiki graph, as the inner product usually turned out to be the most effective in terms of the top-5 accuracy. Interestingly, Pearson correlation displayed the best performance for the 1-shot scenario with novel classes. Table 7 presents the results for the different similarity functions used in KGTN-ens. While the combination of hierarchy and glove embeddings was usually the best for cosine similarity as well, the results are visibly worse compared to the inner product similarity function (e.g. a −3.16 pp top-5 accuracy difference for the 1-shot scenario among novel classes). Noticeably, the combination of these two graphs with the cosine similarity function performed worse than KGTN based solely on glove embeddings (for example, there is a −2.39 pp top-5 accuracy difference for the 1-shot scenario among novel classes).

Conclusion
In this work, we proposed KGTN-ens, which builds on KGTN and allows the incorporation of multiple knowledge sources in order to achieve better performance. We evaluated KGTN-ens on the ImageNet-FS dataset and showed that it outperforms KGTN in most of the tested settings. We also evaluated Wikidata embeddings in the same task and showed that they are not as effective as the other embeddings. We believe that the proposed approach can be used in other few-shot learning tasks, and we plan to test it in the future. Although the dataset was not publicly available at the time of writing this article, further work might include an evaluation of the proposed approach on the ImageNet-6K dataset (Chen et al., 2020). A certain limitation of this study is that it might not scale well to extreme classification problems, since the calculation of pairwise distances of nodes from large knowledge graphs requires quadratic memory complexity.

Figure 2 :
Figure 2: KGTN-ens (blue) mean top-5 accuracy compared to KGTN (orange) over 5 runs. KGTN-ens uses the glove and hierarchy graphs combined with the max ensembling function. Horizontal lines indicate standard deviations.

Table 2
Descriptive statistics about the adjacency matrices.

Table 4
Results for KGTN on different embeddings tested against different similarity functions. Results for glove and hierarchy embeddings are provided by Chen et al. (2020). Bold results indicate the best ones for the wiki graph only.