1 Introduction

Deep learning has made a substantial impact on a number of industrial and research areas. This includes computer vision, where the rapid development of representation learning started with the seminal work of [17] on image classification. However, state-of-the-art models often require large amounts of data to train, which can be costly to gather and label – especially for vision-related tasks. Therefore, an intense research effort can be observed in the area of data-efficient machine learning methods. Few-shot learning (often abbreviated as FSL) is a machine learning task in which the model is (partially) trained on a small amount of data – part of the labelled data is available in standard amounts, whereas the other part consists of only a few (typically fewer than 10) samples per class. Few-shot learning can also suffer from selection bias, since the decision boundaries need to be adjusted to a few new samples, which can contain irrelevant and misleading artefacts (such as a background colour). Hence the learning process is substantially more challenging.

One way to tackle the few-shot learning task is to use some prior knowledge about the labelled data. Knowledge Graph Transfer Network (KGTN), the recent work of [2], solves this problem by learning class prototypes from external sources of knowledge and comparing them against the features extracted from an input image. A similarity function scores the outputs of these two components and yields the class probability distribution. The external sources of knowledge are represented as class correlation matrices. A vital element of this architecture is the knowledge graph transfer module (KGTM), which learns class prototypes from knowledge graph embeddings using gated graph neural networks (GGNN) [20].

In the KGTN approach, one has to select a single knowledge source. Inspired by ensemble learning approaches, this observation leads us to the following questions: is it possible to learn prototypes from multiple knowledge graph embeddings? If so, will it result in higher performance metrics, such as accuracy for classification problems? Therefore, we propose KGTN-ens, an extension of KGTN that uses multiple embeddings instead of a single one. Each of them generates different prototypes, which are later combined and compared against the output of the feature extractor. We test two ensemble learning techniques in this paper. We also evaluate different combinations of three knowledge graphs, one of which (based on Wikidata) is introduced by us and has not been used in the original paper. Our solution is knowledge graph agnostic, provided that the knowledge graph is embedded and linked to the classes used in the image classification task.

The contribution of this paper is two-fold: (1) we propose KGTN-ens, a new method based on KGTN, and evaluate it with different combinations of embeddings in a few-shot image classification task, (2) we construct a new knowledge source – Wikidata embeddings – and evaluate it with KGTN and KGTN-ens. For a standard few-shot benchmark setup (ImageNet-FS dataset, ResNet-50 as a feature extractor), our approach outperforms KGTN in terms of top-5 accuracy for the majority of tested settings. Specifically, we achieved +0.63/+0.58/+0.43/+0.26 pp. (novel classes) and +0.26/+0.25/+0.32/–0.04 pp. (all classes) for \(k \in \{ 1, 2, 5, 10\}\) respectively (averaged over 5 different runs). The method also extends the original KGTN approach by using not one, but multiple knowledge sources at a small computational cost. The Wikidata embeddings may serve as a new knowledge source for other tasks beyond this study. The code is available on GitHub (Footnote 1).

The remainder of this paper is organised as follows. A comprehensive literature survey on related work is presented in Section 2. Section 3 provides a description of the KGTN-ens architecture. Section 4 describes the results of the evaluation of Wikidata embeddings with KGTN and KGTN-ens with different combinations of embeddings, along with the detailed analysis and ablation studies. Section 5 concludes the paper.

2 Related work

This section provides a comprehensive overview of the related work. We start with a brief review of the techniques used for graph neural networks, which are at the core of the KGTN-ens architecture. Then, we provide a short survey on recent advancements in few-shot learning, which is the main machine learning task solved by the architecture presented in this paper.

Graph neural networks

In general, graph neural networks (GNNs) are a type of neural network that processes attributes of graphs. Tasks tackled by GNNs can be either node-level (such as predicting a property for each node), edge-level (predicting a property for each edge), or graph-level (predicting a property of a whole graph) [28]. Following [16], a crucial feature of GNNs is being either invariant or equivariant to permutations. That is, for a graph \(\mathcal {G}\), a network f and a permutation \(\Pi \), we have \(f(\Pi \star \mathcal {G})=f(\mathcal {G})\) and \(f(\Pi \star \mathcal {G})=\Pi \star f(\mathcal {G})\) for invariance and equivariance respectively. The general-purpose models from the state-of-the-art family of transformer architectures [32] can be viewed as a special instance of a graph neural network. Graph neural networks have a wide area of applications, with notable examples in biology (e.g. protein interface prediction) or social networks (e.g. community detection or link prediction). A less obvious application of GNNs is in the field of image classification, where they are used to learn prototypes from knowledge graph embeddings in a few-shot learning setting.
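
As an illustration of these two properties, the following minimal NumPy sketch (with a made-up one-step message-passing map and a sum-pooling readout, not any specific model from the literature) checks invariance and equivariance under a random node permutation:

```python
import numpy as np

def message_pass(adj, feats):
    """One linear message-passing step: each node aggregates its neighbours' features."""
    return adj @ feats  # (n, n) @ (n, d) -> (n, d)

def graph_readout(adj, feats):
    """Permutation-invariant graph-level readout: sum-pool the updated node features."""
    return message_pass(adj, feats).sum(axis=0)

rng = np.random.default_rng(0)
n, d = 5, 3
adj = rng.integers(0, 2, size=(n, n)).astype(float)
feats = rng.normal(size=(n, d))

# Apply a random node permutation Pi to the graph.
perm = rng.permutation(n)
P = np.eye(n)[perm]
adj_perm = P @ adj @ P.T   # Pi * G: relabel rows and columns of the adjacency matrix
feats_perm = P @ feats     # and reorder the node features accordingly

# Invariance of the graph-level map:  f(Pi * G) == f(G)
assert np.allclose(graph_readout(adj_perm, feats_perm), graph_readout(adj, feats))
# Equivariance of the node-level map:  f(Pi * G) == Pi * f(G)
assert np.allclose(message_pass(adj_perm, feats_perm), P @ message_pass(adj, feats))
```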

GNNs fall into a broader category of geometric deep learning, which is devoted to the application of deep neural networks to structured non-Euclidean domains, such as graphs, manifolds, meshes, or grids [1]. Gilmer et al. [13] proposed message passing, which is one of the most important concepts in GNNs. In this approach, nodes and/or edges rely on their neighbours in order to iteratively create meaningful embeddings. Wu et al. [37] classify GNNs into four broad categories: recurrent GNNs (RecGNNs), convolutional GNNs (ConvGNNs), graph autoencoders (GAEs), and spatial-temporal GNNs (STGNNs). In this article, Gated Graph Neural Networks (GGNNs) [20], which belong to the category of RecGNNs, are of special interest. For a fair comparison with KGTN (our baseline), we used GGNNs in our experiments. Following [20], the intuitive difference is that the original GNN formulation relies on the explicit graph structure, which results in greater generalisation capabilities, whereas the GGNN is the less general of the two.

Few-shot learning

While convolutional neural networks are very effective for numerous vision tasks, one of the main problems with them (and with machine learning in general) is the amount of data they need to provide meaningful predictions. More recent architectures, such as self-attention models, require even more data to train. In contrast, humans typically require only a few samples to acquire knowledge of seen objects. One way to tackle this issue is few-shot learning, which is aimed at learning from scarce data. The complexity of the problem often stems from the required sudden shift of decision boundaries, which is hard to achieve using only a few samples. A special case of few-shot learning is one-shot learning, i.e. learning from one labelled sample per class.

Following [31], few-shot learning methods can be divided into data augmentation, transfer learning, meta-learning, and multimodal learning. Data augmentation techniques aim to artificially extend the amount of available data by transforming either the input data [3] or the resulting features [4]. Transfer learning focuses on reusing features from networks trained on different datasets with the required amount of data, using techniques such as pre-training and fine-tuning or domain adaptation. Meta-learning includes techniques devoted to learning from data and tasks in order to reuse this knowledge for future downstream tasks. Finn et al. [10] proposed MAML, a model-agnostic meta-learning algorithm. Specialised approaches to meta-learning include neural architecture search [8] or metric learning [5, 11]. Finally, multimodal learning focuses on the incorporation of external knowledge from heterogeneous domains, such as text, speech or knowledge graphs [35].

The concept of prototypes was introduced in the work of [30], which proposed prototypical networks focused on learning a metric space between class instances and their prototypes. Hariharan and Girshick [14] used representation regularisation and introduced the concept of hallucinations in order to enlarge the number of available representations during training. Wang et al. [36] employed meta-learning techniques and combined them with the aforementioned hallucinations to improve few-shot classification metrics. A growing number of scholars incorporate structured knowledge into their computer vision research [23]. For instance, [19] studied transferable features with a hierarchy encoding semantic relations; their approach turned out to be applicable to the problem of zero-shot learning as well. Shen et al. [29] proposed a model-agnostic regularisation technique that leverages the relationship between graph labels in order to preserve category neighbourhoods.

Fig. 1 Architecture of KGTN-ens

3 Method

This section explains the details of KGTN-ens. The method extends the KGTN architecture proposed by [2], which relies on graph-based knowledge transfer to yield state-of-the-art results in few-shot image classification. The most important difference lies in the use of multiple graphs instead of a single one, which enables the use of different knowledge sources. Each of these graphs generates different prototypes, which are later combined and compared against the output of the feature extractor. It might not be immediately obvious why an approach with multiple knowledge graphs is used, as they could be merged into one using owl:sameAs or a similar property. Notice, however, that this method does not require knowledge graphs in a strict sense – the KGTM processes only distances between classes, which are later used for scoring prototypes. Therefore, integrating different sources of knowledge is fairly easy and requires a minimum amount of effort – the KGTN-ens architecture seamlessly handles different types of distances derived from embeddings.

Problem formulation

Following [2], the classification task is formulated as learning the prototypes of considered classes. In the typical approach to classification, the model prediction \(\hat{y}\) based on the input x is obtained in the following way:

$$\begin{aligned} \hat{y} = \underset{k}{\text {argmax}}\ p(y = k \mid x) \end{aligned}$$
(1)

where p is calculated using the standard softmax function:

$$\begin{aligned} p(y = k \mid x) = \frac{\exp \left( f_{k}(\textbf{x}) \right) }{\sum _{i=1}^{K} \exp \left( f_{i}(\textbf{x}) \right) }, \end{aligned}$$
(2)

where K is the number of considered classes and \(f_{k}\) is the linear classifier. Since

$$\begin{aligned} \underset{k}{\text {argmax}}\ p(y = k \mid \textbf{x}) = \underset{k}{\text {argmax}}\ f_{k}(\textbf{x}), \end{aligned}$$
(3)

the \(f_{k}(\textbf{x})\) can be formulated as follows:

$$\begin{aligned} f_{k}(\textbf{x}) = \textbf{w}^T_k \textbf{x} + b_k = - \frac{1}{2} \left\Vert \textbf{w}_k - \textbf{x} \right\Vert ^2_2 + \frac{1}{2} \left\Vert \textbf{w}_k\right\Vert ^2_2 + \frac{1}{2} \left\Vert \textbf{x} \right\Vert ^2_2 + b_k, \end{aligned}$$
(4)

Setting \(b_k = 0\) and \(\left\Vert \textbf{w}_i \right\Vert _2 = \left\Vert \textbf{w}_j \right\Vert _2\) for every pair i, j, the classifier \(f_k(\textbf{x})\) can be perceived as a similarity measure between the extracted features and the prototypes:

$$\begin{aligned} \hat{y} = \underset{k}{\text {argmax}}\ p(y = k \mid \textbf{x}) = \underset{k}{\text {argmin}}\ \left\Vert \textbf{w}_k - \textbf{x} \right\Vert ^2_2. \end{aligned}$$
(5)

As a result, \(\textbf{w}_k\) can be interpreted as a prototype for class k, and these prototypes are learned during the training process.
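
As a minimal sketch of this prototype view (PyTorch, with illustrative shapes and random tensors standing in for learned prototypes and extracted features), classification reduces to picking the nearest prototype:

```python
import torch

K, d = 1000, 2048                # number of classes and feature dimension (illustrative)
prototypes = torch.randn(K, d)   # w_k, learned during training
x = torch.randn(d)               # features extracted from one image

# Squared Euclidean distance to every prototype: ||w_k - x||_2^2
sq_dist = ((prototypes - x) ** 2).sum(dim=1)

# Under the assumptions b_k = 0 and equal-norm prototypes,
# argmax_k p(y = k | x) coincides with the nearest prototype.
y_hat = torch.argmin(sq_dist)
```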

The overall architecture of KGTN-ens is presented in Fig. 1 and consists of three main parts: the feature extractor, the KGTMs, and prediction with ensembling. The feature extractor is a convolutional neural network, such as ResNet [15], that extracts features from the input image. The KGTMs are a list of knowledge graph transfer modules (each handling a different knowledge graph) that are used to generate prototypes. Finally, prediction with ensembling is a module that scores the extracted features against the obtained prototypes in order to make the final classification.

KGTMs

Since we use a plain ResNet-50 as the feature extractor, we start the description with the KGTMs. Consider a dataset of images, each associated with either a base class or a novel class. There are \(K_{\texttt {base}}\) base classes and \(K_{\texttt {novel}}\) novel classes (\(K=K_{\texttt {base}}+K_{\texttt {novel}}\)). In the original KGTN approach, the correlations between categories are encoded in a graph \(\mathcal {G}=\{ \textbf{V}, \textbf{A} \}\), where \(\textbf{V} = \{ v_1, v_2, \dots , v_{K_{\texttt {base}}}, \dots , v_K \}\) represents classes and \(\textbf{A}\) denotes an adjacency matrix in which \(A_{i,j}\) is the correlation between classes \(v_i\) and \(v_j\). Our approach extends this concept by using multiple graphs \(\mathcal {G}_1, \dots , \mathcal {G}_M\). Specifically, each of them shares the same classes \(\textbf{V}\) but has different correlation values stored in its \(\textbf{A}\) matrix.

Like KGTN, KGTN-ens is based on the Gated Graph Neural Network [20], in which each class is represented by a node \(v_k\) associated with a hidden state \(\textbf{h}^t_k\) at time t. It is initialised with \(\textbf{h}_k^0=\textbf{w}_{k}^{\text {init}}\), where the \(\textbf{w}_{k}^{\text {init}}\) are chosen at random. The parameter vector \(\textbf{a}_k^t\) for node k at time \(t \in \{1, \ldots , T\}\) is defined as:

$$\begin{aligned} \textbf{a}_k^t=\left[ \sum _{k'=1}^K{a_{kk'}\textbf{h}_{k'}^{t-1}}, \sum _{k'=1}^K{a_{k'k}\textbf{h}_{k'}^{t-1}}\right] , \end{aligned}$$
(6)

where \(a_{kk'}\) denotes the correlation between nodes k and \(k'\). The hidden state \(\textbf{h}_{k}^{t}\) for node k at time t is determined with a gating mechanism inspired by the gated recurrent unit (GRU), which was introduced by [6]:

$$\begin{aligned} \begin{aligned} \textbf{z}_k^t&=\sigma (\textbf{W}^z{\textbf{a}_k^t}+\textbf{U}^z{\textbf{h}_k^{t-1}}), \\ \textbf{r}_k^t&=\sigma (\textbf{W}^r{\textbf{a}_k^t}+\textbf{U}^r{\textbf{h}_k^{t-1}}), \\ \widetilde{\textbf{h}_k^t}&=\tanh \left( \textbf{W}{\textbf{a}_k^t}+\textbf{U}({\textbf{r}_k^t}\odot {\textbf{h}_k^{t-1}})\right) , \\ \textbf{h}_k^t&=(1-{\textbf{z}_k^t}) \odot {\textbf{h}_k^{t-1}}+{\textbf{z}_k^t}\odot {\widetilde{\textbf{h}_k^t}}. \end{aligned} \end{aligned}$$
(7)

Here, \(\textbf{W}^z\) and \(\textbf{U}^z\) are the weights for the update gate, and \(\textbf{W}^r\) and \(\textbf{U}^r\) are the weights for the reset gate. The hyperbolic tangent function is given by tanh, whereas \(\sigma \) is the sigmoid function. The final weight \(\textbf{w}_k^{*}\) for class k is defined as:

$$\begin{aligned} \textbf{w}_k^{*} = o ( \textbf{h}_k^{T}, \textbf{h}_k^{0} ), \end{aligned}$$
(8)

where o is a fully connected layer.
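
A simplified PyTorch sketch of the propagation rule in (6)-(8) for a single KGTM is given below; the class and parameter names are ours, biases are omitted, and a single correlation matrix A is assumed rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class KGTMSketch(nn.Module):
    """Simplified gated graph propagation over class prototypes (after Li et al. [20])."""
    def __init__(self, num_classes, dim, steps):
        super().__init__()
        self.steps = steps
        self.Wz, self.Uz = nn.Linear(2 * dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.Wr, self.Ur = nn.Linear(2 * dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.W,  self.U  = nn.Linear(2 * dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(2 * dim, dim)                      # o(h^T, h^0): fully connected layer
        self.h0 = nn.Parameter(torch.randn(num_classes, dim))   # w_k^init, randomly initialised

    def forward(self, A):
        # A: (K, K) correlation matrix of one knowledge graph; h: (K, dim) hidden states.
        h = self.h0
        for _ in range(self.steps):
            # a_k^t: aggregate neighbours' states along both edge directions (Eq. 6)
            a = torch.cat([A @ h, A.t() @ h], dim=1)
            z = torch.sigmoid(self.Wz(a) + self.Uz(h))       # update gate
            r = torch.sigmoid(self.Wr(a) + self.Ur(h))       # reset gate
            h_tilde = torch.tanh(self.W(a) + self.U(r * h))  # candidate state
            h = (1 - z) * h + z * h_tilde                    # Eq. (7)
        return self.out(torch.cat([h, self.h0], dim=1))      # w_k^* for every class (Eq. 8)
```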

Prediction and ensembling

The classifier \(f({\textbf{x}})\) is treated as a similarity metric between the output of the feature extractor and the most similar class prototypes learned by the knowledge graph transfer module. In the original KGTN approach, the relationship between these two was calculated using the inner product, cosine similarity or Pearson's correlation coefficient. For the inner product, which was the most effective, the classifier is defined as

$$\begin{aligned} f_{k}(\textbf{x}) = \textbf{x} \cdot \textbf{w}_{k}^{*}, \end{aligned}$$
(9)

where \(\textbf{x}\) is the feature vector of an image and \(\textbf{w}_{k}^{*}\) denotes the learned weight for class k. Cosine similarity is defined as

$$\begin{aligned} f_{k}(\textbf{x}) = \left( \textbf{x} \cdot \textbf{w}_{k}^{*}\right) \cdot \left( \left\Vert \textbf{x}\right\Vert _2 \left\Vert \textbf{w}_{k}^{*}\right\Vert _2 \right) ^{-1}, \end{aligned}$$
(10)

whereas Pearson's correlation coefficient is given by

$$\begin{aligned} f_{k}(\textbf{x}) = \left( \left( \textbf{x} - \bar{\textbf{x}} \right) \cdot \left( \textbf{w}_{k}^{*} - \bar{\textbf{w}}_{k}^{*} \right) \right) \left( \left\Vert \textbf{x} - \bar{\textbf{x}}\right\Vert _{2} \left\Vert \textbf{w}_{k}^{*} - \bar{\textbf{w}}_{k}^{*}\right\Vert _{2} \right) ^{-1}. \end{aligned}$$
(11)

Here, \(\bar{\textbf{x}}\) and \(\bar{\textbf{w}}_{k}^{*}\) are respectively the mean values of \(\textbf{x}\) and \(\textbf{w}_{k}^{*}\) repeated to match the shape of \(\textbf{x}\) and \(\textbf{w}_{k}^{*}\). Conventionally,

$$\begin{aligned} f({\textbf{x}}) = \underset{k}{\arg \max }\ f_{k}(\textbf{x}). \end{aligned}$$
(12)

However, in our approach, we use the ensembling-inspired technique to improve the performance of the classifier.

In KGTN-ens, we calculate the similarity for each of the M available graphs. Using the same inner product approach, this is done in the following way: \(f_{k,m}(\textbf{x}) = \textbf{x} \cdot \textbf{w}_{k,m}^{*}\), where \(\textbf{w}_{k,m}^{*}\) is the weight for class k learned from the m-th graph. Then, the final result for class k has to be chosen. This approach is inspired by ensemble learning strategies, although we do not use weak learners in a strict sense. One of the main drawbacks of ensemble learning – memory consumption that grows linearly with the ensemble size, together with a proportional computational burden – is partially avoided, as only part of the network is replicated. Most importantly, the feature extractor, which is often the largest component of modern architectures, is used only once. This enables us to fit several knowledge sources on consumer GPUs (we used a single NVIDIA RTX 2080 Ti in our experiments). We propose two simple approaches for selecting the final result: mean and maximum. For the former, the result for class k is the mean of the M per-graph scores:

$$\begin{aligned} f_{k}(\textbf{x}) = \frac{1}{M} \sum _{m=1}^{M} f_{k,m}(\textbf{x}). \end{aligned}$$
(13)

In ensemble learning literature, this would be called soft voting. The maximum approach is very similar:

$$\begin{aligned} f_{k}(\textbf{x}) = \max _{m=1}^{M} \left( f_{k,m}(\textbf{x}) \right) . \end{aligned}$$
(14)

In other words, we take the maximum of the similarities over the M available graphs.
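
Both combination rules are straightforward to express in code. The following PyTorch sketch assumes a hypothetical tensor scores holding the per-graph scores \(f_{k,m}(\textbf{x})\) for a single image:

```python
import torch

# scores[m, k] = f_{k,m}(x): inner product between the image features
# and the prototype of class k learned from the m-th knowledge graph.
M, K = 2, 1000
scores = torch.randn(M, K)        # illustrative values

f_mean = scores.mean(dim=0)       # Eq. (13): soft voting over graphs
f_max = scores.max(dim=0).values  # Eq. (14): winner-takes-all over graphs

y_hat_mean = f_mean.argmax()
y_hat_max = f_max.argmax()
```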

Optimisation

To enable a fair comparison, we use a two-step training regime similar to [14] and [2] – the first stage is devoted to the feature extractor, whereas the second one fine-tunes the graph-related part of the network. In the first stage, we train the feature extractor \(\phi (\cdot )\) using the base classes from \(\mathcal {D}_{base}\). The loss \(\mathcal {L}_1\) calculated in this step consists of the standard cross-entropy loss and the squared gradient magnitude loss [14], which acts as a regularisation term:

$$\begin{aligned} \mathcal {L}_1 = \mathcal {L}_c + \lambda \mathcal {L}_s, \end{aligned}$$
(15)

where:

$$\begin{aligned} \mathcal {L}_c&= - \frac{1}{N_\text {base}} \sum _{i=1}^{N_\text {base}} \sum _{k=1}^{K_\text {base}} \mathbbm {1}_{k = y_i} \log p_i^k,\\ \mathcal {L}_s&= \frac{1}{N_\text {base}} \sum _{i=1}^{N_\text {base}} \sum _{k=1}^{K_\text {base}} \left( p_i^k - \mathbbm {1}_{k = y_i} \right) \left\Vert \textbf{x}_i \right\Vert ^2_2, \end{aligned}$$
(16)

where \(\mathbbm {1}\) is the indicator function and \(\lambda \) is a loss balance parameter. In the second stage, the weights of the feature extractor are frozen. Other parts of the architecture are trained using base and novel samples with the following loss:

$$\begin{aligned} \mathcal {L}_2 = - \frac{1}{N} \sum _{i=1}^{N} \sum _{k=1}^{K} \mathbbm {1}_{k = y_i} \log p_i^k + \eta \sum _{k=1}^{K} \left\Vert \textbf{w}_k^{*} \right\Vert ^2_2, \end{aligned}$$
(17)

where \(\eta \) balances the loss components.
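
A minimal PyTorch sketch of the second-stage objective (17) is shown below; the function and variable names, and the default value of \(\eta \), are illustrative rather than the exact values used in our experiments:

```python
import torch
import torch.nn.functional as F

def stage2_loss(logits, targets, w_star, eta=0.001):
    """Cross-entropy over base + novel classes plus L2 regularisation of the
    generated prototypes w_k^* (Eq. 17). eta is a balance hyperparameter
    (the value here is illustrative only)."""
    ce = F.cross_entropy(logits, targets)   # first term of Eq. (17), averaged over the batch
    reg = (w_star ** 2).sum()               # sum_k ||w_k^*||_2^2
    return ce + eta * reg
```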

4 Evaluation

This section contains the results of the conducted experiments. First, we introduce the used knowledge sources – semantic similarity graph, WordNet and Wikidata. Then, we describe the evaluation of KGTN-ens with different combinations of embeddings and compare them with the previous work. Finally, we provide a detailed analysis and ablation studies.

4.1 Knowledge sources

In our evaluation, we use three different sources of knowledge, which can be the backbone of KGTMs: hierarchy, glove, and wiki. The first two have been proposed by [2]. The wiki graph is constructed on top of Wikidata, a collaborative knowledge graph connected to Wikipedia [34]. In this subsection, we discuss the preparation of these knowledge sources in detail.

Semantic similarity graph (glove)

The first source of knowledge is built from GloVe word embeddings [26]. For two words \(w_i\) and \(w_j\), their semantic distance \(d_{i,j}\) is defined as the Euclidean distance between their GloVe embeddings \(\textbf{f}_i^w\) and \(\textbf{f}_j^w\). Following [2], the final correlation coefficient \(a_{i,j}\) is obtained using the following function:

$$\begin{aligned} a_{i,j} = \lambda ^{d_{i,j} - \min \left\{ d_{i,k} \mid k \ne i \right\} }, \end{aligned}$$
(18)

where \(\lambda =0.4\) and \(a_{i,i}=1\).
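
The following NumPy sketch implements (18) for an arbitrary pairwise distance matrix; the function name is ours and the input is assumed to be a symmetric \(K \times K\) array of embedding distances:

```python
import numpy as np

def correlations_from_distances(dist, lam=0.4):
    """Eq. (18): a_{i,j} = lam ** (d_{i,j} - min_{k != i} d_{i,k}), with a_{i,i} = 1."""
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)                  # exclude d_{i,i} from the row minimum
    row_min = d.min(axis=1, keepdims=True)       # min_{k != i} d_{i,k} for every row i
    a = lam ** (dist - row_min)                  # off-diagonal values fall in (0, 1]
    np.fill_diagonal(a, 1.0)
    return a
```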

WordNet category distance (hierarchy)

This source of knowledge is built from the WordNet hierarchy – a popular lexical database of English [22]. Since ImageNet classes are based on WordNet, the WordNet hierarchy can be used to measure the distance between two classes. This time the distance \(d_{i,j}\) is defined as the number of common ancestors of the two words (categories) \(w_i\) and \(w_j\). The output is processed similarly to (18), except that the \(\lambda \) parameter is set to 0.5.

Wikidata embeddings (wiki)

The last source of knowledge is built from Wikidata embeddings. The mapping between the ImageNet classes and Wikidata is provided by [9]. Given this mapping, the class-corresponding entities from Wikidata can be embedded and used as class prototypes. Although there exist some datasets of Wikidata embeddings, they are often incomplete. Most importantly, they do not contain embeddings for all the ImageNet classes. Wembedder [24] offers 100-dimensional Wikidata embeddings made using the word2vec algorithm [21], but it is based on an incomplete dump of Wikidata and does not contain all the classes needed for the ImageNet-FS dataset. Zhu et al. [38] proposed GraphVite, a general graph embedding engine, which comes with embeddings of Wikidata5m – a large dataset of 5 million Wikidata entities – created using numerous popular algorithms, such as TransE, DistMult, ComplEx, SimplE, RotatE, and QuatE. However, only 891 out of the 1,000 entities used in ImageNet are embedded, which was not enough to perform the experiment.

We used the pre-trained 200-dimensional embeddings of Wikidata entities from PyTorch BigGraph [18], which are publicly available (Footnote 2). The embeddings were prepared using the full Wikidata dump from 2019-03-06. All but three entities were mapped directly to embeddings via their Wikidata IDs. The three remaining entities (Q1295201, Q98957255, Q89579852) could not be matched instantly – they were manually matched to "grocery store"@en, "cricket"@en, and Q655301 respectively. Having the mapping, we create an embedding array ordered as the mappings in the original KGTN paper (that is, a \(1000 \times 200\) array, where 200 denotes the dimensionality of a single embedding). The same function from (18) was used to generate the final correlations between the embeddings, although this time \(\lambda =0.32\) was used (see Section 4.3).
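
The wiki correlation matrix can then be obtained roughly as follows (a sketch reusing the correlations_from_distances helper from the sketch above; the embedding file path is hypothetical and the preprocessing of the BigGraph dump is not shown):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# emb[i] is the 200-dim PyTorch BigGraph embedding of the Wikidata entity
# mapped to the i-th ImageNet class (same class order as in KGTN).
emb = np.load("wikidata_imagenet_embeddings.npy")    # hypothetical path, shape (1000, 200)

dist = squareform(pdist(emb, metric="euclidean"))    # pairwise Euclidean distances, (1000, 1000)
adjacency_wiki = correlations_from_distances(dist, lam=0.32)  # Eq. (18) with lambda = 0.32
```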

4.2 Experiment results

In this subsection, we present the results of the conducted experiments. We describe the evaluation data – the experiments were conducted on the ImageNet-FS dataset – as well as the training hyperparameters, the setup, the evaluation protocol, and the evaluation metrics. Finally, we present the results of the experiments and compare them with the previous work.

Data

Similarly to Chen et al., our approach has been evaluated on ImageNet-FS, a popular benchmark for the few-shot learning task. ImageNet-FS contains 1,000 classes from the ImageNet Large Scale Visual Recognition Challenge 2012 [27], of which 389 belong to the base category and 611 to the novel category. 193 base classes and 300 novel classes are used for training and cross-validation, whereas the test phase is performed on the remaining 196 base categories and 311 novel classes. Base categories consist of around 1,280 training and 50 test images per class. The authors of KGTN also evaluated their solution on a larger dataset, ImageNet-6K, which contains 6,000 classes (of which 1,00 belong to the novel category). Unfortunately, we were unable to test KGTN-ens on this dataset, since it had not been made public nor available to us at the time of writing this paper.

Training

To enable a fair comparison, we used the same two-step training and evaluation procedures as in KGTN [2], which in turn draws on SGM [14]. Stochastic gradient descent (SGD) was used to train the model with a batch size of 256 (divided equally between base and novel classes), a momentum of 0.9, and a weight decay of 0.0005. The learning rate is initially set to 0.1 and divided by 30 every 30 epochs. In general, we used the same feature extractor and hyperparameters as in KGTN unless stated otherwise. Specifically, we use ResNet-50 – the same as in e.g. KGTN and SGM. Using the terminology from the ResNet paper [15], we use features from the output of the last convolutional block in the last stage; for ResNet-50, this is conv5_3, as this stage consists of three blocks.

Setup

All the experiments have been conducted on a single NVIDIA GeForce RTX 2080 Ti GPU. We used the code released by the authors of KGTN and modified it to support the KGTN-ens approach. PyTorch [25] was used to conduct the experiments.

Fig. 2 KGTN-ens (blue, 5 runs averaged) mean top-5 accuracy compared to KGTN (orange) and SGM with graph regularisation [29] (green). KGTN-ens uses the glove and hierarchy graphs combined with the max ensembling function. Horizontal lines indicate standard deviations (not available for SGM with graph regularisation)

Table 1 Qualitative results (only novel) for the KGTN-ens models trained with ResNet50, max ensembling method and inner product similarity function

Evaluation

Following previous work in few-shot learning, we report our evaluation results in terms of the top-5 accuracy on novel and all (base + novel) classes in the k-shot learning task, where \(k \in \{1,2,5, 10\}\) is the number of available training samples per novel category. Following [14] and [2], we repeat each experiment five times and report the averaged values of the top-5 accuracy. Table 2 shows the classification results compared with some of the recent state-of-the-art benchmarks, and Figure 2 presents the top-5 accuracy of the KGTN-ens model on ImageNet-FS. Of the possible combinations of the three sources of knowledge, the KGTN-ens model performed best with the combination of hierarchy and glove. Notably, it performed better than KGTN with either of these two sources of knowledge alone. Compared to KGTN (with inner product similarity and glove embeddings), the KGTN-ens model (inner product, max ensembling function, glove and hierarchy embeddings) achieved +0.63, +0.58, +0.43, and +0.26 pp. top-5 accuracy on novel classes for \(k \in \{ 1, 2, 5, 10\}\) respectively. The smaller the k, the higher the performance gain. It also beats the more recent graph-based framework proposed by [29] by +1.73/+1.18/+0.20 pp. top-5 accuracy on novel classes. For all classes, the KGTN-ens model achieved +0.26, +0.25, +0.32, and –0.04 pp. top-5 accuracy compared to the same KGTN model for \(k \in \{ 1, 2, 5, 10\}\) respectively. Qualitative results for the best model are presented in Tables 1 and 2.
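
For reference, the top-5 accuracy used throughout can be computed as in the following PyTorch sketch (tensor names are illustrative):

```python
import torch

def top5_accuracy(logits, targets):
    """Fraction of samples whose true class is among the 5 highest-scoring classes."""
    top5 = logits.topk(5, dim=1).indices              # (N, 5) predicted class indices
    hits = (top5 == targets.unsqueeze(1)).any(dim=1)  # True if the label appears in the top 5
    return hits.float().mean().item()
```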

Table 2 Top-5 accuracy on novel and all subsets on ImageNet-FS. All the methods used ResNet-50 as a feature extractor

4.3 Details, ablation studies, and discussion

This subsection provides more details on the KGTN-ens model and its ablation studies. We analyse the impact of the following factors on the performance of the KGTN-ens model: adjacency matrices, used embeddings, ensembling method, similarity function, and variance of the results (Table 3).

Table 3 Descriptive statistics about the adjacency matrices
Fig. 3 Adjacency matrix distributions, raw values (normal and log scales)
Fig. 4 Adjacency matrix distributions, processed values (normal and log scales)

Adjacency matrix analysis

Since the glove knowledge graph was the most effective for KGTN, we assume that wiki should roughly resemble it in terms of its distribution. In order to investigate the similarity between distributions, adjacency matrices have been created using pairwise Euclidean distances. While the distributions of glove and wiki are normal-like, the distribution for hierarchy is bimodal, and most of the distances take the largest values (Table 4, Figs. 3 and 4). To assess the correlation between adjacency matrices, Mantel tests have been performed. The values marked as processed were run through (18). Correlations of the processed matrices are visibly higher than those of the raw ones, especially for glove and wiki. The highest correlation has been observed between glove and wiki.
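
The Mantel test can be sketched as a simple permutation test over the off-diagonal entries of two matrices (NumPy; a minimal re-implementation for illustration, not necessarily the exact procedure behind the reported values):

```python
import numpy as np

def mantel_test(A, B, permutations=1000, seed=0):
    """Permutation-based Mantel test: Pearson correlation between two symmetric
    distance/adjacency matrices, with significance estimated by jointly permuting
    the rows and columns of B."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(A, k=1)                # off-diagonal upper triangle
    r_obs = np.corrcoef(A[iu], B[iu])[0, 1]
    count = 0
    for _ in range(permutations):
        p = rng.permutation(A.shape[0])
        Bp = B[np.ix_(p, p)]                         # permute rows and columns together
        if abs(np.corrcoef(A[iu], Bp[iu])[0, 1]) >= abs(r_obs):
            count += 1
    p_value = (count + 1) / (permutations + 1)
    return r_obs, p_value
```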

Importance of used knowledge graphs

Firstly, we analyse the influence of each KG separately (without ensembling) – that is, with the original KGTN architecture. Table 5 shows the results of the ablation studies on the three knowledge graphs. The hierarchy and glove knowledge graphs are the ones examined by [2], whereas the wiki knowledge graph is the one introduced in our experiments. In order to ensure that the advantage comes from the knowledge encoded in the KGs, Chen et al. showed that glove and hierarchy embeddings perform better than uniform (all correlations set to 1/K) and random (correlations drawn from a uniform distribution) distance matrices. Similarly, the usage of the wiki knowledge graph yielded generally better results (up to +3.44 pp. for 1-shot in the novel category) compared to the random and uniform cases, which constitutes a noticeable improvement. However, compared to glove and hierarchy, the wiki knowledge graph yields worse results – notably for low-shot scenarios. We hypothesise that the weaker performance of the wiki knowledge graph is due to the low quality of the embeddings, as some issues regarding their accuracy were previously reported (Footnote 3). Similarly, hierarchy is inherently biased, since categories sharing more common ancestors are promoted by the distance function. Therefore, the quality of hierarchy is questionable, especially in scenarios with a small number of classes or in corner cases with classes not sharing any ancestors. In general, we hypothesise that the final performance depends largely on the quality of the embeddings. In a scenario with a large number of knowledge graphs, there is also the question of how well these embeddings complement each other. This means that KGTN with optimal embeddings should perform no worse than KGTN-ens with a large number of suboptimal embeddings. In other words, the quality of the embeddings is perhaps more important than the number of knowledge graphs.

Table 4 Mantel test results
Table 5 Classification results for KGTN on different embeddings tested against different similarity functions
Table 6 Knowledge graph ensembling (sum), top-5 accuracy, inner product

Importance of the ensembling method

Table 6 presents results for the different ensembling strategies compared to the KGTN baseline, which can be treated as a KGTN-ens model with no ensembling. Mean ensembling gave mixed results compared to the baseline (\(+0.34\), \(-0.63\), \(+0.36\), \(-0.27\) pp. for novel classes and \(-1.45\), \(-1.41\), \(+0.14\), \(-0.17\) pp. for all classes, both groups for \(k \in \{1, 2, 5, 10\}\) respectively). However, the max ensembling strategy has been better in all the cases (\(+0.77\), \(+0.40\), \(+0.29\), \(+0.07\) pp. for novel classes and \(+0.24\), \(+0.18\), \(+0.19\), \(+0.06\) pp. for all classes). A possible explanation of this effect might stem from the winner-takes-all nature of the maximum function, which chooses the embedding most similar to the given prototype and rejects other, potentially improper, embeddings. At the same time, these improper embeddings still contribute to the overall formula in the mean ensembling function. However, research with a larger number of employed knowledge graphs has to be conducted to validate this hypothesis.

Variance of the results

Contrary to expectations, adding additional knowledge sources slightly increases the variance of the results in most cases (Table 7). A possible explanation of these results is the fact that KGTN-ens is not an ensembling technique in the typical sense of the word, but rather a way of choosing between the embeddings of the different knowledge sources. We report results for novel classes only, as the difference in variance is amplified among these (see also Fig. 2). No significant differences between the variance of mean and max ensembling have been found. The variance of the results for the baseline KGTN has been obtained using five runs of the original KGTN with glove embeddings.

Importance of similarity function

Table 5 includes data for ablation studies on KGTN with three different similarity functions: cosine similarity, inner product, and Pearson correlation. [2] analysed all of these for KGTN with glove embeddings, and in general the inner product showed the best performance. These conclusions can be extrapolated to the wiki graph, as the inner product usually turned out to be the most effective in terms of top-5 accuracy. Interestingly, Pearson correlation displayed the best performance for the 1-shot scenario with novel classes. Table 8 presents results for the different similarity functions used in KGTN-ens. While the combination of hierarchy and glove embeddings was usually the best for cosine similarity as well, the results are visibly worse compared to the inner product similarity function (e.g. a \(-3.16\) pp. difference in top-5 accuracy for the 1-shot scenario on novel classes). Notably, the combination of these two graphs with the cosine similarity function performed worse than KGTN based solely on glove embeddings (for example, a \(-2.39\) pp. difference in top-5 accuracy for the 1-shot scenario on novel classes).

Table 7 Standard deviations of top-5 accuracy
Table 8 Ablation on different similarity functions used in KGTN-ens, top-5 accuracy
Table 9 Means and standard deviations of class-wise entropy of predictions for KGTN-ens (ResNet50, inner product, max) with different combinations of embeddings (novel classes)
Table 10 Means and standard deviations of class-wise entropy of predictions for KGTN-ens (ResNet50, inner product, max) with different combinations of embeddings (all classes)

Importance of feature extractor

So far, we have reported results with ResNet-50 as the feature extractor, as this is the standard feature extractor in the few-shot learning literature. However, we also performed another set of experiments with ResNet-10 and ResNet-18. All of the results are compared to KGTN (with the glove KG). Similarly to ResNet-50, these models have been used and trained in the same way as in KGTN [2]. This also means that SGM without generation [14] serves here as the baseline, as both KGTN and KGTN-ens are based on it. The detailed results (averaged over 5 runs) for every experiment are provided in Tables 13 and 14 in the Appendix for ResNet-10 and ResNet-18 respectively. We tested the KGTN-ens model with different feature extractors, similarity functions (cosine similarity, inner product, Pearson correlation coefficient), and ensembling methods (max, mean) for the same 5 runs as in the main experiment. Since this gives 96 experiments per run, we also grouped the results by the used feature extractor, k, and subset (novel/all) in order to showcase the best model in each group. We present the averaged and maximum results in Tables 11 and 12 respectively.

For ResNet-10, we can observe that the KGTN-ens model slightly outperformed KGTN in more than half of the cases. Specifically, the best KGTN-ens model yielded \(-0.15\)/\(-0.00\)/\(+0.10\)/\(-0.03\) pp. (novel) and \(+0.02\)/\(+0.10\)/\(+0.14\)/\(+0.21\) pp. (all) for \(k \in \{1, 2, 5, 10\}\) respectively (averaged over 5 different runs). While the KGTN-ens model was better slightly more often, the differences are smaller than for ResNet-50. Since the variance of the results was higher for ResNet-10, we also report the maximum results (Table 12). Taking only the best result from 5 runs, KGTN-ens yielded \(-0.11\)/\(+0.23\)/\(+0.16\)/\(+0.23\) pp. and \(+0.03\)/\(+0.11\)/\(+0.37\)/\(+0.30\) pp. over KGTN for novel/all classes respectively. Interestingly, both KGTN and KGTN-ens failed to outperform the baseline for \(k=5\) on novel classes. Among the best KGTN-ens models, hierarchy + glove was the winning knowledge graph combination most of the time. Interestingly, cosine similarity was usually the best similarity function for ResNet-10. In terms of the ensembling method, mean yielded the best results for \(k \in \{1, 2\}\), whereas max was the best for \(k \in \{5, 10\}\) (except for one result on novel data). With regard to the ensembling method and similarity function, similar behaviour can be observed among the runs with maximum results. Comparing the best KGTN-ens models with KGTN for ResNet-18, the mean top-5 accuracy differences were \(-1.48\)/\(-0.52\)/\(-0.21\)/\(+0.08\) pp. and \(-0.70\)/\(+0.10\)/\(-0.03\)/\(+0.25\) pp. (\(-0.45\)/\(+0.13\)/\(-0.20\)/\(+0.22\) pp. and \(-1.56\)/\(-0.37\)/\(-0.29\)/\(+0.09\) pp. for the best run). This time KGTN-ens performed better only three times; especially the results for \(k=1\) are far below expectations. While the results for ResNet-18 look less favourable for the method we present, we report them for the sake of scientific integrity. Interestingly, for \(k=5\) and \(k=10\), the best KGTN-ens model used the wiki + glove knowledge graphs. In terms of the similarity function and ensembling method, inner product and max were the best in all cases. In general, a possible conclusion from these findings is that the effectiveness of the ensembling method and similarity function depends on the feature extractor.

Discussion on possible improvements

Obtaining good class boundaries is a challenging task in few-shot learning. At the same time, the diversity of the embeddings plays a vital role in ensemble learning scenarios. This paragraph discusses possible ways of improving the performance of KGTN-ens, drawn mostly from the ensemble learning literature. In our paper, we focused on KGTN-ens working with different knowledge graphs. However, the same knowledge graph could also be used for several KGTMs, and this is a very interesting idea to explore in the future. Since the overall variance of the model is non-negligible (especially for the smaller variants of ResNet), this is a case where e.g. bagging can be used. One can also consider introducing learnable weights acting both on the prototypes and on the features. Given that for some weights the optimal values might be 0, this also embraces the idea of feature selection. Another idea is to randomly select a subset of embeddings for each training epoch and then introduce e.g. Gaussian noise to the embeddings. To assess the diversity of the model (or a family of models – see below), uncertainty measures such as entropy can be used. We provide the averaged per-class entropy of KGTN-ens predictions with different combinations of embeddings in Tables 9 and 10; it reflects the top-5 accuracy. For future work, one can also calculate e.g. permutation importance to assess the influence of a particular KGTM on the final score.

5 Conclusion

In this work, we proposed KGTN-ens, which builds on KGTN and allows the incorporation of multiple knowledge sources in order to achieve better performance. We evaluated KGTN-ens on the ImageNet-FS dataset and showed that it outperforms KGTN in most of the tested settings. We also evaluated Wikidata embeddings in the same task and showed that they are not as effective as the other embeddings. We believe that the proposed approach can be used in other few-shot learning tasks, and we plan to test it in the future. Further work might also include an evaluation of the proposed approach on the ImageNet-6K dataset [2], which was not publicly available at the time of writing this article. A certain limitation of this study is that it might not scale well to extreme classification problems, since calculating pairwise distances between the nodes of large knowledge graphs requires quadratic memory.

Table 11 Best models (mean of 5 different runs) for each FE, type and k
Table 12 Best models (max of 5 runs) for each FE, type and k