1 Introduction

Current artificial intelligence techniques cannot rapidly generalize from a few examples. This common inability stems from the fact that most deep neural networks must be trained on large-scale data. In contrast, humans can learn new tasks quickly by utilizing what they learned in the past. Few-shot learning models try to fill this gap by learning how to learn from a limited number of examples. Few-shot learning is the problem of making predictions based on a few labeled examples. The goal of few-shot learning is not to recognize a fixed set of labels but to quickly adapt to new tasks with a small amount of training data. After training, the model can classify new data using only a few training examples.

Two novel few-shot learning techniques have recently emerged. The first is based on kernel methods and Gaussian processes [1,2,3]. These methods assume that a universal deep kernel has enough data to generalize well to unseen tasks without overfitting. The second technique makes use of Hypernetworks [4,5,6,7,8,9], which aggregate information from the support set and produce dedicated network weights for new tasks.

The above approaches give promising results but also have some limitations. Kernel-based methods are not flexible enough, since they use Gaussian processes on top of the models. Moreover, it is not trivial to use Gaussian processes for classification tasks. On the other hand, Hypernetworks must aggregate information from the support set, and it is hard to model the relation between classes as opposed to classical feature extraction.

This paper introduces a general framework that combines the Hypernetworks paradigm with kernel methods to realize a new strategy that mimics the human way of learning. First, we examine the entire support set and extract the information in order to distinguish objects of each class. Then, based on the relations between their features, we create the decision rules.

Kernel methods realize the first part of the process. For each of the few-shot tasks, we extract the features from the support set through the backbone architecture and calculate kernel values between them. Then we use a Hypernetwork architecture [3, 4], a neural network that takes the kernel representation and produces decision rules in the form of a classifier. In our framework, the Hypernetwork aggregates the information from the support set and produces the weights or adaptation parameters of a target model dedicated to the specific task, which then classifies the query set.

Fig. 1
General architecture of our framework. First, the examples from a support set are sorted according to the corresponding class labels and transformed by encoding network \(E(\cdot )\) to obtain the matrix of ordered embeddings of the support examples, \({\textbf{Z}}_S\). The low-dimensional representations stored in \({\textbf{Z}}_S\) are further used to compute kernel matrix \({\textbf{K}}_{S,S}\). The values of the kernel matrix are passed to the hypernetwork \(H(\cdot )\) that creates the parameters \(\varvec{\theta }_{\tau }\) for the target classification module \(T(\cdot )\). We identify two options for generating \(\varvec{\theta }_\tau \): (i) HyperShot—hypernetwork H generates \(\varvec{\theta }_{\tau }\) straightforwardly; (ii) BayesHyperShot—H generates parameters \(\mu _{\tau }\) and \(\sigma _{\tau }\) of a Gaussian distribution, from which \(\varvec{\theta }_{\tau }\) can then be sampled. The query image \({\textbf{x}}\) is processed by encoder \(E(\cdot )\), and the vector of kernel values \({\textbf{k}}_{{\textbf{x}},S}\) is calculated between query embedding \({\textbf{z}}_{{\textbf{x}}}\) and the corresponding representations of support examples, \({\textbf{Z}}_S\). The kernel vector \({\textbf{k}}_{{\textbf{x}},S}\) is further passed to target model \(T(\cdot )\) to obtain the probability distribution for the considered classes

The models we propose inherit the flexibility from Hypernetworks and the ability to learn the relation between objects from kernel-based methods.

We consider two alternative realizations of our framework: the classical one, which we dub HyperShot, and the Bayesian version, called BayesHyperShot. We start with the classical approach.

We perform an extensive experimental study of HyperShot by benchmarking it on various one-shot and few-shot image classification tasks. We find that HyperShot demonstrates high accuracy in all tasks, performing comparably or better than the other recently proposed methods. Moreover, HyperShot shows a strong ability to generalize, as evidenced by its performance on cross-domain classification tasks.

Unfortunately, HyperShot, like other few-shot algorithms, suffers from limited data size, which may result in drawbacks such as poorly quantified uncertainty. To address this problem, we extend our method by combining the Bayesian neural network and Hypernetwork paradigms, which results in a Bayesian version of our general framework, BayesHyperShot. In practice, the hypernetwork produces the parameters of a probability distribution over the weights. In this paper, we consider a Gaussian prior, but our framework can be used for modeling even more complex priors.

The contributions of this work are fourfold:

  • We propose a general framework that realizes the learning-to-learn paradigm by modeling learning rules that are not based on gradient optimization and can produce completely different decision strategies.

  • We propose a new approach to the few-shot learning problem that aggregates information from the support set with kernel methods and directly produces the weights of a neural network dedicated to classifying the query set.

  • We propose the HyperShot model, the classical realization of this framework, which combines the Hypernetworks paradigm with kernel methods to produce weights dedicated to each task.

  • We show that our model can be generalized to a Bayesian version, BayesHyperShot, which addresses the problem of poorly quantified uncertainty.

2 Hypernetworks for few-shot learning–a general framework

In this section, we present our general framework for utilizing the Hypernetworks and kernel-based methods for few-shot learning. In the beginning, we give a quick recap of the necessary background. Then, we introduce our framework by presenting the HyperShot. Finally, we show how to incorporate the Bayesian approach in this general framework and present the BayesHyperShot.

2.1 Background

Few-shot learning The terminology describing the few-shot learning setup is inconsistent because of conflicting definitions used in the literature. For a unified taxonomy, we refer the reader to [10, 11]. Here, we use the nomenclature derived from the meta-learning literature, which is the most prevalent at the time of writing. Let:

$$\begin{aligned} {\mathcal {S}} = \{ ({\textbf{x}}_l, {\textbf{y}}_l) \}_{l=1}^L \end{aligned}$$

be a support set of L input–output pairs with an equal class distribution. In the one-shot scenario, each class is represented by a single example, and \(L=K\), where K is the number of classes considered in the given task. In few-shot scenarios, each class usually has from 2 to 5 representatives in the support set \({\mathcal {S}}\). Let:

$$\begin{aligned} {\mathcal {Q}} = \{ ({\textbf{x}}_m, {\textbf{y}}_m) \}_{m=1}^M \end{aligned}$$

be a query set (sometimes referred to in the literature as a target set), with M examples, where M is typically one order of magnitude greater than K. For ease of notation, the support and query sets are grouped in a task \({\mathcal {T}} = \{{\mathcal {S}}, {\mathcal {Q}} \}\). During the training stage, the models for few-shot applications are fed by randomly selected examples from training set \({\mathcal {D}} = \{{\mathcal {T}}_n\}^N_{n=1}\), defined as a collection of such tasks.

During the inference stage, we consider task \({\mathcal {T}}_{*} = \{{\mathcal {S}}_{*}, {\mathcal {X}}_{*}\}\), where \({\mathcal {S}}_{*}\) is a support set with the known class values for a given task, and \({\mathcal {X}}_{*}\) is a set of query (unlabeled) inputs. The goal is to predict the class labels for query inputs \({\textbf{x}} \in {\mathcal {X}}_*\), assuming support set \({\mathcal {S}}_{*}\) and using the model trained on \({\mathcal {D}}\).
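The episodic setup described above can be illustrated with a small task sampler. The following is a minimal sketch, assuming the data are available as a list of (input, label) pairs; the function name `sample_task` and its parameters are hypothetical and are not taken from our implementation.

```python
import random
from collections import defaultdict


def sample_task(dataset, n_way, n_shot, n_query, rng):
    """Sample one task T = {S, Q} from a list of (x, y) pairs.

    Hypothetical helper: draws `n_way` classes, then `n_shot` support and
    `n_query` query examples per class, with no overlap between S and Q.
    """
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)

    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for new_label, cls in enumerate(classes):
        picked = rng.sample(by_class[cls], n_shot + n_query)
        # Labels are re-indexed to 0..n_way-1 inside each task.
        support += [(x, new_label) for x in picked[:n_shot]]
        query += [(x, new_label) for x in picked[n_shot:]]
    return support, query


# Toy dataset: 10 classes with 30 examples each; inputs are just string ids.
toy = [(f"img_{c}_{i}", c) for c in range(10) for i in range(30)]
S, Q = sample_task(toy, n_way=5, n_shot=1, n_query=16, rng=random.Random(0))
print(len(S), len(Q))  # 5 support examples (L = K) and 80 query examples
```

In this one-shot example the support set has exactly one example per class, so \(L=K=5\), while the query set is an order of magnitude larger.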

Hypernetwork In the canonical work [4], hypernetworks are defined as neural models that generate weights for a separate target network solving a specific task. The authors aim to reduce the number of trainable parameters by designing a hypernetwork with a smaller number of parameters than the target network. Making an analogy between hypernetworks and generative models, the authors of [12] use this mechanism to generate a diverse set of target networks approximating the same function.

2.2 HyperShot overview

We introduce HyperShot, a model that utilizes hypernetworks for few-shot problems. The main idea of the proposed approach is to predict the values of the parameters for a classification network that makes predictions on the query images given the information extracted from support examples for a given task. Thanks to this approach, we can switch the classifier’s parameters between completely different tasks based on the support set. The information about the current task is extracted from the support set using a parameterized kernel function that operates on the embedding space. As a result, the hypernetwork receives the relations among the support examples as input rather than the raw embedding values. Consequently, this approach is robust to embedding values of new tasks that lie far from the feature regions observed during training. The classification of the query image is also performed using the kernel values calculated with respect to the support set.

The architecture of HyperShot is provided in Fig. 1. We aim to predict the class distribution \(p({\textbf{y}} \mid S,{\textbf{x}}),\) given a query image \({\textbf{x}}\) and a set of support examples \(S = \{ ({\textbf{x}}_l, y_l) \}_{l=1}^K.\) First, all images from the support set are grouped by their corresponding class values. Next, each of the images \({\textbf{x}}_l\) from the support set is transformed using encoding network \(E(\cdot )\), which creates low-dimensional representations of the images, \(E({\textbf{x}}_l)={\textbf{z}}_l\). The constructed embeddings are sorted according to class labels and stored in the matrix \({\textbf{Z}}_S=[{\textbf{z}}_{\pi (1)}, \dots , {\textbf{z}}_{\pi (K)}]^\textrm{T}\), where \(\pi (\cdot )\) is a bijective function that satisfies \(y_{\pi (l)} \le y_{\pi (k)}\) for \(l \le k\).

In the next step, we calculate the kernel matrix \({\textbf{K}}_{S, S}\), for vector pairs stored in rows of \({\textbf{Z}}_S\). To achieve this, we use the parametrized kernel function \(k(\cdot , \cdot )\), and calculate \(k_{i,j}\) element of matrix \({\textbf{K}}_{S, S}\) in the following way:

$$\begin{aligned} k_{i,j} = k({\textbf{z}}_{\pi (i)}, {\textbf{z}}_{\pi (j)}). \end{aligned}$$

The kernel matrix \({\textbf{K}}_{S, S}\) represents the extracted information about the relations between support examples for a given task. The matrix \({\textbf{K}}_{S, S}\) is then flattened into a vector and passed as input to the hypernetwork \(H(\cdot )\). The role of the hypernetwork is to provide the parameters \(\varvec{\theta }_T\) of target model \(T(\cdot )\) responsible for the classification of the query object. Thanks to this approach, we can switch between the parameters for entirely different tasks without following a gradient-controlled trajectory, as reference approaches such as MAML do.
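The construction of the hypernetwork input can be sketched as follows. This is an illustrative example only, assuming the identity transformation inside a dot-product kernel and plain Python lists in place of tensors; the helper name `kernel_matrix` is hypothetical.

```python
def dot(u, v):
    """Dot-product kernel with the identity transformation f(z) = z."""
    return sum(a * b for a, b in zip(u, v))


def kernel_matrix(embeddings, labels, kernel=dot):
    """Return (K_SS flattened row by row, ordering pi) for one support set."""
    # pi sorts support indices by class label; sorted() is stable, so ties
    # keep their original relative order.
    pi = sorted(range(len(labels)), key=lambda l: labels[l])
    z_sorted = [embeddings[l] for l in pi]
    k_ss = [[kernel(zi, zj) for zj in z_sorted] for zi in z_sorted]
    flat = [value for row in k_ss for value in row]
    return flat, pi


# Three one-shot support embeddings (2-D for readability) with labels 2, 0, 1.
z = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y = [2, 0, 1]
flat, pi = kernel_matrix(z, y)
print(pi)    # [1, 2, 0]
print(flat)  # 9 values: the 3 x 3 kernel matrix, row by row
```

The flattened vector has one entry per pair of support examples, so its length depends only on the size of the support set and not on the embedding dimension.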

We base the architecture of the hypernetwork on a multilayer perceptron (MLP). Specifically, the hypernetwork first processes the kernel matrix \({\textbf{K}}_{S, S}\) through an MLP (dubbed the “neck”), whose output is then processed by “heads”, separate MLPs dedicated to producing the weights of each target network layer (see Fig. 4 for a visualization). We summarize the parameters of the hypernetworks and target networks used in our experiments in Table 8.
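The neck-and-heads layout can be sketched in a few lines. The example below is a simplified illustration under several assumptions that are not taken from our experiments: the neck and each head are reduced to a single linear layer, the weights are initialized uniformly at random, and all class and function names are hypothetical.

```python
import random


def linear(x, weights, bias):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(n_in, n_out, rng):
    """Small random weights; the initialization scheme is an assumption."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class HyperNetwork:
    """Neck followed by one head per target-network layer."""

    def __init__(self, n_kernel_values, hidden, target_shapes, rng):
        self.neck = init_linear(n_kernel_values, hidden, rng)
        self.target_shapes = target_shapes
        # Each head emits the weights and biases of one target layer,
        # flattened into a vector of size n_out * n_in + n_out.
        self.heads = [init_linear(hidden, n_out * n_in + n_out, rng)
                      for n_in, n_out in target_shapes]

    def __call__(self, flat_kernel):
        h = relu(linear(flat_kernel, *self.neck))
        theta = []
        for (n_in, n_out), head in zip(self.target_shapes, self.heads):
            out = linear(h, *head)
            weights = [out[r * n_in:(r + 1) * n_in] for r in range(n_out)]
            bias = out[n_out * n_in:]
            theta.append((weights, bias))
        return theta


# 5-way one-shot task: K_SS has 5 * 5 entries; the target network maps the
# 5 query kernel values to 5 class logits through one hidden layer of 8 units.
rng = random.Random(0)
hyper = HyperNetwork(25, hidden=16, target_shapes=[(5, 8), (8, 5)], rng=rng)
theta = hyper([1.0 if i % 6 == 0 else 0.2 for i in range(25)])
print([(len(w), len(w[0]), len(b)) for w, b in theta])  # [(8, 5, 8), (5, 8, 5)]
```

Sharing the neck lets all heads condition on the same task summary, while separate heads keep the output size of each one tied to a single target layer.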

The query image \({\textbf{x}}\) is classified in the following manner. First, the input image is transformed to a low-dimensional feature representation \({\textbf{z}}_{{\textbf{x}}} = E({\textbf{x}})\) by the encoder. Next, the kernel vector \({\textbf{k}}_{{\textbf{x}},S}\) between the query embedding and sorted support vectors \({\textbf{Z}}_S\) is calculated in the following way:

$$\begin{aligned} {\textbf{k}}_{{\textbf{x}}, S} = [k({\textbf{z}}_{{\textbf{x}}}, {\textbf{z}}_{\pi (1)}),\dots ,k({\textbf{z}}_{{\textbf{x}}}, {\textbf{z}}_{\pi (K)})]^{\textrm{T}}. \end{aligned}$$

The vector \({\textbf{k}}_{{\textbf{x}}, S}\) is then passed to the target model \(T(\cdot )\), which uses the parameters \(\varvec{\theta }_{T}\) returned by hypernetwork \(H(\cdot )\). The target model returns the probability distribution \(p({\textbf{y}} \mid S, {\textbf{x}})\) for each class considered in the task.
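The query path can be illustrated with a minimal sketch. It assumes a cosine kernel with the identity transformation and a target model consisting of a single linear layer followed by a softmax; the parameters are hand-picked for the example rather than produced by a hypernetwork, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z_query, z_support_sorted, theta, kernel=cosine):
    """Return p(y | S, x) for one query embedding.

    `theta` = (weights, bias) of a single linear target layer. In HyperShot
    it would be produced by the hypernetwork; here it is supplied by hand.
    """
    k_xs = [kernel(z_query, z) for z in z_support_sorted]
    weights, bias = theta
    logits = [sum(w * k for w, k in zip(row, k_xs)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Two-class one-shot task with 2-D embeddings.
support = [[1.0, 0.0], [0.0, 1.0]]
theta = ([[5.0, 0.0], [0.0, 5.0]], [0.0, 0.0])  # hand-picked, not learned
probs = classify([0.9, 0.1], support, theta)
print(probs)  # the first class receives the larger probability
```

Note that the target model never sees the query embedding itself, only its kernel values with respect to the ordered support embeddings.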

The function \(\pi (\cdot )\) enforces some ordering of the input delivered to \(T(\cdot )\). In practice, any other permutation of the classes can be used for the input vector \({\textbf{k}}_{{\textbf{x}}, S}\). In such a case, the same permutation should be applied to the rows and columns of \({\textbf{K}}_{S,S}\). As a consequence, the hypernetwork is able to produce dedicated target parameters for each of the possible permutations. Although this approach does not guarantee permutation invariance in real-life scenarios, the dedicated parameters produced for any ordering of the input mean that it should hold in most cases.
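The consistency requirement between the two kernel inputs can be made explicit with a short sketch. The helper `permute_task` is hypothetical and the kernel values are made up for the illustration.

```python
def permute_task(k_ss, k_xs, perm):
    """Apply one class permutation to K_SS (rows and columns) and to k_xS."""
    k_ss_perm = [[k_ss[i][j] for j in perm] for i in perm]
    k_xs_perm = [k_xs[i] for i in perm]
    return k_ss_perm, k_xs_perm


k_ss = [[1.0, 0.2, 0.1],
        [0.2, 1.0, 0.7],
        [0.1, 0.7, 1.0]]
k_xs = [0.9, 0.3, 0.2]
k_ss_perm, k_xs_perm = permute_task(k_ss, k_xs, perm=[2, 0, 1])
print(k_ss_perm[0])  # [1.0, 0.1, 0.7]
print(k_xs_perm)     # [0.2, 0.9, 0.3]
```

Because rows and columns are permuted together, the permuted matrix stays symmetric and entry \(i\) of the query vector still refers to the same support example as row \(i\) of the matrix.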

Fig. 2
Simple 2D example illustrating the application of cosine kernel for HyperShot. We consider the two support examples from different classes represented by vectors \(\mathbf {f_1}\) and \(\mathbf {f_2}\). For this simple scenario, the input of hypernetwork is represented simply by the cosine of \(\alpha \), which is an angle between vectors \({\textbf{f}}_1\) and \({\textbf{f}}_2\). We aim at classifying the query example \({\textbf{x}}\) represented by a vector \({\textbf{f}}_{{\textbf{x}}}\). Considering our approach, we deliver to the target network \(T(\cdot )\) the cosine values of angles between first (\(\alpha _{{\textbf{x}},1}\)) and second (\(\alpha _{{\textbf{x}},2}\)) support vectors and classify the query example using the weights \(\varvec{\theta }_{T}\) created by hypernetwork \(H(\cdot )\) from \(\cos {\alpha }\) (remaining components on the diagonal of \({\textbf{K}}_{S,S}\) are constant for cosine kernel)

2.3 Kernel function

One of the key components of our approach is a kernel function \(k(\cdot , \cdot )\). In this work, we consider the dot product of the transformed vectors given by:

$$\begin{aligned} k({\textbf{z}}_1, {\textbf{z}}_2)={\textbf{f}}({\textbf{z}}_1)^{\textrm{T}} {\textbf{f}}({\textbf{z}}_2), \end{aligned}$$

where \({\textbf{f}}(\cdot )\) can be a parametrized transformation function, represented by an MLP model, or simply an identity operation, \({\textbf{f}}({\textbf{z}})={\textbf{z}}\). In Euclidean space, this criterion can be expressed as \(k({\textbf{z}}_1, {\textbf{z}}_2)=||{\textbf{f}}({\textbf{z}}_1)|| \cdot ||{\textbf{f}}({\textbf{z}}_2)|| \cos {\alpha }\), where \(\alpha \) is the angle between vectors \({\textbf{f}}({\textbf{z}}_1)\) and \({\textbf{f}}({\textbf{z}}_2)\). The main feature of this function is that it takes the vectors’ norms into account, which can be problematic for some tasks that are outliers with respect to the representations created by \({\textbf{f}}(\cdot )\). Therefore, we also consider in our experiments the cosine kernel function given by:

$$\begin{aligned} k_{c}({\textbf{z}}_1, {\textbf{z}}_2)=\frac{{\textbf{f}}({\textbf{z}}_1)^{\textrm{T}}{\textbf{f}}({\textbf{z}}_2)}{||{\textbf{f}}({\textbf{z}}_1)|| \cdot ||{\textbf{f}}({\textbf{z}}_2)||}, \end{aligned}$$

which represents the normalized version of the dot product. Considering the geometrical representation, \(k_{c}({\textbf{z}}_1, {\textbf{z}}_2)\) can be expressed as \(\cos {\alpha }\) (see the example given in Fig. 2). In this example, the support set is represented by two examples from different classes, \({\textbf{f}}_1\) and \({\textbf{f}}_2\). The target model parameters \(\varvec{\theta }_{T}\) are created based only on the cosine value of the angle between vectors \({\textbf{f}}_1\) and \({\textbf{f}}_2\). During the classification stage, the query example is represented by \({\textbf{f}}_{{\textbf{x}}}\), and the classification is applied on the cosine values of angles between \({\textbf{f}}_{{\textbf{x}}}\) and \({\textbf{f}}_1\), and \({\textbf{f}}_{{\textbf{x}}}\) and \({\textbf{f}}_2\), respectively.
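The difference between the two kernels can be checked numerically. The sketch below assumes the identity transformation as the default for \({\textbf{f}}\) and nonzero embeddings; the function names are illustrative.

```python
import math


def dot_kernel(z1, z2, f=lambda z: z):
    a, b = f(z1), f(z2)
    return sum(x * y for x, y in zip(a, b))


def cosine_kernel(z1, z2, f=lambda z: z):
    a, b = f(z1), f(z2)
    # Assumes both transformed vectors are nonzero.
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norms


z1, z2 = [3.0, 4.0], [4.0, 3.0]
scaled = [10.0 * v for v in z1]
print(dot_kernel(z1, z2), dot_kernel(scaled, z2))        # 24.0 240.0
print(cosine_kernel(z1, z2), cosine_kernel(scaled, z2))  # both close to 0.96
```

Scaling one embedding by a factor of ten changes the dot-product kernel by the same factor but leaves the cosine kernel unchanged, which is the property that motivates the cosine variant for tasks with unusual embedding norms.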

figure a

2.4 Training and prediction

The training procedure assumes the following parametrization of the model components. The encoder \(E:=E_{\varvec{\theta }_E}\) is parametrized by \(\varvec{\theta }_E\), the hypernetwork \(H:=H_{\varvec{\theta }_H}\) by \(\varvec{\theta }_H\) and the kernel function k by \(\varvec{\theta }_k\). We assume that training set \({\mathcal {D}}\) is represented by tasks \({\mathcal {T}}_i\) composed of support \({\mathcal {S}}_i\) and query \({\mathcal {Q}}_i\) examples. The training is performed by optimizing the cross-entropy criterion:

$$\begin{aligned} L = \sum _{{\mathcal {T}}_i \in {\mathcal {D}}} l_{{\mathcal {T}}_i} = -\sum _{{\mathcal {T}}_i \in {\mathcal {D}}} \sum _{m=1}^M \sum _{k=1}^K y_{i, m}^k \log p(y_{i, m}^k \mid {\mathcal {S}}_i, {\textbf{x}}_{i, m}), \end{aligned}$$

where \(({\textbf{x}}_{i,m}, {\textbf{y}}_{i, m})\) are examples from the query set \({\mathcal {Q}}_i=\{({\textbf{x}}_{i,m}, {\textbf{y}}_{i, m})\}_{m=1}^M\), and \(y_{i,m}^k\) denotes the k-th component of the one-hot label \({\textbf{y}}_{i,m}\). The distribution for currently considered classes \(p({\textbf{y}} \mid {\mathcal {S}}, {\textbf{x}})\) is returned by target network T of HyperShot. During the training, we jointly optimize the parameters \(\varvec{\theta }_H\), \(\varvec{\theta }_k\) and \(\varvec{\theta }_E\), minimizing the L loss.
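For a single task, the cross-entropy term can be computed as in the following sketch. It assumes integer class labels in place of one-hot vectors, which reduces the inner sum over classes to a single term per query example; the function name `task_loss` is hypothetical.

```python
import math


def task_loss(probabilities, labels):
    """Cross-entropy summed over the M query examples of one task.

    probabilities[m][k] is p(y = k | S, x_m); labels[m] is the true class,
    so the one-hot sum over k keeps only the term of the true class.
    """
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels))


probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [0, 1]
print(task_loss(probs, labels))  # -(log 0.7 + log 0.8), about 0.58
```

The full training loss is the sum of this quantity over all tasks in \({\mathcal {D}}\).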

During the inference stage, we consider the task \({\mathcal {T}}_*\), composed of a set of labeled support examples \({\mathcal {S}}_*\) and a set of unlabeled query examples represented by input values \({\mathcal {X}}_*\) that the model should classify. We can simply take the probability values \(p({\textbf{y}} \mid {\mathcal {S}}_{*}, {\textbf{x}})\) assuming the given support set \({\mathcal {S}}_*\) and single query observation \({\textbf{x}}\) from \({\mathcal {X}}_*\), using the model with trained parameters \(\varvec{\theta }_H\), \(\varvec{\theta }_k\), and \(\varvec{\theta }_E\). However, we observe that slightly better results are obtained when the model’s parameters are adapted on the considered task. Since we do not have access to labels for query examples, we imitate the query set for this task simply by taking support examples and creating the adaptation task \({\mathcal {T}}_i=\{{\mathcal {S}}_*,{\mathcal {S}}_*\}\), and we update the parameters of the model using several gradient iterations. A detailed presentation of the training and prediction procedures is provided in Algorithm 1.

2.5 Adaptation to few-shot scenarios

The proposed approach uses the ordering function \(\pi (\cdot )\) that keeps the consistency between support kernel matrix \({\textbf{K}}_{S,S}\) and the vector of kernel values \({\textbf{k}}_{{\textbf{x}}, S}\) for query example \({\textbf{x}}\). For few-shot scenarios, each class has more than one representative in the support set. As a consequence, there are various possibilities to order the feature vectors in the support set inside the considered class. To eliminate this issue, we follow [8] and propose to apply the aggregation function to the embeddings \({\textbf{z}}\) considering the support examples from the same class. Thanks to this approach, the kernel matrix is calculated based on the aggregated values of the latent space of encoding network E, making our approach independent of the ordering among the embeddings from the same class. In the ablation studies, we compare the mean aggregation operation (averaged) against simple class-wise concatenation of the embeddings (fine-grained).
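The two aggregation options can be contrasted with a small sketch. The helper `aggregate` is hypothetical, and the fine-grained branch implements plain class-wise concatenation as one possible reading of that variant.

```python
def aggregate(embeddings, labels, mode="averaged"):
    """Return one vector per class, ordered by class label."""
    per_class = {}
    for z, y in zip(embeddings, labels):
        per_class.setdefault(y, []).append(z)
    out = []
    for y in sorted(per_class):
        group = per_class[y]
        if mode == "averaged":
            # Element-wise mean over the support embeddings of the class.
            out.append([sum(col) / len(group) for col in zip(*group)])
        else:
            # "fine-grained": concatenate the embeddings of the class.
            out.append([v for z in group for v in z])
    return out


# 2-way 2-shot support set with 2-D embeddings.
z = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [0.0, 2.0]]
y = [0, 0, 1, 1]
print(aggregate(z, y))                  # [[2.0, 1.0], [0.0, 3.0]]
print(aggregate(z, y, "fine-grained"))  # [[1.0, 0.0, 3.0, 2.0], [0.0, 4.0, 0.0, 2.0]]
```

The averaged output does not change when the examples of a class are reordered, whereas the concatenated output does, which is the ordering issue discussed above.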

2.6 Extension to Bayesian setting—BayesHyperShot

Finally, we introduce the Bayesian extension of HyperShot, where the distribution over the parameters is represented by Gaussian prior:

$$\begin{aligned} p(\varvec{\theta }_{\mathcal {T}}) = {\mathcal {N}}(\varvec{\theta }_{\mathcal {T}} \mid 0, I). \end{aligned}$$

Thanks to that assumption, the target model \({\mathcal {T}}\) becomes probabilistic and can be used as a Bayesian network during inference.

To achieve that, we propose to use an amortized variational posterior:

$$\begin{aligned} q(\varvec{\theta }_{\mathcal {T}} \mid {\mathcal {S}}_i, \varvec{\theta }_{H}) = {\mathcal {N}}(\varvec{\theta }_{\mathcal {T}} \mid \mu _{\varvec{\theta }_{H}}({\mathcal {S}}_i), \sigma _{\varvec{\theta }_{H}}({\mathcal {S}}_i)), \end{aligned}$$

where the Gaussian parameters \(\mu ({\mathcal {S}}_i)\) and \(\sigma ({\mathcal {S}}_i)\) are returned by hypernetwork \(H_{\varvec{\theta }_{H}}\) for a given support set \({\mathcal {S}}_i\), i.e., \((\mu ({\mathcal {S}}_i), \sigma ({\mathcal {S}}_i)) = H_{\varvec{\theta }_{H}}( {\mathcal {S}}_i, \varvec{\theta }_{H} )\). Compared to the basic HyperShot model (option (i) in Fig. 1), which predicts deterministic target weights, the proposed extension delivers a distribution over parameters for a given support set (option (ii) in Fig. 1).
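Drawing target parameters from such a posterior can be sketched as follows. The example assumes a diagonal Gaussian in which \(\sigma \) is interpreted as a vector of standard deviations, and it uses the common reparameterization \(\varvec{\theta } = \mu + \sigma \epsilon \) with \(\epsilon \sim {\mathcal {N}}(0, I)\); both choices and the function name are assumptions made for the illustration.

```python
import random


def sample_theta(mu, sigma, rng):
    """Draw theta ~ N(mu, diag(sigma^2)) as mu + sigma * eps, eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)
mu, sigma = [0.5, -1.0, 2.0], [0.1, 0.1, 0.1]
samples = [sample_theta(mu, sigma, rng) for _ in range(3)]  # P = 3 draws
print(samples)
# With sigma = 0 every draw equals mu, which recovers a deterministic target.
print(sample_theta(mu, [0.0, 0.0, 0.0], rng))  # [0.5, -1.0, 2.0]
```

Writing the sample as a deterministic function of \(\mu \), \(\sigma \), and independent noise is what would allow gradients to reach the hypernetwork outputs in an automatic differentiation framework.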

We train the BayesHyperShot using the procedure given by Algorithm 1 with the modified training objective, given by:

$$\begin{aligned} {\mathcal {L}}_{B} = \sum _{{\mathcal {T}}_i \in {\mathcal {D}}} \left[ \frac{1}{P} \sum _{p=1}^{P} \left[ l_{{\mathcal {T}}_i}( \varvec{\theta }_{{\mathcal {T}}_i}^p) + \gamma KL(q(\varvec{\theta }_{{\mathcal {T}}_{i}}^p \mid {\mathcal {S}}_i, \varvec{\theta }_{H}) \mid {\mathcal {N}} (0, I) ) \right] \right] , \end{aligned}$$

where \(\varvec{\theta }_{{\mathcal {T}}_{i}}^1,\dots ,\varvec{\theta }_{{\mathcal {T}}_{i}}^P \sim q(\varvec{\theta }_{\mathcal {T}} \mid {\mathcal {S}}_i, \varvec{\theta }_{H})\) are the target parameters sampled from the variational posterior modeled by the hypernetwork, \( l_{{\mathcal {T}}_i}( \varvec{\theta }_{{\mathcal {T}}_i}^p)\) is the cross-entropy loss calculated using the target model with weights \(\varvec{\theta }_{{\mathcal {T}}_{i}}^p\) and \(KL(\cdot ,\cdot )\) is the Kullback–Leibler divergence between the posterior and the standard Gaussian. In order to stabilize the training procedure, we use a hyperparameter \(\gamma \) that controls the trade-off between the cross-entropy loss and the regularization term. We apply an annealing scheme [13] in which \(\gamma \) grows from zero to a fixed constant during training. The final value \(\gamma _\text {max}\) is a hyperparameter of the model.
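The regularization term and its weight can be illustrated with a short sketch. It assumes a diagonal Gaussian posterior with standard deviations \(\sigma \), for which the divergence from the standard Gaussian has a closed form, and a linear annealing schedule; the linear shape, the warm-up length, and the function names are assumptions made for the example.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def gamma_schedule(step, warmup_steps, gamma_max):
    """Linear annealing from 0 to gamma_max (the linear shape is an assumption)."""
    return gamma_max * min(1.0, step / warmup_steps)


print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0: posterior equals prior
print(kl_to_standard_normal([1.0, 0.0], [1.0, 1.0]))  # 0.5
print(gamma_schedule(50, warmup_steps=100, gamma_max=0.01))
```

Early in training the weight is close to zero, so the model is driven mainly by the cross-entropy term; the pull towards the prior is introduced gradually.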

During the inference stage, we sample the set of target parameters \(\varvec{\theta }_{{\mathcal {T}}_{i}}^1,\dots ,\varvec{\theta }_{{\mathcal {T}}_{i}}^P \sim q(\varvec{\theta }_{\mathcal {T}} \mid {\mathcal {S}}_i, \varvec{\theta }_{H})\). The final prediction is made simply by averaging over the sampled target models, \(\frac{1}{P} \sum _{p=1}^{P} p({\textbf{y}} \mid {\mathcal {S}}_{*}, {\textbf{x}},\varvec{\hat{\theta }}_H, \varvec{\hat{\theta }}_k, \varvec{\hat{\theta }}_E, \varvec{\theta }_{{\mathcal {T}}_{i}}^p)\).
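The averaging step can be written in a few lines. The sketch below assumes that the class probabilities of each sampled target model have already been computed; the numbers are made up for the illustration and the function name is hypothetical.

```python
def average_predictions(per_sample_probs):
    """Average class probabilities over P sampled target models."""
    n_samples = len(per_sample_probs)
    return [sum(col) / n_samples for col in zip(*per_sample_probs)]


# Class probabilities for one query from P = 3 sampled target networks.
sampled = [[0.6, 0.3, 0.1],
           [0.5, 0.4, 0.1],
           [0.7, 0.2, 0.1]]
mean_probs = average_predictions(sampled)
print(mean_probs)  # approximately [0.6, 0.3, 0.1]
```

Since each row is a valid distribution, the average is also a valid distribution over the classes.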

3 Related work

In recent years, various meta-learning methods [14,15,16] have been proposed to tackle the problem of few-shot learning. The various meta-learning architectures for few-shot learning can be roughly categorized into several groups, which we below divide into non-Bayesian and Bayesian families of approaches.

3.1 Non-Bayesian approaches

In non-Bayesian few-shot learning, the goal is typically to learn a set of parameters that can be used to classify new examples based on a limited amount of training data.

Transfer learning [17] is a simple yet effective baseline procedure for few-shot learning which consists of pre-training the neural network and a classifier on all of the classes available during meta-training. During meta-validation, the classifier is then fine-tuned to the novel tasks. In [10], the authors proposed Baseline++, an extension of this idea that uses cosine distance between the examples.

Metric-based methods meta-learn a deep representation with a metric in feature space, such that examples from the support and query sets that share a class lie close to each other in that space. Some of the earliest works exploring this notion are matching networks [18] and prototypical networks [19], which form prototypes based on embeddings of the examples from the support set in the learned feature space and classify the query set based on the distance to those prototypes. Numerous subsequent works aim to improve the expressiveness of the prototypes through various techniques. Oreshkin et al. [20] achieve this by conditioning the network on specific tasks, thus making the learned space task-dependent. Hu et al. [21] transform embeddings of support and query examples in the feature space to make their distributions closer to Gaussian. Sung et al. [22] propose Relation Nets, which learn the metric function instead of using a fixed one, such as Euclidean or cosine distance.

Optimization-based methods follow the idea of an optimization process over support set within the meta-learning framework like MetaOptNet [23], Meta-SGD [24] or model-agnostic meta-learning (MAML) [25] and its extensions [26,27,28,29,30,31]. Those techniques aim to train general models, which can adapt their parameters to the support set at hand in a small number of gradient steps. Similar to such techniques, HyperShot (the classical realization of our framework) also aims to produce task-specific models but utilizes a hypernetwork instead of optimization to achieve that goal.

Hypernetworks-based methods [4] have been proposed as a solution to few-shot learning problems in a number of works but have not been researched as widely as the approaches mentioned above. Several works proposed variants of hypernetworks that predict a shallow classifier’s parameters given the support examples [5, 32, 33]. More recently, the authors of [7,8,9] explored generating all of the parameters of the target network with a transformer-based hypernetwork, but found that for larger target networks, it is sufficient to generate only the parameters of the final classification layer. A particularly effective approach is to use transformer-based hypernetworks as set-to-set functions which make the generated classifier more discriminative [6]. A key characteristic of the above approaches is that during inference, the hypernetwork predicts weights responsible for classifying each class independently, based solely on the examples of that class from the support set. This property makes such solutions agnostic to the number of classes in a task, which is useful in practical applications. However, it also means that the hypernetwork does not take advantage of the inter-class differences in the task at hand.

In contrast, models in our framework (in particular HyperShot) exploit those differences by utilizing kernels, which helps improve their performance.

3.2 Bayesian approaches

Many few-shot learning methods utilize a Bayesian approach to estimate the parameters of a model. We group them into three categories:

Bayesian optimization-based methods reformulate MAML as a hierarchical Bayesian model [34,35,36,37,38,39]. In contrast to this group of methods, BayesHyperShot does not utilize a bi-level optimization scheme, such as that of MAML. Moreover, we focus on the adaptation of the target network’s weights, not on the optimization process itself.

Probabilistic weight generation methods focus on predicting a distribution over the parameters suitable for the given task [40]. Similarly, our BayesHyperShot also predicts a probability distribution over the parameters of the target network performing the few-shot classification. The key difference is that in BayesHyperShot, the target network combines such an approach with a kernel mechanism.

Gaussian processes-based methods [41] possess many properties useful in few-shot learning, such as natural robustness to limited amounts of data and the ability to estimate uncertainty. In [1], Gaussian processes combined with meta-learned deep kernels were demonstrated to be a suitable tool for few-shot regression and classification, an approach dubbed deep kernel transfer (DKT). The assumption that such a universal deep kernel has enough data to generalize well to unseen tasks has been challenged in subsequent works. The authors of [3] introduced a technique of learning dense Gaussian processes by inducing variables. This approach achieves substantial performance improvement over the alternative methods. Similarly, the Bayesian version of our model (BayesHyperShot) also depends on learning a model that estimates the parameters of task-specific functions. However, BayesHyperShot employs a hypernetwork instead of a Gaussian process to achieve that goal.

4 Experiments

In the typical few-shot learning setting, making a valuable and fair comparison between proposed models is often complicated because of the existence of significant differences in architectures and implementations of known methods. In order to limit the influence of the deeper backbone (feature extractor) architectures, we follow the unified procedure proposed by [10].

In this section, we describe the experimental analysis and performance of the proposed methods—HyperShot and BayesHyperShot—in a large variety of few-shot benchmarks. Specifically, we consider both classification (see Sect. 4.1) and cross-domain adaptation (see Sect. 4.2) tasks. Whereas the classification problems are focused on the most typical few-shot applications, the latter cross-domain benchmarks check the ability of the models to adapt to out-of-distribution tasks. We compare the classical realization of our framework (HyperShot) against the non-Bayesian approaches and the Bayesian version (BayesHyperShot) against Bayesian approaches only. We limit our comparison to models designed for the standard inductive few-shot setting. Notably, we exclude models which utilize anything besides the support set for fitting to the task at hand, such as methods that are transductive [42] or which utilize additional unlabeled data [43]. Additionally, in Sect. 4.3 we conduct an experiment showing the uncertainty estimation via BayesHyperShot. Finally, we perform an ablation study of the possible adaptation procedures of HyperShot to few-shot scenarios, as well as architectural choices—presented in Sect. 4.4.

Table 1 Classification accuracy results for the non-Bayesian approaches

In all of the reported experiments, the tasks consist of 5 classes (5 ways) and 1 or 5 support examples (1 or 5 shots). Unless indicated otherwise, all compared models use a known and widely utilized backbone consisting of four convolutional layers (each consisting of a 2D convolution, a batch-norm layer, and a ReLU nonlinearity; each layer consists of 64 channels) [10] and have been trained from scratch.

We report the performance of two versions of our general framework—classical (HyperShot) and Bayesian (BayesHyperShot). Note that each of them has two variants:

  • HyperShot/BayesHyperShot: models generated by the hypernetworks for each task.

  • HyperShot/BayesHyperShot + adaptation: models generated by hypernetworks adapted to the support examples of each task for 10 training steps.

In all cases, we observe a modest performance boost thanks to adapting the hypernetwork. Comprehensive details and hyperparameters for each training procedure are reported in Appendices B and C.

4.1 Classification

First, we consider a classical few-shot learning scenario, where all the classification tasks (both training and inference) come from the same dataset. The main aim of the proposed classification experiments is to assess the ability of the few-shot models to adapt to never-seen tasks from the same data distribution.

We benchmark the performance of our models and other methods on two challenging and widely considered datasets: Caltech-USCD Birds (CUB) [44] and mini-ImageNet [45]. The following experiments are in the most popular setting, 5 ways, consisting of 5 random classes. In all experiments, the query set of each task consists of 16 samples for each class (80 in total).
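The task construction described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `sample_episode`, the toy dataset, and the item identifiers are hypothetical.

```python
import random


def sample_episode(class_to_items, n_way=5, n_shot=1, n_query=16, seed=None):
    """Draw one few-shot task with disjoint support and query items per class."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query never overlap.
        items = rng.sample(class_to_items[cls], n_shot + n_query)
        support += [(item, label) for item in items[:n_shot]]
        query += [(item, label) for item in items[n_shot:]]
    return support, query


# Toy dataset: 20 hypothetical classes with 30 item ids each.
dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
support, query = sample_episode(dataset, n_way=5, n_shot=5, n_query=16, seed=0)
print(len(support), len(query))  # 25 80
```

Labels are re-indexed from 0 to 4 inside each task, so the classifier never sees the global class identity, only the role a class plays in the current episode.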

HyperShot We start with a non-Bayesian perspective and compare HyperShot to a vast pool of state-of-the-art algorithms, including canonical methods (such as matching networks [18], prototypical networks [19], MAML [25], and its extensions) as well as the recently popular Unicorn-MAML [29] and PAMELA [46].

We consider the more challenging one-shot classification task, as well as the five-shot setting and report the results in Table 1.

In the one-shot scenario, HyperShot achieves the second-best accuracy on the CUB dataset (\(66.13\%\) with adaptation, \(65.27\%\) without), outperforming every other model except FEAT [6] (\(68.87 \%\)). On the mini-ImageNet dataset, our approach is among the top approaches (\(53.18\%\)), slightly behind FEAT [6] (\(55.15\%\)). In the five-shot scenario, HyperShot is the second-best model, achieving \(80.07\%\) on CUB and \(69.62\%\) on mini-ImageNet, whereas the best model, FEAT [6], achieves \(82.90\%\) and \(71.61\%\) on these datasets, respectively.

The obtained results clearly show that HyperShot achieves results comparable to state-of-the-art non-Bayesian models on the standard set of few-shot classification settings.

BayesHyperShot Next, we compare the Bayesian version of our general framework, BayesHyperShot, against the state-of-the-art Bayesian methods. These Bayesian models are mostly built upon the Gaussian process framework (like DKT [1]). We consider the more challenging one-shot classification task, as well as the five-shot setting, and report the results in Table 2. In the one-shot scenario, BayesHyperShot achieves the third-best accuracy on the CUB dataset with or without the adapting procedure (\(66.30\%\) in both cases) and performs similarly to the best model, BayesHMAML [36] (\(66.92 \%\) with adaptation and \(66.57 \%\) without). On the mini-ImageNet dataset, our Bayesian approach is among the top methods (\(51.11\%\)), slightly behind Bayesian MAML [34] (\(53.80\%\)) and BayesHMAML (\(52.69\%\)).

In the five-shot scenario, BayesHyperShot achieves \(80.60\%\) on CUB and \(67.21\%\) on mini-ImageNet, whereas the most significant competitor, BayesHMAML [36], achieves \(80.47\%\) and \(68.24\%\) on these datasets, respectively. BayesHyperShot is therefore ahead of this competitor on CUB and behind it on mini-ImageNet.

According to the results, the performance of BayesHyperShot is comparable to or better than other state-of-the-art Bayesian models on the standard set of few-shot classification settings.

Table 2 Classification accuracy results for the Bayesian approaches
Table 3 Classification accuracy results for the non-Bayesian approaches

4.2 Cross-domain adaptation

In the cross-domain adaptation setting, the models are evaluated on tasks coming from a different distribution than the one they were trained on. Therefore, such a task is more challenging than standard classification and is a plausible indicator of a model’s ability to generalize. To benchmark the performance of our framework in cross-domain adaptation, we combine two datasets so that the training fold is drawn from the first dataset and the validation and testing folds from the second. Specifically, we test HyperShot on two cross-domain classification tasks:

mini-ImageNet \(\rightarrow \) CUB (model trained on mini-ImageNet and evaluated on CUB) and Omniglot \(\rightarrow \) EMNIST in the one-shot and five-shot settings. As in the previous experiment, we treat Bayesian and non-Bayesian approaches separately.

HyperShot We start with the non-Bayesian approaches and report the results in Table 3. In every setting, HyperShot achieves the highest accuracy and, as such, is much better than any other non-Bayesian approach. We note that just like in the case of regular classification, adapting the hypernetwork on the individual tasks consistently improves its performance.

BayesHyperShot The results among the Bayesian methods are reported in Table 4. We observe that in every setting BayesHyperShot is comparable to but slightly worse than the best models. The difference is usually between 1 and 3 percentage points, placing BayesHyperShot among the three or four best Bayesian methods. It is worth noting that in this setting, performing the adaptation procedure on BayesHyperShot could result in worse performance than not adapting at all.

Table 4 Classification accuracy results for the Bayesian approaches
Fig. 3 We visualize box plots of distributions of activations produced by the sampled models for the four different sets of support/query/out-of-distribution images. As we can see, BayesHyperShot always yields similar predictions for elements from both the support and query sets. On the other hand, we observe a high variance of activations when processing out-of-distribution images, which indicates the high uncertainty of the model in such cases.

4.3 Uncertainty quantification

The most important feature of the Bayesian realization of our general framework is that it gives full insight into the model’s uncertainty. Due to BayesHyperShot’s probabilistic construction, we can quantify the model’s level of certainty and determine whether the model is sure about a specific prediction. Moreover, this property enables us to say whether a given sample comes from the known distribution (the distribution related to the current few-shot task) or is an out-of-distribution example.

Table 5 Classification accuracy results for HyperShot in the five-shot setting with two variants of the support embeddings aggregation

In the following experiment, we present the uncertainty quantification of the BayesHyperShot model. We consider a model trained on a specific dataset and setting (here, Omniglot \(\rightarrow \) EMNIST). Then, we take three different types of samples to quantify the level of uncertainty of the model’s predictions. Specifically, we test images from:

  • The support set;

  • The query set;

  • The same dataset, but with a class not present in the support set (out of distribution).

We observe that the examples coming from the support set or query set are classified with very high certainty. The model classifies such images as from a given class (label 1) or not (label 0). Note that label 1 appears for only one class, and label 0 for the rest.

However, the most important result is the one obtained for samples of classes not present in the support set. We observe that the model is uncertain about the classification: it usually gives nonzero probabilities for each of the possible classes. The resulting increase in the entropy of the predicted probabilities allows us to classify such examples as out-of-distribution samples. The results are presented in Fig. 3.
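One way to turn this observation into a decision rule is sketched below. It averages the class probabilities over several sampled models and thresholds the entropy of the average. The averaging scheme, the example probabilities, and the threshold are all assumptions chosen for illustration; the text above does not specify how the entropy is computed or which cut-off is used.

```python
import math


def predictive_entropy(sampled_probs):
    """Entropy (in nats) of class probabilities averaged over sampled models."""
    n_models = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / n_models for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


# Hypothetical outputs of three sampled models on a 5-way task.
in_distribution = [[0.96, 0.01, 0.01, 0.01, 0.01],
                   [0.97, 0.01, 0.01, 0.005, 0.005],
                   [0.95, 0.02, 0.01, 0.01, 0.01]]
out_of_distribution = [[0.30, 0.10, 0.25, 0.15, 0.20],
                       [0.10, 0.35, 0.20, 0.25, 0.10],
                       [0.20, 0.20, 0.15, 0.20, 0.25]]

threshold = 0.5 * math.log(5)  # hypothetical cut-off: half of the maximum entropy
for name, probs in [("in", in_distribution), ("out", out_of_distribution)]:
    entropy = predictive_entropy(probs)
    print(name, round(entropy, 3), "OOD" if entropy > threshold else "known")
```

The maximum possible entropy for five classes is ln 5, reached when all classes are equally likely, so expressing the threshold as a fraction of ln 5 keeps it comparable across tasks with the same number of ways.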

4.4 Ablation study

To investigate different architectural choices in adapting our framework to a specific task, we provide a comprehensive ablation study. For clarity, we consider only the HyperShot model in this study and apply our findings when selecting the parameters of BayesHyperShot. We focus on the four major components of the HyperShot design, i.e., the method of processing multiple support examples per class, the number of neck layers, the number of head layers, and the size of the hidden layers, presented in Tables 5, 6 and 7. For the experiments on aggregating the support examples in the five-way five-shot setting, we perform the benchmarks on CUB and mini-ImageNet, using a four-layer convolutional backbone. In the remaining experiments, we test HyperShot on the CUB dataset in the five-way one-shot setting with a ResNet-10 backbone.

Aggregating support examples in the five-shot setting

In HyperShot, the hypernetwork generates the weights of the target model from the information about the support examples, expressed through the support–support kernel matrix. In the case of five-way one-shot classification, each task consists of 5 support examples; therefore, the size of the kernel matrix is \((5 \times 5)\), and the input size of the hypernetwork is 25. However, as the number of support examples grows, increasing the size of the kernel matrix accordingly would be impractical and could lead to overparametrization of the hypernetwork.

Since hypernetworks are known to be sensitive to large input sizes [4], we consider a way to keep the input size of HyperShot constant, independent of the number of support examples per class: we calculate the kernel on the mean support embedding of each class instead of on the individual embeddings. Prior work suggests that when there are multiple examples of a class, the averaged embedding represents the class sufficiently well in the embedding space [19].
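The input-size arithmetic above can be checked with a short sketch. Cosine similarity is used here as the kernel purely as an assumption, since this section does not name the kernel function; the embedding dimension and the random embeddings are likewise hypothetical.

```python
import numpy as np


def cosine_kernel(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


rng = np.random.default_rng(0)
embedding_dim = 64                              # hypothetical backbone output size
support = rng.normal(size=(5, embedding_dim))   # five-way one-shot: 5 embeddings

kernel = cosine_kernel(support, support)        # support-support kernel matrix
hyper_input = kernel.reshape(-1)                # flattened hypernetwork input
print(kernel.shape, hyper_input.shape)          # (5, 5) (25,)
```

Note that the hypernetwork input depends only on the number of support embeddings, not on the embedding dimension, which is why the input grows quadratically once every individual support example gets its own row and column.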

To verify this approach, in the five-shot setting, we train HyperShot with two variants of calculating the inputs to the kernel matrix:

  • Fine-grained: utilizing a hypernetwork that takes as input a kernel matrix between the embeddings of the individual support examples. This kernel matrix has a shape of \((25 \times 25)\).

  • Averaged: utilizing a hypernetwork where the kernel matrix is calculated between the means of embeddings of each class. The kernel matrix in this approach has a shape of \((5 \times 5)\).
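The two variants can be compared side by side in a small sketch for the five-way five-shot case. As before, the cosine kernel, the embedding dimension, and the random embeddings are illustrative assumptions rather than details taken from the experiments.

```python
import numpy as np


def cosine_kernel(x):
    """Cosine similarity between every pair of rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T


n_way, n_shot, dim = 5, 5, 64                    # dim is hypothetical
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(n_way * n_shot, dim))
labels = np.repeat(np.arange(n_way), n_shot)     # rows grouped by class

# Fine-grained: one kernel entry per pair of individual support examples.
fine_grained = cosine_kernel(embeddings)

# Averaged: one mean embedding per class, then the kernel between the means.
means = np.stack([embeddings[labels == c].mean(axis=0) for c in range(n_way)])
averaged = cosine_kernel(means)

print(fine_grained.shape, fine_grained.size)     # (25, 25) 625
print(averaged.shape, averaged.size)             # (5, 5) 25
```

The averaged variant keeps the hypernetwork input at 25 values regardless of the number of shots, while the fine-grained variant already needs 625 values at five shots.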

We benchmark both variants of HyperShot on the five-shot classification task on CUB and mini-ImageNet datasets, as well as the task of cross-domain Omniglot \(\rightarrow \) EMNIST classification. We report the accuracies in Table 5. It is evident that averaging the embeddings before calculating the kernel matrix yields superior results.

Hidden size: Firstly, as presented in Table 6, we compare different sizes of hidden layers. The results agree with the intuition that the wider the layers, the better the results. However, we also observe that some hidden sizes (e.g., 8188) could be too large to learn effectively. Because of that, we propose to use hidden sizes of 2048 or 4096 as the standard.

Table 6 Comparison between various hidden sizes in the HyperShot’s layers
Table 7 Comparison between various HyperShot’s architectures (different number of neck layers and head layers)

Neck and head layers: Next, we compare the influence of the number of neck layers and head layers of HyperShot on the achieved results, as presented in Table 7. We observe that the most critical factor is the number of head layers, which are specific to each layer of the target network. Because of that, we propose using a standard number of 3 head layers and a varying number of neck layers, tuned to the specific task.
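To make the roles of the hidden size, the neck, and the heads concrete, the following sketch builds an untrained hypernetwork with a shared neck and one head per target-network layer, and checks only the shapes of the generated weights. It is a guess at the general layout under stated assumptions, not the authors' architecture: the ReLU MLP structure, the omission of biases, the small hidden size, and the two-layer target classifier are all hypothetical.

```python
import numpy as np


def mlp(rng, sizes):
    """Randomly initialised weight matrices for a ReLU MLP (biases omitted)."""
    return [rng.normal(scale=np.sqrt(2.0 / m), size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(layers, x):
    for w in layers[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ layers[-1]  # no nonlinearity after the last layer


def build_hypernetwork(rng, input_size, hidden_size, neck_layers, head_layers, target_shapes):
    """Shared neck (neck_layers >= 1) plus one head per target-network layer."""
    neck = mlp(rng, [input_size] + [hidden_size] * neck_layers)
    heads = {name: mlp(rng, [hidden_size] * head_layers + [int(np.prod(shape))])
             for name, shape in target_shapes.items()}
    return neck, heads


def generate_weights(neck, heads, kernel, target_shapes):
    """Map a flattened kernel matrix to one weight tensor per target layer."""
    hidden = np.maximum(forward(neck, kernel.reshape(-1)), 0.0)
    return {name: forward(head, hidden).reshape(target_shapes[name])
            for name, head in heads.items()}


rng = np.random.default_rng(0)
target_shapes = {"layer_1": (64, 16), "layer_2": (16, 5)}  # hypothetical target classifier
neck, heads = build_hypernetwork(rng, input_size=25, hidden_size=128,
                                 neck_layers=2, head_layers=3,
                                 target_shapes=target_shapes)
weights = generate_weights(neck, heads, rng.normal(size=(5, 5)), target_shapes)
print({name: w.shape for name, w in weights.items()})
```

In this layout, the final matrix of each head has `hidden_size` times the number of target weights entries, so widening the hidden layers inflates the heads quickly. That is one plausible reason, offered here only as a conjecture, why very large hidden sizes could be harder to train.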

5 Conclusion

In this work, we introduced a novel general framework that uses kernel methods combined with hypernetworks. Our method directly relies on the kernel-based representations of the support examples and a hypernetwork paradigm to create the query set’s classification module. We concentrate on relations between embeddings of the support examples instead of direct feature values. Thanks to this approach, models that realize our framework can adapt to highly different tasks.

Specifically, in this paper, we propose a classical (HyperShot) and a Bayesian (BayesHyperShot) realization of our framework. We evaluate both models on various one-shot and few-shot image classification tasks. Both HyperShot and BayesHyperShot demonstrate high accuracy in all tasks, performing comparably to or better than state-of-the-art solutions. Moreover, both models have a strong ability to generalize, as evidenced by their performance on cross-domain classification tasks.

Finally, we demonstrate the ability of the Bayesian version of our framework to properly quantify the uncertainty of the model’s prediction. As such, BayesHyperShot is able to recognize out-of-distribution samples and return the level of certainty of the classification.