Abstract
Few-shot models aim at making predictions using a minimal number of labeled examples from a given task. The main challenge in this area is the one-shot setting, where each class is represented by only a single element. We propose a general framework for few-shot learning via kernel HyperNetworks—the fusion of the kernel and hypernetwork paradigms. First, we introduce the classical realization of this framework, dubbed HyperShot. In contrast to reference approaches that apply a gradient-based adjustment of the parameters, our models switch the parameters of the classification module depending on the task’s embedding. In practice, we utilize a hypernetwork that takes the aggregated information from support data and returns the classifier’s parameters tailored to the considered problem. Moreover, we introduce a kernel-based representation of the support examples, delivered to the hypernetwork to create the parameters of the classification module. Consequently, we rely on relations between the embeddings of the support examples instead of the direct feature values produced by the backbone model. Thanks to this approach, our model can adapt to highly different tasks. While this method obtains very good results, it is limited by typical problems such as poorly quantified uncertainty due to limited data size. We further show that incorporating Bayesian neural networks into our general framework, an approach we call BayesHyperShot, solves this issue.
1 Introduction
Current artificial intelligence techniques cannot rapidly generalize from a few examples. This common inability stems from the fact that most deep neural networks must be trained on large-scale data. In contrast, humans can learn new tasks quickly by utilizing what they learned in the past. Few-shot learning models try to fill this gap by learning how to learn from a limited number of examples. Few-shot learning is the problem of making predictions based on a few labeled examples. The goal of few-shot learning is not to recognize a fixed set of labels but to quickly adapt to new tasks with a small amount of training data. After training, the model can classify new data using only a few training examples.
Two novel few-shot learning techniques have recently emerged. The first is based on kernel methods and Gaussian processes [1,2,3]; these methods assume that a universal deep kernel, trained on a large number of tasks, generalizes well to unseen tasks without overfitting. The second technique makes use of Hypernetworks [4,5,6,7,8,9], which aggregate information from the support set and produce dedicated network weights for new tasks.
The above approaches give promising results but also have some limitations. Kernel-based methods are not flexible enough, since they rely on Gaussian processes on top of the models. Moreover, it is not trivial to use Gaussian processes for classification tasks. Hypernetworks, on the other hand, must aggregate information from the support set, and it is hard for them to model the relations between classes, as opposed to classical feature extraction.
This paper introduces a general framework that combines the Hypernetworks paradigm with kernel methods to realize a new strategy that mimics the human way of learning. First, we examine the entire support set and extract the information in order to distinguish objects of each class. Then, based on the relations between their features, we create the decision rules.
Kernel methods realize the first part of the process. For each of the few-shot tasks, we extract the features from the support set through the backbone architecture and calculate kernel values between them. Then we use a Hypernetwork architecture [3, 4]—a neural network that takes the kernel representation and produces decision rules in the form of a classifier. In our framework, the Hypernetwork aggregates the information from the support set and produces the weights or adaptation parameters of the target model dedicated to the specific task, which classifies the query set.
The models we propose inherit the flexibility from Hypernetworks and the ability to learn relations between objects from kernel-based methods.
We consider two alternative approaches to the realization of our framework—the classical one, which we dub HyperShot, and the Bayesian version, called BayesHyperShot. We begin with the classical approach.
We perform an extensive experimental study of HyperShot by benchmarking it on various one-shot and few-shot image classification tasks. We find that HyperShot demonstrates high accuracy in all tasks, performing comparably or better than the other recently proposed methods. Moreover, HyperShot shows a strong ability to generalize, as evidenced by its performance on cross-domain classification tasks.
Unfortunately, HyperShot, similar to other few-shot algorithms, suffers from limited data size, which may result in drawbacks such as poorly quantified uncertainty. To solve this problem, we extend our method by combining the Bayesian neural network and Hypernetwork paradigms, which results in a Bayesian version of our general framework—BayesHyperShot. In practice, the hypernetwork produces the parameters of a probability distribution over weights. In this paper, we consider a Gaussian prior, but our framework can be used for modeling even more complex priors.
The contributions of this work are fourfold:

In this paper, we propose a general framework that realizes the learn-how-to-learn paradigm by modeling learning rules that are not based on gradient optimization and can produce completely different decision strategies.

We propose a new approach to solving the few-shot learning problem by aggregating information from the support set with kernel methods and directly producing the weights of a neural network dedicated to classifying the query set.

We propose the HyperShot model, which combines the Hypernetwork paradigm with kernel methods in the classical (non-Bayesian) setting, to produce the weights dedicated to each task.

We show that our model can be generalized to a Bayesian version, BayesHyperShot, which addresses the problem of poorly quantified uncertainty.
2 Hypernetworks for few-shot learning–a general framework
In this section, we present our general framework for utilizing Hypernetworks and kernel-based methods for few-shot learning. We begin with a quick recap of the necessary background. Then, we introduce our framework by presenting HyperShot. Finally, we show how to incorporate the Bayesian approach into this general framework and present BayesHyperShot.
2.1 Background
Few-shot learning The terminology describing the few-shot learning setup is inconsistent due to the colliding definitions used in the literature. For a unified taxonomy, we refer the reader to [10, 11]. Here, we use the nomenclature derived from the meta-learning literature, which is the most prevalent at the time of writing. Let:
\({\mathcal {S}} = \{({\textbf{x}}_l, y_l)\}_{l=1}^{L}\)
be a support set containing input–output pairs, with L examples distributed equally among the classes. In the one-shot scenario, each class is represented by a single example, and \(L=K\), where K is the number of the considered classes in the given task. For few-shot scenarios, each class usually has from 2 to 5 representatives in the support set \({\mathcal {S}}\). Let:
\({\mathcal {Q}} = \{({\textbf{x}}_m, {\textbf{y}}_m)\}_{m=1}^{M}\)
be a query set (sometimes referred to in the literature as a target set), with M examples, where M is typically one order of magnitude greater than K. For ease of notation, the support and query sets are grouped into a task \({\mathcal {T}} = \{{\mathcal {S}}, {\mathcal {Q}} \}\). During the training stage, the models for few-shot applications are fed with randomly selected examples from the training set \({\mathcal {D}} = \{{\mathcal {T}}_n\}^N_{n=1}\), defined as a collection of such tasks.
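This episodic construction can be sketched in code. The following is a minimal illustration (all names, such as `sample_task`, are ours and not part of the paper's implementation) of how an N-way, K-shot task with M query examples per class might be drawn from a labeled dataset:

```python
import random
from collections import defaultdict

def sample_task(dataset, n_way=5, k_shot=1, m_query=16, rng=None):
    """Sample one few-shot task T = {S, Q} from a labeled dataset.

    dataset: list of (x, y) pairs.
    Returns (support, query), each a list of (x, task_label) pairs,
    with class labels re-indexed to 0..n_way-1 for the current task.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = rng.sample(sorted(by_class), n_way)   # pick the task's classes
    support, query = [], []
    for task_label, c in enumerate(classes):
        xs = rng.sample(by_class[c], k_shot + m_query)
        support += [(x, task_label) for x in xs[:k_shot]]
        query += [(x, task_label) for x in xs[k_shot:]]
    return support, query
```

A training set \({\mathcal {D}}\) then corresponds to repeatedly calling such a sampler during training episodes.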
During the inference stage, we consider task \({\mathcal {T}}_{*} = \{{\mathcal {S}}_{*}, {\mathcal {X}}_{*}\}\), where \({\mathcal {S}}_{*}\) is a support set with the known class values for a given task, and \({\mathcal {X}}_{*}\) is a set of query (unlabeled) inputs. The goal is to predict the class labels for query inputs \({\textbf{x}} \in {\mathcal {X}}_*\), assuming support set \({\mathcal {S}}_{*}\) and using the model trained on \({\mathcal {D}}\).
Hypernetwork In the canonical work [4], hypernetworks are defined as neural models that generate weights for a separate target network solving a specific task. The authors aim to reduce the number of trainable parameters by designing a hypernetwork with a smaller number of parameters than the target network. Making an analogy between hypernetworks and generative models, the authors of [12] use this mechanism to generate a diverse set of target networks approximating the same function.
2.2 HyperShot overview
We introduce HyperShot—a model that utilizes hypernetworks for few-shot problems. The main idea of the proposed approach is to predict the values of the parameters of a classification network that makes predictions on the query images, given the information extracted from the support examples of a given task. Thanks to this approach, we can switch the classifier’s parameters between completely different tasks based on the support set. The information about the current task is extracted from the support set using a parameterized kernel function that operates on the embedding space. Consequently, we use relations among the support examples instead of taking the embedding values directly as an input to the hypernetwork. This makes the approach robust to embedding values of new tasks that lie far from the feature regions observed during training. The classification of the query image is also performed using the kernel values calculated with respect to the support set.
The architecture of HyperShot is provided in Fig. 1. We aim to predict the class distribution \(p({\textbf{y}} \mid S,{\textbf{x}}),\) given a query image \({\textbf{x}}\) and set of support examples \(S = \{ ({\textbf{x}}_l, y_l) \}_{l=1}^K.\) First, all images from the support set are grouped by their corresponding class values. Next, each of the images \({\textbf{x}}_l\) from the support set is transformed using encoding network \(E(\cdot )\), which creates low-dimensional representations of the images, \(E({\textbf{x}}_l)={\textbf{z}}_l\). The constructed embeddings are sorted according to class labels and stored in the matrix \({\textbf{Z}}_S=[{\textbf{z}}_{\pi (1)}, \dots , {\textbf{z}}_{\pi (K)}]^\textrm{T}\), where \(\pi (\cdot )\) is the bijective function, that satisfies \(y_{\pi (l)} \le y_{\pi (k)}\) for \(l \le k\).
In the next step, we calculate the kernel matrix \({\textbf{K}}_{S, S}\), for vector pairs stored in rows of \({\textbf{Z}}_S\). To achieve this, we use the parametrized kernel function \(k(\cdot , \cdot )\), and calculate the \(k_{i,j}\) element of matrix \({\textbf{K}}_{S, S}\) in the following way:
\(k_{i,j} = k({\textbf{z}}_{\pi (i)}, {\textbf{z}}_{\pi (j)}).\)
The kernel matrix \({\textbf{K}}_{S, S}\) represents the extracted information about the relations between support examples for a given task. The matrix \({\textbf{K}}_{S, S}\) is further reshaped to the vector format and delivered to the input of the hypernetwork \(H(\cdot )\). The role of the hypernetwork is to provide the parameters \(\varvec{\theta }_T\) of target model \(T(\cdot )\) responsible for the classification of the query object. Thanks to this approach, we can switch between the parameters of entirely different tasks without moving along a gradient-controlled trajectory, as in reference approaches such as MAML.
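As an illustration of this step, the kernel matrix and its flattened, hypernetwork-ready form can be computed as follows. This is a simplified sketch (the function name is ours), with the hypernetwork itself abstracted away:

```python
import numpy as np

def support_kernel_input(Z_s, kernel):
    """Build the hypernetwork input from support embeddings.

    Z_s: (K, d) matrix of class-sorted support embeddings Z_S.
    kernel: callable k(z1, z2) -> float.
    Returns the K x K kernel matrix K_{S,S} flattened to a vector.
    """
    K = Z_s.shape[0]
    K_ss = np.array([[kernel(Z_s[i], Z_s[j]) for j in range(K)]
                     for i in range(K)])
    return K_ss.reshape(-1)   # vector format expected by H(.)
```

The returned vector would then be passed through the hypernetwork MLP to produce the target parameters \(\varvec{\theta }_T\).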
We base the architecture of the Hypernetwork on a multilayer perceptron (MLP). Specifically, the HyperNetwork first processes the kernel matrix \({\textbf{K}}_{S, S}\) through an MLP (dubbed the “neck”), whose output is then processed by “heads”—separate MLPs dedicated to producing the weights of each target network layer—see Fig. 4 for a visualization. We summarize the parameters of Hypernetworks and target networks used in our experiments in Table 8.
The query image \({\textbf{x}}\) is classified in the following manner. First, the input image is transformed to a low-dimensional feature representation \({\textbf{z}}_{{\textbf{x}}}\) by encoder \(E({\textbf{x}})\). Further, the kernel vector \({\textbf{k}}_{{\textbf{x}},S}\) between the query embedding and sorted support vectors \({\textbf{Z}}_S\) is calculated in the following way:
\({\textbf{k}}_{{\textbf{x}},S} = \big [k({\textbf{z}}_{{\textbf{x}}}, {\textbf{z}}_{\pi (1)}), \dots , k({\textbf{z}}_{{\textbf{x}}}, {\textbf{z}}_{\pi (K)})\big ]^\textrm{T}.\)
The vector \({\textbf{k}}_{{\textbf{x}}, S}\) is then provided as the input to target model \(T(\cdot )\), which uses the parameters \(\varvec{\theta }_{T}\) returned by hypernetwork \(H(\cdot )\). The target model returns the probability distribution \(p({\textbf{y}} \mid S, {\textbf{x}})\) over the classes considered in the task.
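The query-classification path can be summarized in a toy sketch. For illustration only, we assume a linear target head whose weights would come from the hypernetwork; the actual target network is an MLP (see Table 8), and all names below are ours:

```python
import numpy as np

def classify_query(z_x, Z_s, kernel, target_params):
    """Classify a query embedding via kernel features and
    hypernetwork-produced target weights.

    z_x: (d,) query embedding; Z_s: (K, d) class-sorted support embeddings.
    target_params: (W, b) with W of shape (K, K) and b of shape (K,),
    standing in for the parameters theta_T produced by H(.).
    Returns class probabilities over the K task classes.
    """
    k_x = np.array([kernel(z_x, z) for z in Z_s])   # kernel vector k_{x,S}
    logits = target_params[0] @ k_x + target_params[1]
    e = np.exp(logits - logits.max())               # numerically stable softmax
    return e / e.sum()
```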
The function \(\pi (\cdot )\) enforces some ordering of the input delivered to \(T(\cdot )\). In practice, any other permutation of the classes can be used for the input vector \({\textbf{k}}_{{\textbf{x}}, S}\). In such a case, the same permutation should be applied to the rows and columns of \({\textbf{K}}_{S,S}\). As a consequence, the hypernetwork is able to produce dedicated target parameters for each of the possible permutations. Although this approach does not guarantee permutation invariance in real-life scenarios, thanks to the dedicated parameters for any ordering of the input, it should be satisfied in the majority of cases.
2.3 Kernel function
One of the key components of our approach is a kernel function \(k(\cdot , \cdot )\). In this work, we consider the dot product of the transformed vectors given by:
\(k({\textbf{z}}_1, {\textbf{z}}_2) = {\textbf{f}}({\textbf{z}}_1)^\textrm{T} {\textbf{f}}({\textbf{z}}_2),\)
where \({\textbf{f}}(\cdot )\) can be a parametrized transformation function, represented by an MLP model, or simply an identity operation, \({\textbf{f}}({\textbf{z}})={\textbf{z}}\). In Euclidean space, this criterion can be expressed as \(k({\textbf{z}}_1, {\textbf{z}}_2)=\Vert {\textbf{f}}({\textbf{z}}_1)\Vert \, \Vert {\textbf{f}}({\textbf{z}}_2)\Vert \cos {\alpha }\), where \(\alpha \) is the angle between vectors \({\textbf{f}}({\textbf{z}}_1)\) and \({\textbf{f}}({\textbf{z}}_2)\). The main feature of this function is that it takes the vectors’ norms into account, which can be problematic for tasks that are outliers with respect to the representations created by \({\textbf{f}}(\cdot )\). Therefore, in our experiments we also consider the cosine kernel function given by:
\(k_{c}({\textbf{z}}_1, {\textbf{z}}_2) = \frac{{\textbf{f}}({\textbf{z}}_1)^\textrm{T} {\textbf{f}}({\textbf{z}}_2)}{\Vert {\textbf{f}}({\textbf{z}}_1)\Vert \, \Vert {\textbf{f}}({\textbf{z}}_2)\Vert },\)
which represents the normalized version of the dot product. Considering the geometrical representation, \(k_{c}({\textbf{z}}_1, {\textbf{z}}_2)\) can be expressed as \(\cos {\alpha }\) (see the example given by Fig. 2). The support set is represented by two examples from different classes, \({\textbf{f}}_1\) and \({\textbf{f}}_2\). The target model parameters \(\varvec{\theta }_{T}\) are created based only on the cosine value of the angle between vectors \({\textbf{f}}_1\) and \({\textbf{f}}_2\). During the classification stage, the query example is represented by \({\textbf{f}}_{{\textbf{x}}}\), and the classification is applied on the cosine values of the angles between \({\textbf{f}}_{{\textbf{x}}}\) and \({\textbf{f}}_1\), and \({\textbf{f}}_{{\textbf{x}}}\) and \({\textbf{f}}_2\), respectively.
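Both kernel variants are straightforward to implement. The sketch below (function names are ours) shows the dot-product kernel and its normalized, cosine version, with \({\textbf{f}}\) defaulting to the identity:

```python
import numpy as np

def dot_kernel(z1, z2, f=lambda z: z):
    """k(z1, z2) = f(z1) . f(z2); f defaults to the identity map."""
    return float(np.dot(f(z1), f(z2)))

def cosine_kernel(z1, z2, f=lambda z: z, eps=1e-12):
    """Normalized dot product: the cosine of the angle between
    f(z1) and f(z2); insensitive to the vectors' norms."""
    a, b = f(z1), f(z2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Note how the cosine kernel discards the norm information that makes the plain dot product sensitive to outlier representations.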
2.4 Training and prediction
The training procedure assumes the following parametrization of the model components. The encoder \(E:=E_{\varvec{\theta }_E}\) is parametrized by \(\varvec{\theta _E}\), the hypernetwork \(H=H_{\varvec{\theta }_H}\) by \(\varvec{\theta }_H\) and the kernel function k by \(\varvec{\theta }_k\). We assume that training set \({\mathcal {D}}\) is represented by tasks \({\mathcal {T}}_i\) composed of support \({\mathcal {S}}_i\) and query \({\mathcal {Q}}_i\) examples. The training is performed by optimizing the cross-entropy criterion:
\({\mathcal {L}} = -\sum _{i} \sum _{m=1}^{M} \log p({\textbf{y}}_{i,m} \mid {\mathcal {S}}_i, {\textbf{x}}_{i,m}),\)
where \(({\textbf{x}}_{i,m}, {\textbf{y}}_{i, m})\) are examples from the query set \({\mathcal {Q}}_i=\{({\textbf{x}}_{i,m}, {\textbf{y}}_{i, m})\}_{m=1}^M\). The distribution for the currently considered classes \(p({\textbf{y}} \mid {\mathcal {S}}, {\textbf{x}})\) is returned by the target network T of HyperShot. During training, we jointly optimize the parameters \(\varvec{\theta }_H\), \(\varvec{\theta }_k\) and \(\varvec{\theta }_E\), minimizing the loss \({\mathcal {L}}\).
During the inference stage, we consider the task \({\mathcal {T}}_*\), composed of a set of labeled support examples \({\mathcal {S}}_*\) and a set of unlabeled query examples represented by input values \({\mathcal {X}}_*\) that the model should classify. We can simply take the probability values \(p({\textbf{y}} \mid {\mathcal {S}}_{*}, {\textbf{x}})\) assuming the given support set \({\mathcal {S}}_*\) and single query observation \({\textbf{x}}\) from \({\mathcal {X}}_*\), using the model with trained parameters \(\varvec{\theta }_H\), \(\varvec{\theta }_k\), and \(\varvec{\theta }_E\). However, we observe that slightly better results are obtained when adapting the model’s parameters to the considered task. Since we do not have access to labels for query examples, we imitate the query set for this task simply by taking the support examples, creating the adaptation task \({\mathcal {T}}_i=\{{\mathcal {S}}_*,{\mathcal {S}}_*\}\), and updating the parameters of the model using several gradient iterations. The detailed presentation of the training and prediction procedures is provided by Algorithm 1.
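The adaptation step described above can be illustrated schematically: the support set plays the role of the query set, and a few gradient updates are applied. The sketch below is generic (the `loss_grad` callback and all names are our assumptions, standing in for backpropagation through the full model):

```python
def adapt_on_support(params, support, loss_grad, steps=10, lr=1e-2):
    """Inference-time adaptation: reuse the support set as a pseudo-query
    set, forming the task T = {S*, S*}, and take a few gradient steps.

    params: dict name -> parameter array.
    loss_grad(params, support, query): returns gradients of the loss
    in the same dict layout (model-specific; assumed supplied by caller).
    """
    for _ in range(steps):
        grads = loss_grad(params, support, support)  # query := support
        params = {k: v - lr * grads[k] for k, v in params.items()}
    return params
```

In the actual model, `params` would hold \(\varvec{\theta }_H\), \(\varvec{\theta }_k\), and \(\varvec{\theta }_E\), and the gradients would come from the cross-entropy criterion.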
2.5 Adaptation to few-shot scenarios
The proposed approach uses the ordering function \(\pi (\cdot )\) that keeps the consistency between support kernel matrix \({\textbf{K}}_{S,S}\) and the vector of kernel values \({\textbf{k}}_{{\textbf{x}}, S}\) for query example \({\textbf{x}}\). For few-shot scenarios, each class has more than one representative in the support set. As a consequence, there are various possibilities to order the feature vectors in the support set inside the considered class. To eliminate this issue, we follow [8] and propose to apply an aggregation function to the embeddings \({\textbf{z}}\) of the support examples from the same class. Thanks to this approach, the kernel matrix is calculated based on the aggregated values of the latent space of encoding network E, making our approach independent of the ordering among the embeddings from the same class. In ablation studies, we examine the quality of the mean aggregation operation (averaged) against simple class-wise concatenation of the embeddings (fine-grained).
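A minimal sketch of the two aggregation variants (the function name and `mode` labels are ours, mirroring the averaged/fine-grained terminology above):

```python
import numpy as np

def aggregate_support(Z, labels, mode="averaged"):
    """Aggregate support embeddings class-wise for few-shot tasks.

    Z: (L, d) support embeddings; labels: length-L class ids.
    "averaged": one mean embedding per class (order-invariant).
    "finegrained": class-wise concatenation of the raw embeddings.
    Rows are returned in sorted class order (the role of pi).
    """
    out = []
    for c in sorted(set(labels)):
        zc = Z[[i for i, y in enumerate(labels) if y == c]]
        out.append(zc.mean(axis=0) if mode == "averaged" else zc.reshape(-1))
    return np.stack(out)
```

With mean aggregation, permuting the shots within a class leaves the kernel matrix unchanged, which is exactly the invariance discussed above.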
2.6 Extension to Bayesian setting—BayesHyperShot
Finally, we introduce the Bayesian extension of HyperShot, where the distribution over the target parameters is represented by a Gaussian prior:
\(p(\varvec{\theta }_{{\mathcal {T}}}) = {\mathcal {N}}(\varvec{\theta }_{{\mathcal {T}}} \mid \textbf{0}, \textbf{I}).\)
Thanks to this assumption, the target model \({\mathcal {T}}\) becomes probabilistic and can be used as a Bayesian network during inference.
In order to achieve that, we postulate the use of an amortized variational posterior:
\(q(\varvec{\theta }_{{\mathcal {T}}} \mid {\mathcal {S}}_i, \varvec{\theta }_{H}) = {\mathcal {N}}\big (\varvec{\theta }_{{\mathcal {T}}} \mid \mu ({\mathcal {S}}_i), \textrm{diag}(\sigma ^{2}({\mathcal {S}}_i))\big ),\)
where the Gaussian parameters \(\mu ({\mathcal {S}}_i)\) and \(\sigma ({\mathcal {S}}_i)\) are returned by hypernetwork \(H_{\varvec{\theta }_{H}}\) for a given support set \({\mathcal {S}}_i\), i.e., \((\mu ({\mathcal {S}}_i), \sigma ({\mathcal {S}}_i)) = H_{\varvec{\theta }_{H}}( {\mathcal {S}}_i )\). Compared to the basic HyperShot model (option 1 in Fig. 1) that predicts deterministic target weights, the proposed extension delivers a distribution over parameters for a given support set (option 2 in Fig. 1).
We train the BayesHyperShot using the procedure given by Algorithm 1 with the modified training objective, given by:
\({\mathcal {L}}_{\text {Bayes}} = \frac{1}{P}\sum _{p=1}^{P} l_{{\mathcal {T}}_i}(\varvec{\theta }_{{\mathcal {T}}_i}^p) + \gamma \, KL\big (q(\varvec{\theta }_{{\mathcal {T}}} \mid {\mathcal {S}}_i, \varvec{\theta }_{H}), {\mathcal {N}}(\textbf{0}, \textbf{I})\big ),\)
where \(\varvec{\theta }_{{\mathcal {T}}_{i}}^1,\dots ,\varvec{\theta }_{{\mathcal {T}}_{i}}^P \sim q(\varvec{\theta }_{\mathcal {T}} \mid {\mathcal {S}}_i, \varvec{\theta }_{H})\) are the target parameters sampled from the variational posterior modeled by the hypernetwork, \( l_{{\mathcal {T}}_i}( \varvec{\theta }_{{\mathcal {T}}_i}^p)\) is the cross-entropy loss calculated using the target model with weights \(\varvec{\theta }_{{\mathcal {T}}_{i}}^p\), and \(KL(\cdot ,\cdot )\) is the Kullback–Leibler divergence between the posterior and the standard Gaussian. In order to stabilize the training procedure, we use a hyperparameter \(\gamma \) that controls the trade-off between the cross-entropy loss and the regularization term. During training, we apply an annealing scheme [13], for which \(\gamma \) grows from zero to a fixed constant. The final value \(\gamma _\text {max}\) is a hyperparameter of the model.
During the inference stage, we sample the set of target parameters \(\varvec{\theta }_{{\mathcal {T}}_{i}}^1,\dots ,\varvec{\theta }_{{\mathcal {T}}_{i}}^P \sim q(\varvec{\theta }_{\mathcal {T}} \mid {\mathcal {S}}_i, \varvec{\theta }_{H})\). The final prediction is made simply by averaging over the sampled target models, \(\frac{1}{P} \sum _{p=1}^{P} p({\textbf{y}} \mid {\mathcal {S}}_{*}, {\textbf{x}},\varvec{\hat{\theta }}_H, \varvec{\hat{\theta }}_k, \varvec{\hat{\theta }}_E, \varvec{\theta }_{{\mathcal {T}}_{i}}^p)\).
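The Monte Carlo prediction described above can be sketched as follows. For illustration we assume a linear target head whose entries are sampled from the amortized Gaussian posterior via the reparameterization trick; the real target architecture and names below are our simplifications:

```python
import numpy as np

def bayes_predict(mu, sigma, k_x, P=16, rng=None):
    """BayesHyperShot-style prediction sketch: sample P weight sets
    from the posterior N(mu, sigma^2) produced by the hypernetwork,
    run an (assumed linear) target head on the kernel vector k_x,
    and average the resulting class probabilities.

    mu, sigma: (K, K+1) posterior parameters for a K-class head
    (last column is the bias). k_x: (K,) kernel vector k_{x,S}.
    """
    rng = rng or np.random.default_rng(0)
    K = len(k_x)
    probs = np.zeros(K)
    for _ in range(P):
        theta = mu + sigma * rng.standard_normal(mu.shape)  # reparameterize
        W, b = theta[:, :-1], theta[:, -1]                  # unpack head
        logits = W @ k_x + b
        e = np.exp(logits - logits.max())                   # stable softmax
        probs += e / e.sum()
    return probs / P
```

The spread of the P individual predictions around this average is what gives access to the uncertainty estimates discussed in Sect. 4.3.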
3 Related work
In recent years, various meta-learning methods [14,15,16] have been proposed to tackle the problem of few-shot learning. The meta-learning architectures for few-shot learning can be roughly categorized into several groups, which we divide below into non-Bayesian and Bayesian families of approaches.
3.1 Non-Bayesian approaches
In non-Bayesian few-shot learning, the goal is typically to learn a set of parameters that can be used to classify new examples based on a limited amount of training data.
Transfer learning [17] is a simple yet effective baseline procedure for few-shot learning which consists of pre-training the neural network and a classifier on all of the classes available during meta-training. During meta-validation, the classifier is then fine-tuned to the novel tasks. In [10], the authors proposed Baseline++, an extension of this idea that uses cosine distance between the examples.
Metric-based methods meta-learn a deep representation with a metric in feature space, such that examples from the support and query sets that share a class lie close together in that space. Some of the earliest works exploring this notion are matching networks [18] and prototypical networks [19], which form prototypes based on embeddings of the examples from the support set in the learned feature space and classify the query set based on the distance to those prototypes. Numerous subsequent works aim to improve the expressiveness of the prototypes through various techniques. Oreshkin et al. [20] achieve this by conditioning the network on specific tasks, thus making the learned space task-dependent. The authors of [21] transform embeddings of support and query examples in the feature space to make their distributions closer to Gaussian. Sung et al. [22] propose Relation Nets, which learn the metric function instead of using a fixed one, such as Euclidean or cosine distance.
Optimization-based methods follow the idea of an optimization process over the support set within the meta-learning framework, like MetaOptNet [23], Meta-SGD [24] or model-agnostic meta-learning (MAML) [25] and its extensions [26,27,28,29,30,31]. Those techniques aim to train general models, which can adapt their parameters to the support set at hand in a small number of gradient steps. Similar to such techniques, HyperShot (the classical realization of our framework) also aims to produce task-specific models but utilizes a hypernetwork instead of optimization to achieve that goal.
Hypernetwork-based methods [4] have been proposed as a solution to few-shot learning problems in a number of works but have not been researched as widely as the approaches mentioned above. Multiple works proposed variations of hypernetworks that predict a shallow classifier’s parameters given the support examples [5, 32, 33]. More recently, [7,8,9] explored generating all of the parameters of the target network with a transformer-based hypernetwork, but found that for larger target networks, it is sufficient to generate only the parameters of the final classification layer. A particularly effective approach is to use transformer-based hypernetworks as set-to-set functions which make the generated classifier more discriminative [6]. A key characteristic of the above approaches is that during inference, the hypernetwork predicts the weights responsible for classifying each class independently, based solely on the examples of that class from the support set. This property makes such solutions agnostic to the number of classes in a task, which is useful in practical applications. However, it also means that the hypernetwork does not take advantage of the inter-class differences in the task at hand.
In contrast, models in our framework (in particular HyperShot) exploit those differences by utilizing kernels, which helps improve their performance.
3.2 Bayesian approaches
There are many few-shot learning methods that utilize a Bayesian approach to estimate the parameters of a model. We group them into three categories:
Bayesian optimization-based methods reformulate MAML as a hierarchical Bayesian model [34,35,36,37,38,39]. Contrary to this group of methods, BayesHyperShot does not utilize a bi-level optimization scheme, such as MAML. Moreover, we focus on the adaptation of the target network’s weights rather than on the optimization process itself.
Probabilistic weight generation methods focus on predicting a distribution over the parameters suitable for the given task [40]. Similarly, our BayesHyperShot also predicts the distribution over the parameters of the target network performing the few-shot classification. The key difference is that BayesHyperShot combines such an approach with a kernel mechanism.
Gaussian-process-based methods [41] possess many properties useful in few-shot learning, such as natural robustness to limited amounts of data and the ability to estimate uncertainty. When combined with meta-learned deep kernels, Gaussian processes were demonstrated in [1] to be a suitable tool for few-shot regression and classification, an approach dubbed deep kernel transfer (DKT). The assumption that such a universal deep kernel has enough data to generalize well to unseen tasks has been challenged in subsequent works. [3] introduced a technique of learning dense Gaussian processes by inducing variables. This approach achieves substantial performance improvement over the alternative methods. Similarly, the Bayesian version of our model (BayesHyperShot) also depends on learning a model that estimates task-specific functions’ parameters. However, BayesHyperShot employs a hypernetwork instead of a Gaussian process to achieve that goal.
4 Experiments
In the typical few-shot learning setting, making a valuable and fair comparison between proposed models is often complicated because of significant differences in the architectures and implementations of known methods. To limit the influence of deeper backbone (feature extractor) architectures, we follow the unified procedure proposed by [10].
In this section, we describe the experimental analysis and performance of the proposed methods—HyperShot and BayesHyperShot—in a large variety of few-shot benchmarks. Specifically, we consider both classification (see Sect. 4.1) and cross-domain adaptation (see Sect. 4.2) tasks. Whereas the classification problems are focused on the most typical few-shot applications, the latter cross-domain benchmarks check the ability of the models to adapt to out-of-distribution tasks. We compare the classical realization of our framework (HyperShot) against the non-Bayesian approaches and the Bayesian version (BayesHyperShot) against Bayesian approaches only. We limit our comparison to models designed for the standard inductive few-shot setting. Notably, we exclude models which utilize anything besides the support set for fitting to the task at hand, such as methods that are transductive [42] or which utilize additional unlabeled data [43]. Additionally, in Sect. 4.3 we conduct an experiment showing the uncertainty estimation via BayesHyperShot. Finally, we perform an ablation study of the possible adaptation procedures of HyperShot for few-shot scenarios, as well as architectural choices—presented in Sect. 4.4.
In all of the reported experiments, the tasks consist of 5 classes (5-way) and 1 or 5 support examples (1 or 5 shots). Unless indicated otherwise, all compared models use a known and widely utilized backbone consisting of four convolutional layers (each consisting of a 2D convolution, a batch-norm layer, and a ReLU nonlinearity; each layer consists of 64 channels) [10] and have been trained from scratch.
We report the performance of two versions of our general framework—classical (HyperShot) and Bayesian (BayesHyperShot). Note that each of them has two variants:

HyperShot/BayesHyperShot–models generated by the hypernetworks for each task.

HyperShot/BayesHyperShot + adaptation–models generated by hypernetworks adapted to the support examples of each task for 10 training steps.^{Footnote 1}
In all cases, we observe a modest performance boost thanks to adapting the hypernetwork. Comprehensive details and hyperparameters for each training procedure are reported in Appendices B and C.
4.1 Classification
Firstly, we consider a classical few-shot learning scenario, where all the classification tasks (both training and inference) come from the same dataset. The main aim of the proposed classification experiments is to assess the ability of few-shot models to adapt to never-seen tasks from the same data distribution.
We benchmark the performance of our models and other methods on two challenging and widely considered datasets: Caltech-UCSD Birds (CUB) [44] and miniImageNet [45]. The following experiments use the most popular setting, 5-way, consisting of 5 random classes. In all experiments, the query set of each task consists of 16 samples for each class (80 in total).
HyperShot We start with the non-Bayesian perspective and compare HyperShot to a vast pool of state-of-the-art algorithms, including the canonical methods (like matching networks [18], prototypical networks [19], MAML [25], and its extensions) as well as the recently popular UnicornMAML [29] and PAMELA [46].
We consider the more challenging one-shot classification task, as well as the five-shot setting, and report the results in Table 1.
In the one-shot scenario, HyperShot achieves the second-best accuracy on the CUB dataset with the adaptation procedure (\(66.13\%\) with adaptation, \(65.27\%\) without) and performs better than any other model except for FEAT [6] (\(68.87 \%\)). On the miniImageNet dataset, our approach is among the top approaches (\(53.18\%\)), slightly behind FEAT [6] (\(55.15\%\)). Considering the five-shot scenario, HyperShot is the second-best model, achieving \(80.07\%\) on the CUB dataset and \(69.62\%\) on miniImageNet, whereas the best model, FEAT [6], achieves \(82.90\%\) and \(71.61\%\) on the mentioned datasets, respectively.
The obtained results clearly show that HyperShot achieves results comparable to state-of-the-art non-Bayesian models on the standard set of few-shot classification settings.
BayesHyperShot Then, we compare the Bayesian version of our general framework—BayesHyperShot—against the state-of-the-art Bayesian methods. These Bayesian models are mostly built upon the Gaussian process framework (like DKT [1]). We consider the more challenging one-shot classification task, as well as the five-shot setting, and report the results in Table 2. In the one-shot scenario, BayesHyperShot achieves the third-best accuracy on the CUB dataset regardless of the adaptation procedure (\(66.30\%\) in both cases) and performs similarly to the best model—BayesHMAML [36] (\(66.92 \%\) with adaptation and \(66.57 \%\) without, respectively). On the miniImageNet dataset, our Bayesian approach is among the top methods (\(51.11\%\)), slightly behind Bayesian MAML [34] (\(53.80\%\)) and BayesHMAML (\(52.69\%\)).
Considering the five-shot scenario, BayesHyperShot is the best model, achieving \(80.60\%\) on the CUB dataset and \(67.21\%\) on miniImageNet, whereas the most significant competitor, BayesHMAML [36], achieves \(80.47\%\) and \(68.24\%\) on the mentioned datasets, respectively.
According to these results, the performance of BayesHyperShot is comparable to or better than other state-of-the-art Bayesian models on the standard set of few-shot classification settings.
4.2 Cross-domain adaptation
In the cross-domain adaptation setting, the models are evaluated on tasks coming from a different distribution than the one they were trained on. Such a task is therefore more challenging than standard classification and is a plausible indicator of a model’s ability to generalize. To benchmark the performance of our framework in cross-domain adaptation, we merge data from two datasets so that the training fold is drawn from the first dataset and the validation and testing folds from the other. Specifically, we test HyperShot on two cross-domain classification tasks:
miniImageNet \(\rightarrow \) CUB (model trained on miniImageNet and evaluated on CUB) and Omniglot \(\rightarrow \) EMNIST, in both the one-shot and five-shot settings. As in the previous experiment, we treat Bayesian and non-Bayesian approaches separately.
HyperShot We start with the non-Bayesian approaches and report the results in Table 3. In every setting, HyperShot achieves the highest accuracy and is thus clearly better than any other non-Bayesian approach. We note that, just like in regular classification, adapting the hypernetwork to the individual tasks consistently improves its performance.
BayesHyperShot The results among the Bayesian methods are reported in Table 4. We observe that in every setting BayesHyperShot is comparable to but slightly worse than the best models. The difference is usually between 1 and 3 percentage points, which places BayesHyperShot among the three or four best Bayesian methods. It is worth noting that in this setting, performing the adaptation procedure on BayesHyperShot can result in worse performance than not adapting at all.
4.3 Uncertainty quantification
The most important feature of the Bayesian realization of our general framework is that it provides full insight into the model’s uncertainty. Due to BayesHyperShot’s probabilistic construction, we can quantify the level of the model’s certainty and answer whether the model is confident about a specific prediction. Moreover, this property enables us to tell whether a given sample comes from the known distribution, i.e., the distribution related to the current few-shot task, or whether it is an out-of-distribution example.
In the following experiment, we demonstrate the uncertainty quantification of the BayesHyperShot model. We consider a model trained on a specific dataset and setting (here, Omniglot \(\rightarrow \) EMNIST). We then take three different types of samples to quantify the level of uncertainty of our model’s predictions. Specifically, we test images from the following sources:

Support set;

Query set;

The same dataset, but with classes not present in the support set (out-of-distribution examples).
We observe that the examples coming from the support set or the query set are classified with very high certainty. The model classifies such images as belonging to a given class (label 1) or not (label 0). Note that label 1 appears for only one class, and label 0 for the rest.
The most important result, however, is obtained for samples of classes not present in the support set. We observe that the model is uncertain about the classification: it usually assigns nonzero probabilities to each of the possible classes. The resulting increase in the entropy of the predicted probabilities allows us to identify such examples as out-of-distribution samples. The results are presented in Fig. 3.
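To make the entropy criterion concrete, the following sketch scores a predictive distribution by its Shannon entropy and flags high-entropy samples as out-of-distribution. The probability vectors and the threshold value are illustrative assumptions, not outputs of the trained model; in practice, the threshold would be tuned on validation tasks.

```python
import numpy as np

def predictive_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a predictive probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def is_out_of_distribution(p, threshold=1.0):
    """Flag a sample as OOD when its predictive entropy exceeds a threshold.

    The threshold here is a hypothetical value chosen for illustration.
    """
    return predictive_entropy(p) > threshold

# Illustrative 5-way predictions: a confident in-distribution sample vs. an
# OOD sample that receives nonzero probability for every class.
p_in_dist = np.array([0.97, 0.01, 0.01, 0.005, 0.005])
p_ood = np.array([0.28, 0.22, 0.20, 0.18, 0.12])
```

A peaked distribution yields low entropy, while the spread-out OOD prediction crosses the threshold, matching the behavior illustrated in Fig. 3.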
4.4 Ablation study
In order to investigate the different architectural choices made when adapting our framework to a specific task, we provide a comprehensive ablation study. For clarity, we consider only the HyperShot model in this study and apply our findings when selecting the parameters of BayesHyperShot. We focus on the four major components of the HyperShot design, i.e., the method of processing multiple support examples per class, the number of neck layers, the number of head layers and the size of the hidden layers, as presented in Tables 5, 6 and 7. In the experiments on aggregating support examples in the five-way five-shot setting, we perform the benchmarks on CUB and miniImageNet using a four-layer convolutional backbone. In the remaining experiments, we test HyperShot on the CUB dataset in the five-way one-shot setting with a ResNet-10 backbone.
Aggregating support examples in the five-shot setting
In HyperShot, the hypernetwork generates the classifier’s weights from information about the support examples, expressed through the support–support kernel matrix. In five-way one-shot classification, each task consists of 5 support examples; therefore, the kernel matrix has size \((5 \times 5)\) and the input size of the hypernetwork is 25. However, with a growing number of support examples, increasing the size of the kernel matrix would be impractical and could lead to overparametrization of the hypernetwork.
Since hypernetworks are known to be sensitive to large input sizes [4], we consider a way to maintain a constant input size of HyperShot, independent of the number of support examples per class: we use the means of the support embeddings of each class for kernel calculation, instead of the individual embeddings. Prior work suggests that when there are multiple examples of a class, the averaged embedding represents the class sufficiently well in the embedding space [19].
To verify this approach, in the five-shot setting, we train HyperShot with two variants of calculating the inputs to the kernel matrix:

Fine-grained—utilizing a hypernetwork that takes as input the kernel matrix between the embeddings of the individual support examples. This kernel matrix has a shape of \((25 \times 25)\).

Averaged—utilizing a hypernetwork where the kernel matrix is calculated between the means of embeddings of each class. The kernel matrix in this approach has a shape of \((5 \times 5)\).
We benchmark both variants of HyperShot on the five-shot classification task on the CUB and miniImageNet datasets, as well as on the cross-domain Omniglot \(\rightarrow \) EMNIST classification task. We report the accuracies in Table 5. It is evident that averaging the embeddings before calculating the kernel matrix yields superior results.
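The two input-construction variants above can be sketched as follows. Cosine similarity plays the role of the kernel function, as in our experiments; the embedding dimension and the random embeddings are stand-ins for real backbone features.

```python
import numpy as np

def cosine_kernel(X):
    """Pairwise cosine-similarity matrix between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

rng = np.random.default_rng(0)
n_way, n_shot, dim = 5, 5, 64  # five-way five-shot; dim is an assumed embedding size
support = rng.normal(size=(n_way, n_shot, dim))  # stand-in backbone embeddings

# Fine-grained variant: kernel between all 25 individual support embeddings,
# giving a (25 x 25) matrix and a flattened hypernetwork input of size 625.
K_fine = cosine_kernel(support.reshape(n_way * n_shot, dim))

# Averaged variant: kernel between the 5 per-class mean embeddings,
# giving a (5 x 5) matrix and a flattened hypernetwork input of size 25.
K_avg = cosine_kernel(support.mean(axis=1))
```

The averaged variant keeps the hypernetwork input size constant regardless of the number of shots, which is the property motivating this ablation.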
Hidden size: First, as presented in Table 6, we compare different sizes of the hidden layers. The results agree with the intuition that the wider the layers, the better the results. However, we also observe that some hidden sizes (e.g., 8188) can be too large to learn effectively. We therefore propose hidden sizes of 2048 or 4096 as the standard.
Neck and head layers: Next, we compare the influence of the number of neck layers and head layers of HyperShot on the achieved results, as presented in Table 7. We observe that the number of head layers, which are specific to each of the target network’s layers, is the most critical. We therefore propose using the standard number of 3 head layers and varying the number of neck layers, tuning it to the specific task.
5 Conclusion
In this work, we introduced a novel general framework that combines kernel methods with hypernetworks. Our method relies directly on kernel-based representations of the support examples and the hypernetwork paradigm to create the classification module for the query set. We concentrate on relations between the embeddings of the support examples instead of their direct feature values. Thanks to this approach, models that realize our framework can adapt to highly different tasks.
Specifically, in this paper, we propose a classical (HyperShot) and a Bayesian (BayesHyperShot) realization of our framework. We evaluate both models on various one-shot and few-shot image classification tasks. Both HyperShot and BayesHyperShot demonstrate high accuracy in all tasks, performing comparably to or better than state-of-the-art solutions. Moreover, both models have a strong ability to generalize, as evidenced by their performance on cross-domain classification tasks.
Finally, we demonstrate the ability of the Bayesian version of our framework to properly quantify the uncertainty of the model’s predictions. As such, BayesHyperShot is able to recognize out-of-distribution samples and return the level of certainty of the classification.
Notes
In the case of the adapted hypernetworks, we tune a copy of the hypernetwork on the support set separately for each validation task. This way, we ensure that our model does not take unfair advantage of the validation tasks.
References
Patacchiola, M., Turner, J., Crowley, E.J., O’Boyle, M., Storkey, A.J.: Bayesian meta-learning for the few-shot setting via deep kernels. Adv. Neural Inf. Process. Syst. 33, 16108–16118 (2020)
Sendera, M., Tabor, J., Nowak, A., Bedychaj, A., Patacchiola, M., Trzcinski, T., Spurek, P., Zieba, M.: Non-Gaussian Gaussian processes for few-shot regression. Adv. Neural Inf. Process. Syst. 34, 10285–10298 (2021)
Wang, Z., Miao, Z., Zhen, X., Qiu, Q.: Learning to learn dense Gaussian processes for few-shot learning. Adv. Neural Inf. Process. Syst. 34, 13230–13241 (2021)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017). https://openreview.net/forum?id=rkpACe1lx
Qiao, S., Liu, C., Shen, W., Yuille, A.L.: Few-shot image recognition by predicting parameters from activations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238 (2018)
Ye, H.J., Hu, H., Zhan, D.C., Sha, F.: Few-shot learning via embedding adaptation with set-to-set functions. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8808–8817 (2020)
Zhmoginov, A., Sandler, M., Vladymyrov, M.: HyperTransformer: Model generation for supervised and semi-supervised few-shot learning. In: International Conference on Machine Learning, pp. 27075–27098. PMLR (2022)
Zhu, Z., Wang, L., Guo, S., Wu, G.: A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark. arXiv (2021). https://doi.org/10.48550/ARXIV.2110.12358. arXiv:2110.12358
Perrett, T., Masullo, A., Burghardt, T., Mirmehdi, M., Damen, D.: Temporal-relational cross-transformers for few-shot action recognition. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 475–484 (2021). https://doi.org/10.1109/CVPR46437.2021.00054
Chen, W.Y., Liu, Y.C., Kira, Z., Wang, Y.C.F., Huang, J.B.: A closer look at few-shot classification. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=HkxLXnAcFQ
Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. (CSUR) 53(3), 1–34 (2020)
Sheikh, A.S., Rasul, K., Merentitis, A., Bergmann, U.: Stochastic maximum likelihood optimization via hypernetworks. arXiv preprint arXiv:1712.01141 (2017)
Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. In: 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, pp. 10–21. Association for Computational Linguistics (ACL) (2016)
Bengio, S., Bengio, Y., Cloutier, J., Gescei, J.: On the optimization of a synaptic learning rule. In: Optimality in Biological and Artificial Networks, pp. 281–303 (2013)
Hospedales, T., Antoniou, A., Micaelli, P., Storkey, A.: Meta-learning in neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 5149–5169 (2021)
Schmidhuber, J.: Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Comput. 4(1), 131–139 (1992). https://doi.org/10.1162/neco.1992.4.1.131
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q.: A comprehensive survey on transfer learning. Proc. IEEE (2020). https://doi.org/10.1109/JPROC.2020.3004555
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. Adv. Neural. Inf. Process. Syst. 29, 3630–3638 (2016)
Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. 30 (2017)
Oreshkin, B., Rodríguez López, P., Lacoste, A.: TADAM: Task dependent adaptive metric for improved few-shot learning. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31 (2018). https://proceedings.neurips.cc/paper_files/paper/2018/file/66808e327dc79d135ba18e051673d906Paper.pdf
Hu, Y., Gripon, V., Pateux, S.: Leveraging the feature distribution in transfer-based few-shot learning. In: Artificial Neural Networks and Machine Learning–ICANN 2021: 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part II 30, pp. 487–499. Springer (2021)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Lee, K., Maji, S., Ravichandran, A., Soatto, S.: Meta-learning with differentiable convex optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657–10665 (2019)
Li, Z., Zhou, F., Chen, F., Li, H.: Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835 (2017)
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135. PMLR (2017)
Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018)
Antoniou, A., Edwards, H., Storkey, A.: How to train your MAML. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=HJGven05Y7
Fan, C., Ram, P., Liu, S.: Sign-MAML: Efficient model-agnostic meta-learning by SignSGD. In: 5th Workshop on Meta-Learning at NeurIPS 2021 (2021)
Ye, H.J., Chao, W.L.: How to train your MAML to excel in few-shot classification. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=49h_IkpJtaE
Rajeswaran, A., Finn, C., Kakade, S.M., Levine, S.: Meta-learning with implicit gradients. Adv. Neural. Inf. Process. Syst. 32, 113–124 (2019)
Przewięźlikowski, M., Przybysz, P., Tabor, J., Zięba, M., Spurek, P.: HyperMAML: Few-shot adaptation of deep models with hypernetworks. arXiv preprint arXiv:2205.15745 (2022)
Bauer, M., Rojas-Carulla, M., Świątkowski, J.B., Schölkopf, B., Turner, R.E.: Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326 (2017)
Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y.W., Rezende, D., Eslami, S.M.A.: Conditional neural processes. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 1704–1713 (2018)
Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., Ahn, S.: Bayesian model-agnostic meta-learning. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7343–7353 (2018)
Grant, E., Finn, C., Levine, S., Darrell, T., Griffiths, T.: Recasting gradient-based meta-learning as hierarchical Bayes. In: International Conference on Learning Representations (2018)
Borycki, P., Kubacki, P., Przewięźlikowski, M., Kuśmierczyk, T., Tabor, J., Spurek, P.: Hypernetwork approach to Bayesian MAML. arXiv preprint arXiv:2210.02796 (2022)
Ravi, S., Beatson, A.: Amortized Bayesian meta-learning. In: International Conference on Learning Representations (2018)
Nguyen, C., Do, T.T., Carneiro, G.: Uncertainty in model-agnostic meta-learning using variational inference. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3090–3100 (2020)
Jerfel, G., Grant, E., Griffiths, T.L., Heller, K.: Reconciling meta-learning and continual learning with online mixtures of tasks. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 9122–9133 (2019)
Gordon, J., Bronskill, J., Bauer, M., Nowozin, S., Turner, R.: Meta-learning probabilistic inference for prediction. In: International Conference on Learning Representations (2018)
Rasmussen, C.E.: Gaussian processes in machine learning. In: Summer School on Machine Learning, pp. 63–71. Springer (2003)
Shen, X., Xiao, Y., Hu, S.X., Sbai, O., Aubry, M.: Re-ranking for image retrieval and transductive few-shot classification. In: Advances in Neural Information Processing Systems, vol. 34, pp. 25932–25943 (2021)
Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., Cord, M.: Boosting few-shot visual learning with self-supervision. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology (2011)
Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: ICLR (2017)
Rajasegaran, J., Khan, S.H., Hayat, M., Khan, F.S., Shah, M.: Meta-learning the learning trends shared across tasks. CoRR abs/2010.09291 (2020)
Sendera, M., Przewięźlikowski, M., Karanowski, K., Zięba, M., Tabor, J., Spurek, P.: HyperShot: Few-shot learning by kernel hypernetworks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2469–2478 (2023)
Snell, J., Zemel, R.: Bayesian few-shot classification with one-vs-each Pólya-gamma augmented Gaussian processes. In: International Conference on Learning Representations (2020)
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., Isola, P.: Rethinking few-shot image classification: a good embedding is all you need? In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pp. 266–282. Springer (2020)
Rizve, M.N., Khan, S., Khan, F.S., Shah, M.: Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10836–10846 (2021)
Jian, Y., Torresani, L.: Label hallucination for few-shot classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 7005–7014 (2022)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 33 (2011)
Cohen, G., Afshar, S., Tapson, J., van Schaik, A.: EMNIST: Extending MNIST to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926 (2017). https://doi.org/10.1109/IJCNN.2017.7966217
Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (2014)
Acknowledgements
This research was funded in part by National Science Centre, Poland, 2022/45/N/ST6/03374. The work of M. Sendera was supported by the National Centre of Science (Poland) Grant No. 2022/45/N/ST6/03374. The work of J. Tabor was supported by the National Centre of Science (Poland) Grant No. 2019/33/B/ST6/00894. The work of P. Spurek and M. Przewiȩźlikowski was supported by the National Centre of Science (Poland) Grant No. 2021/43/B/ST6/01456. The work of M. Ziȩba was supported by the National Centre of Science (Poland) Grant No. 2020/37/B/ST6/03463.
Appendices
Appendix A. Comparison to the original HyperShot publication
This work is an extension of “HyperShot: Few-Shot Learning by Kernel Hypernetworks,” originally published at the WACV 2023 conference [47]. In this section, we address the points raised by the reviewers of the original publication and outline which parts of this work are extensions of the original publication.
A.1 Addressing the WACV 2023 reviews
Hypernetwork architecture The reviewers raised a concern about the lack of a detailed description of our hypernetwork architecture. While the general architecture had been visualized as part of Fig. 4, we have also expanded Sect. 2.2 to include a more detailed description of the architecture.
Comparisons to related work The reviewers suggested several additional works to which we could compare HyperShot. While we added several of them to the tables reported in our experiments [49, 50], we chose to omit others [42, 43, 51], as the settings of the few-shot problems they solve differ from standard inductive few-shot classification.
Application to other tasks besides image classification An important point raised by the reviewers was the lack of experiments on problems other than few-shot image classification. While we acknowledge that there are various computer vision problems for which data-efficient solutions would be desirable, such experiments are out of the scope of our work, as HyperShot and BayesHyperShot were tailor-made for the classification problem, which is by far the most popular few-shot benchmark. Applying kernel hypernetworks to other few-shot problems could be an interesting direction for future work.
A.2 Extensions compared to the WACV 2023 paper
The most important difference of this work compared to the WACV 2023 publication is the generalization of the previously introduced HyperShot method to a Bayesian framework, resulting in a model which we call BayesHyperShot. We conduct a series of experiments and show that BayesHyperShot, apart from learning from small amounts of data, is capable of modeling the uncertainty of its predictions.
Appendix B. Training details
In this section, we present in detail the architecture and hyperparameters of HyperShot.
Architecture overview
From a highlevel perspective, the architecture of HyperShot consists of three parts:

Backbone—a convolutional feature extractor.

Neck—a sequence of zero or more fully connected layers with ReLU nonlinearities in between.

Heads—for each parameter of the target network, a sequence of one or more linear layers that predicts the values of that parameter. All heads of HyperShot have identical lengths and hidden sizes, while their output sizes depend on the generated parameter’s size.
The target network generated by both HyperShot and BayesHyperShot reuses its backbone. We outline this architecture in Fig. 4.
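The data flow through the neck and heads can be illustrated with a small NumPy forward pass. The random untrained weights, the reduced hidden size (128 instead of the 4096 used in the experiments), and the single-layer target classifier are all assumptions made to keep the sketch compact.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def dense(n_in, n_out):
    # Random weight matrix; a trained hypernetwork would learn these values.
    return rng.normal(scale=0.1, size=(n_in, n_out))

def run_layers(h, layers):
    # Apply linear layers with ReLU nonlinearities in between (none after the last).
    for W in layers[:-1]:
        h = relu(h @ W)
    return h @ layers[-1]

# Input: flattened (5 x 5) support-support kernel matrix of a five-way one-shot task.
kernel_input = rng.normal(size=25)

hidden = 128             # reduced from 4096 for this sketch
feat_dim, n_way = 64, 5  # assumed backbone feature size; five classes

# Neck: shared fully connected layers processing the kernel input.
neck = [dense(25, hidden), dense(hidden, hidden)]
h = run_layers(kernel_input, neck)

# Heads: one per target-network parameter (here: the weight and bias of a
# single fully connected classification layer).
weight_head = [dense(hidden, hidden), dense(hidden, feat_dim * n_way)]
bias_head = [dense(hidden, hidden), dense(hidden, n_way)]
target_W = run_layers(h, weight_head).reshape(feat_dim, n_way)
target_b = run_layers(h, bias_head)

# The generated target network classifies a query embedding.
query_embedding = rng.normal(size=feat_dim)
logits = query_embedding @ target_W + target_b
```

Note how the heads share the neck output but differ in output size, matching the description above: each head emits exactly as many values as its target parameter holds.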
Backbone For each experiment described in the main body of this work, we follow [1] in using a shallow backbone (feature extractor) for HyperShot as well as for the reference models. This backbone consists of four convolutional blocks, each comprising a convolution, batch normalization and a ReLU nonlinearity. Apart from the first convolution, whose input size equals the number of image channels, each convolution has an input and output size of 64. We apply max-pooling after each block, which halves the resolution of the processed feature maps. The output of the backbone is flattened so that the subsequent layers can process it.
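As a sanity check on this design, the flattened embedding size can be computed from the pooling arithmetic alone. The 84×84 input resolution for the natural-image datasets and the size-preserving convolution padding are assumptions (they are standard in this benchmark pipeline but are not stated explicitly above).

```python
def conv4_embedding_size(img_hw, channels=64, n_blocks=4):
    """Flattened output size of the four-block backbone: each convolution is
    assumed to preserve spatial size, and each max-pooling halves it
    (floor division for odd sizes)."""
    hw = img_hw
    for _ in range(n_blocks):
        hw //= 2
    return channels * hw * hw

# 84x84 inputs: 84 -> 42 -> 21 -> 10 -> 5, so 64 * 5 * 5 = 1600 features.
mini_imagenet_features = conv4_embedding_size(84)
# 28x28 inputs (Omniglot / EMNIST): 28 -> 14 -> 7 -> 3 -> 1, so 64 features.
omniglot_features = conv4_embedding_size(28)
```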
Datasets To make a fair comparison, we follow the procedure presented in, e.g., [1, 10]. In the case of the CUB dataset [44], we split all 200 classes (11,788 images) into train, validation and test folds consisting of 100, 50 and 50 classes, respectively [10]. The miniImageNet dataset [45] is a subset of ImageNet [52] that consists of 100 classes represented by 600 images each. We follow the standard procedure and divide miniImageNet into 64 classes for the train set, 16 for the validation set and the remaining 20 classes for the test set. The well-known Omniglot dataset [53] is a collection of characters from 50 different languages, containing 1623 black-and-white characters in total. We utilize the standard procedure of including examples rotated by \(90^\circ \), which increases the size of the dataset to 6492 classes, of which 4114 are used in training. Finally, the EMNIST dataset [54] collects characters and digits from the English alphabet, which we split into 31 classes for the test set and 31 for validation.
Data augmentation We apply data augmentation during model training in all experiments, except Omniglot \(\rightarrow \) EMNIST cross-domain classification. The augmentation pipeline is identical to the one used by [1] and consists of random crop, horizontal flip and color jitter steps.
Appendix C. Hyperparameters
Below, we outline the hyperparameters of architecture and training procedures used in each experiment.
We use cosine similarity as the kernel function and averaged support-embedding aggregation in all experiments. HyperShot is trained with the Adam optimizer [55] with a learning rate of 0.001 and no learning rate scheduler. Task-specific adaptation is also performed with the Adam optimizer, with a learning rate of 0.0001.
For the natural-image tasks (CUB, miniImageNet, and miniImageNet \(\rightarrow \) CUB classification), we use a hypernetwork with a neck length of 2, head lengths of 3 and a hidden size of 4096, which produces a target network with a single fully connected layer. We perform training for 10,000 epochs.
For the simpler Omniglot \(\rightarrow \) EMNIST character classification task, we train a smaller hypernetwork with a neck length of 1, head lengths of 2 and a hidden size of 512, which produces a target network with two fully connected layers and a hidden size of 128. We train this hypernetwork for fewer epochs, namely 2000.
We summarize all the above hyperparameters in Table 8.
Appendix D. Source code
The source code required for running the experiments is available at https://github.com/gmum/few-shot-hypernets-public.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sendera, M., Przewiȩźlikowski, M., Miksa, J. et al. The general framework for fewshot learning by kernel HyperNetworks. Machine Vision and Applications 34, 53 (2023). https://doi.org/10.1007/s00138023014034