1 Introduction

Image classification [67, 142] is an important application in computer vision [4, 162] and machine learning [91, 193]. With the continuous development of deep learning [5, 79, 132], recent years have witnessed great breakthroughs in this area [48, 153]. However, such success relies on a huge amount of data [22, 136] (usually on the order of millions of samples), which is difficult and time-consuming to collect in the real world. To reduce the data requirement, there has been growing interest in small-sample image classification [80, 140, 201], such as few-shot classification [1, 18, 115], which learns a classification rule from only a few (1-5) labeled samples.

A core challenge in few-shot image classification is to alleviate the susceptibility of models to overfitting in the few-data regime [27, 110, 168]. To address this problem, researchers have proposed several promising approaches, such as transfer learning [123, 203], meta-learning [38, 122, 145] and data augmentation [7, 16, 57]. In transfer learning, a model is first trained on a source domain where abundant source data is available. This trained model is then fine-tuned [15, 137, 195] on a target domain with few labeled target samples. The learned prior knowledge is thereby transferred from source tasks to target tasks. Meta-learning, or learning to learn, has emerged as one of the prominent approaches for few-shot learning. It trains a meta-learner that can quickly generalize to new tasks with few examples [33, 45, 165, 178]. A meta-learning procedure involves learning at two levels, within and across tasks. Meta-learning approaches simulate the tasks that will be presented at inference through episodic training [116, 170, 202], enabling the meta-learner to generalize within a few adaptation steps. Data augmentation methods are often used as preprocessing in few-shot learning (FSL). To address the problem of insufficient training data, they expose the model to additional variations of the existing data. For image classification, one commonly used method is deformation [69, 119, 164, 185], including horizontal flipping, cropping and rotation. Besides these, more advanced methods, such as generating training samples and pseudolabels [28, 29, 192], are also an important part of data augmentation.

In this paper, we present a survey of recent meta-learning methods for few-shot image classification. Meta-learning focuses on learning prior knowledge from previous tasks, which enables efficient learning of new downstream tasks. This learning mechanism enables models to learn new concepts quickly when only a few samples are available. Meta-learning deserves special attention as it is an essential part of few-shot image classification and it has also demonstrated outstanding performance on benchmark datasets [64, 144]. To be specific, in this survey we divide meta-learning into three categories according to their different mechanisms, namely metric-based, model-based and optimization-based methods [40, 58, 89, 166].

A number of surveys on FSL have been published. In 2018, Shu et al. [140] provided an early survey on small-sample learning, discussing approaches for different scenarios (zero-shot learning [124, 179, 180] and FSL) and tasks (image classification, visual question answering [6, 90, 139] and object detection [62, 114, 151]). Wang et al. [167] conducted a comprehensive review in 2021, which provides a formal definition of FSL and distinguishes it from other machine learning problems, exploring FSL from the fundamental viewpoint of error decomposition in supervised learning. Li et al. [74] published another comprehensive review on FSL in 2021, which is entirely focused on meta-learning and reviews literature [39, 43, 44, 156] over a long period in this area. Another review on few-shot image classification [76], published in 2023, is fully devoted to metric learning methods [103, 141, 188]. Compared with these surveys [74, 76, 140, 167], our review presents an up-to-date survey of meta-learning approaches for few-shot image classification and provides a thorough analysis of these different kinds of methods to better understand their individual strengths and limitations.

The remainder of this survey is organized as follows. In Sect. 2, we provide the preliminary concepts of meta-learning, including the definition of few-shot image classification, commonly used datasets and the evaluation procedure. In Sect. 3, we introduce the categories of meta-learning methods and review both classical and state-of-the-art meta-learning approaches. We also present other kinds of few-shot learning methods for comparison. In Sect. 4, we discuss the major challenges, along with future directions. Finally, we conclude this survey in Sect. 5.

2 The framework of few-shot image classification

2.1 Notation and definitions

In this section, we first present a brief introduction about few-shot learning and meta-learning, and then provide the notation and unified definitions of few-shot image classification [23, 56, 155].

Few-shot learning is a surprising research area that focuses on learning patterns from a set of data (base classes) and then adapting to a disjoint set (novel classes) with limited training samples. Among few-shot problems, few-shot image classification has received the most attention and research. As the most popular approach for few-shot learning, meta-learning organizes the learning process into two phases, called meta-training and meta-testing. During each phase, the meta-training set or meta-testing set is split into multiple episodes. Each episode is sampled from the task distribution and is further divided into a small training set and a testing set.

In the standard few-shot image classification setting, two distinct datasets are involved, namely the base dataset \({{D}_{{ base}}}=\left\{ \left( {{x}_{i}},{{y}_{i}}\right) ;{{x}_{i}}\in {{X}_{base}},{{y}_{i}}\in {{Y}_{base}} \right\} _{i=1}^{{{N}_{base}}}\) and the novel dataset \({{D}_{{ novel}}}=\left\{ \left( {{x}_{i}},{{y}_{i}}\right) ;{{x}_{i}}\in {{X}_{{ novel}}},{{y}_{i}}\in {{Y}_{{ novel}}} \right\} _{i=1}^{{{N}_{{ novel}}}}\), where \({{x}_{i}}\) represents the original feature vector of the i-th image and \({{y}_{i}}\) is the corresponding class label; \({{N}_{{ base}}}\) and \({{N}_{{ novel}}}\) denote the total numbers of instances in \({{D}_{{ base}}}\) and \({{D}_{{ novel}}}\), respectively. The base dataset is an auxiliary dataset that is used to train the classifier to learn some prior or shared knowledge, and the novel dataset is used for the classifier to perform new classification tasks. Note that \({{D}_{{base}}}\) and \({{D}_{{novel}}}\) are disjoint, which means \({{Y}_{base}}\cap {{Y}_{{novel}}}=\varnothing \). In order to train and test the classifier, \({{D}_{{novel}}}\) is usually split into the support set \({{D}_{S}}\) and the query set \({{D}_{Q}}\), which share the same label space.

Definition 1

The few-shot image classification task aims to learn a classifier from \({{D}_{{base}}}\) and \({{D}_{S}}\) to correctly classify the samples in \({{D}_{Q}}\). It is generally termed an N-way K-shot problem, where N denotes the number of classes in \({{D}_{S}}\) and K denotes the number of labeled instances per class. If \({K=1}\), it becomes a one-shot image classification task; and if \({K=0}\), then the task is called zero-shot classification.

Definition 2

A few-shot image classification task is called cross-domain few-shot image classification when the base dataset and the novel dataset are from two different domains, i.e., \({{X}_{{ base}}}\ne {{X}_{{ novel}}}\).

Fig. 1 Sample images of these benchmark datasets widely used for few-shot image classification

2.2 Datasets

In this section, we briefly introduce several well-known datasets for few-shot image classification. According to their data types, we categorize them into simple image datasets (Omniglot [70]), complex image datasets (MiniImageNet [120, 161], TieredImageNet [122], CIFAR-FS [10] and FC100 [107]) and special image datasets (CUB-200 [163, 175]). Among these datasets, CIFAR-FS and FC100 are considered more difficult because their images have a resolution of only \(32\times 32\), and it is more challenging for models to extract useful information from low-resolution images. Statistics of these datasets and popular experimental settings are summarized below. We also present some sample images from these benchmark datasets in Fig. 1.

Omniglot is one of the most frequently used benchmarks for evaluating few-shot image classification algorithms. It contains 1623 handwritten characters collected from 50 different alphabets. Each character has 20 samples, each drawn by a different human subject. This dataset is usually augmented by rotations in multiples of 90 degrees; 1200 characters are used for training and the rest for evaluation.
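The rotation-based augmentation mentioned above can be illustrated with a minimal sketch. It assumes an image is stored as a plain list of pixel rows; the helper names `rotate90` and `rotations` are hypothetical and are not taken from any Omniglot toolkit.

```python
def rotate90(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees clockwise."""
    # Reversing the row order and transposing gives a clockwise rotation.
    return [list(row) for row in zip(*image[::-1])]


def rotations(image):
    """Return copies of the image rotated by 0, 90, 180 and 270 degrees."""
    out = [[list(row) for row in image]]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out


# Toy 2x2 "image" (hypothetical pixel values) and its four rotated copies.
augmented = rotations([[1, 2], [3, 4]])
```

How the rotated copies are labeled (for example, as extra samples or as additional classes) is a choice of the experimental protocol and is not fixed by this sketch.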

MiniImageNet and TieredImageNet are two mini versions of the large ImageNet dataset [129]. MiniImageNet is composed of 60,000 color images from 100 classes, with 600 images in each class. Following the widely used splitting protocol proposed by Ravi and Larochelle [120], 64 classes are used for training, 16 classes for validation and 20 classes for evaluation. TieredImageNet is a larger subset of ImageNet with a hierarchical structure. It contains 779,165 images from 34 high-level categories (or 608 classes), which are further split into 20 base categories (351 classes), 6 validation categories (97 classes) and 8 novel categories (160 classes).

CIFAR-FS and FC100 are two widely used datasets derived from CIFAR-100 [68]. CIFAR-FS is constructed from 100 classes with 600 images per class. The 100 classes are split into 64, 16 and 20 classes for training, validation and evaluation, respectively. FC100 also contains 100 classes, which are grouped into 20 super-categories with five classes in each super-category. FC100 is split into 12 base, 4 validation and 4 novel super-categories.

CUB-200 is a fine-grained dataset consisting of 200 bird species. The CUB-200 dataset has two versions: the initial version, proposed in 2010 [175], includes 6033 images, and it was extended to 11,788 images in 2011 [163]. The CUB-200-2010 dataset is often split into 130 base, 20 validation and 50 novel classes [85], while the CUB-200-2011 dataset is divided into 100 classes for training, 50 classes for validation and 50 classes for testing [18].

MiniImageNet \(\rightarrow \) CUB is a dataset designed for cross-domain few-shot image classification [93, 159, 169, 190]. MiniImageNet plays the role of the base dataset, while 50 classes of CUB-200-2011 are used for validation, and the remaining 50 classes serve for evaluation.

Fig. 2 An overview of few-shot image classification approaches

2.3 Evaluation process of few-shot image classification

In this section, we present a general procedure [30, 55, 148, 177, 200] to evaluate a classifier’s performance on N-way K-shot image classification problems in Algorithm 1. The whole evaluation process is composed of many episodes. In each episode, we first randomly select N classes from the novel label space with K samples in each class to form a support set \({D}_{S}\), and then select M examples from the remaining samples of those N classes to compose a query set \({D}_{Q}\). A final classifier is obtained based on the base dataset and the support set, and is used to predict the labels of the samples in \({D}_{Q}\). We use \({acc}^{\left( e\right) }\) to denote the classification accuracy in the e-th episode, and the performance of a learning algorithm is measured by the average classification accuracy over all episodes.
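The episodic evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the query set takes M samples per class, the classifier is a placeholder nearest-class-mean rule that ignores the base dataset, the 95% interval uses a normal approximation, and the toy data and all function names are hypothetical.

```python
import math
import random
import statistics


def sample_episode(novel_data, n_way, k_shot, m_query, rng):
    """Draw one N-way K-shot episode: a support set D_S and a query set D_Q."""
    classes = rng.sample(sorted(novel_data), n_way)
    support, query = [], []
    for label in classes:
        # K support samples plus M query samples per class, drawn without overlap.
        picked = rng.sample(novel_data[label], k_shot + m_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


def nearest_centroid_predict(support, x):
    """Placeholder classifier (an assumption): nearest class mean of the support set."""
    sums, counts = {}, {}
    for features, label in support:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def sq_dist(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=sq_dist)


def evaluate(novel_data, n_way, k_shot, m_query, n_episodes, seed=0):
    """Return the mean episode accuracy and the half-width of a 95% interval."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_episodes):
        support, query = sample_episode(novel_data, n_way, k_shot, m_query, rng)
        correct = sum(nearest_centroid_predict(support, x) == y for x, y in query)
        accuracies.append(correct / len(query))  # acc^(e) for this episode
    mean_acc = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n_episodes)
    return mean_acc, half_width


def make_toy_data(n_classes=10, per_class=20, seed=1):
    """Hypothetical 2-D features standing in for image embeddings of novel classes."""
    rng = random.Random(seed)
    return {
        c: [[c * 5.0 + rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(per_class)]
        for c in range(n_classes)
    }


toy = make_toy_data()
mean_acc, ci = evaluate(toy, n_way=5, k_shot=1, m_query=5, n_episodes=200)
print(f"5-way 1-shot accuracy: {100 * mean_acc:.2f} +/- {100 * ci:.2f} %")
```

In practice the placeholder classifier would be replaced by the method under evaluation, which may also use knowledge learned from the base dataset.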

figure a
Fig. 3 Siamese network architecture

3 Paradigms of meta-learning for few-shot image classification

The goal of meta-learning for few-shot image classification [24, 61, 82, 94, 111] is to enable models, especially deep neural networks, to perform well on new tasks when only a few samples are available. With the rapid development of few-shot learning [50, 183, 205], a number of meta-learning approaches [19, 102, 184] have been proposed. In this section, we provide a comprehensive overview of recent meta-learning studies and their advances. To make the material accessible to beginners, we follow the main trend and categorize meta-learning into metric-based, model-based and optimization-based methods. We also present other few-shot learning methods for comparison. Figure 2 shows an overview of few-shot image classification.

3.1 Metric-based meta-learning

Metric-based meta-learning methods [49, 72, 75, 194] aim to learn a distance metric that effectively measures the similarity among samples and remains suitable for new learning tasks. For few-shot image classification problems, the learned metric should assign a small distance to samples from the same class and a large distance to samples from different classes.

Fig. 4 Prototypical network architecture

Siamese network is one of the most widely used metric-based methods for one-shot image classification. The term “Siamese” was first proposed for signature verification [13] and the principal structure of Siamese network was introduced for the fingerprint similarity estimation problem [9]. In 2015, Koch et al. [65] adopted a pair of identical VGG-styled [142] convolutional layers with shared weights to extract high-level features from two input images and calculate the weighted \({L}_{1}\) distance between the two feature vectors. The network finally outputs a score, representing the probability that the two images belong to the same class. The architecture of Siamese network is shown in Fig. 3. Wang et al. [173] proposed an attention-based Siamese network, which exploits an attention kernel function to measure the similarity between two feature vectors. To bridge the gap between one-shot image recognition [17, 32, 160] and regular classification, Lungu et al. [88] proposed a multi-resolution Siamese network, which mixes different kernel size streams into one layer and adopts a hybrid training mechanism.
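The scoring step of the Siamese network can be sketched as below. This is an illustration under assumptions rather than the implementation of [65]: the embeddings are hypothetical vectors standing in for the outputs of the shared-weight convolutional twins, the per-dimension weights `alpha` and the `bias` would normally be learned, and a sigmoid is assumed as the function that maps the weighted \({L}_{1}\) distance to a probability.

```python
import math


def siamese_score(h1, h2, alpha, bias=0.0):
    """Probability-like score from a weighted L1 distance between two embeddings."""
    weighted_l1 = sum(a * abs(u - v) for a, u, v in zip(alpha, h1, h2))
    return 1.0 / (1.0 + math.exp(-(weighted_l1 + bias)))


def one_shot_classify(query, support, alpha, bias=0.0):
    """Return the label whose single support embedding scores highest against the query."""
    return max(support, key=lambda label: siamese_score(query, support[label], alpha, bias))


# Hypothetical learned weights: negative values make larger distances lower the score.
alpha = [-2.0, -2.0]
bias = 2.0
support = {"a": [0.0, 1.0], "b": [1.0, 0.0]}  # one embedding per class (one-shot)
predicted = one_shot_classify([0.1, 0.9], support, alpha, bias)
```

For one-shot classification, the query is compared with the single support image of every class and assigned to the class with the highest score.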

As another powerful metric-based meta-learning method, matching network [161] uses different networks to encode support and query images. To embed support images, a bidirectional long short-term memory (LSTM) [198] is used in the context of the support set \({D}_{S}\); to embed query images, an LSTM with an attention kernel is used so that the query embedding depends on \({D}_{S}\). The attention kernel [12, 105, 106] computes cosine similarities between support and query images and then normalizes the similarities through a softmax function. Matching network’s output is defined as a sum of the labels (one-hot encoded) of support images weighted by the attention kernel. In 2019, Mai et al. [92] proposed an attentive matching network (AMN), introducing a feature-level attention mechanism that pays more attention to the features that better reflect inter-class differences, together with a complementary cosine loss function for optimization.
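The attention-weighted prediction of matching network can be sketched as follows. The sketch assumes that embeddings are already available as non-zero vectors (the LSTM-based encoders are omitted) and that labels are integer class indices; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def attention(query, support_embeddings):
    """Softmax over cosine similarities between the query and each support embedding."""
    sims = [cosine(query, s) for s in support_embeddings]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def matching_predict(query, support_embeddings, support_labels, n_classes):
    """Sum of one-hot support labels weighted by the attention kernel."""
    weights = attention(query, support_embeddings)
    probs = [0.0] * n_classes
    for w, label in zip(weights, support_labels):
        probs[label] += w  # adding w to one entry equals w times a one-hot vector
    return probs


# Three hypothetical support embeddings from two classes and one query embedding.
probs = matching_predict(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]], [0, 1, 0], n_classes=2
)
```

Because the attention weights sum to one, the output is a probability distribution over the classes present in the support set.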

The initial prototypical network was proposed by Snell et al. [145] based on the hypothesis that there exists an embedding space where each class can be represented by a unique prototype, and all samples are supposed to cluster around their corresponding prototypes. Figure 4 shows the architecture of prototypical network. A simple convolutional neural network with 4 layers is exploited to extract features, and the prototype of each class is defined as the mean value of feature embeddings from the support samples belonging to that class. The squared Euclidean distance is employed as a distance metric, calculating the distance between query embeddings and each class prototype. Building on this, Li et al. [86] proposed a covariance metric network (CovaMNet), using the covariance matrix of embedding vectors to represent the class prototype and applying a covariance-based metric to measure the similarity between the query sample and the class prototype. Wang and Zhai [172] proposed a prototypical Siamese network (PSN), adding a prototype module to the Siamese network to obtain high-quality prototype representations of each class.
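The prototype computation and distance-based classification can be sketched as below. The embeddings are hypothetical vectors standing in for the output of the 4-layer feature extractor, and turning distances into class probabilities with a softmax over negative squared distances is an assumption of this sketch; the text above only specifies the class mean and the squared Euclidean distance.

```python
import math


def class_prototypes(support_embeddings, support_labels):
    """Mean embedding of the support samples belonging to each class."""
    sums, counts = {}, {}
    for emb, label in zip(support_embeddings, support_labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def proto_predict(query, prototypes):
    """Class probabilities from a softmax over negative squared distances (assumed)."""
    labels = list(prototypes)
    logits = [-squared_euclidean(query, prototypes[label]) for label in labels]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# A hypothetical 2-way 2-shot support set and one query embedding.
prototypes = class_prototypes(
    [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]], ["a", "a", "b", "b"]
)
probs = proto_predict([1.0, 1.0], prototypes)
```

The query is assigned to the class whose prototype is nearest, which is the class with the largest probability in `probs`.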

Fig. 5 Relation network architecture

Relation network [149] is the first study that employs a neural network to estimate a similarity score of feature embeddings rather than manual computation. This model consists of two main components: an embedding module and a relation module. The embedding module is composed of convolutional blocks, mapping input images into an embedding space; and the relation module builds on two convolutional blocks and two fully connected layers, calculating a relation score between each query and support image (or a class prototype when the number of support samples is more than one). Note that the feature embeddings of support and query images need to be concatenated together before they are fed into the relation module. The architecture of relation network is presented in Fig. 5. In order to obtain discriminative features for fine-grained image classification [35, 59, 204], the subsequent work [73] proposed a bi-similarity network (BSNet), which combines an extra cosine module with the existing similarity measure as a new relation module, generating a more compact feature space by forcing features to adapt to the new relation module.

Table 1 A summary of presented metric-based meta-learning approaches

In order to obtain optimal matching image regions, Zhang et al. [196] proposed the DeepEMD algorithm, which adopts the earth mover’s distance (EMD) [112, 127, 191] as a distance metric to calculate the similarity. They introduce a cross-reference mechanism to produce the weights of elements in the EMD formulation and embed the EMD layer into the network for end-to-end training. Motivated by this, Xie et al. [181] proposed a deep Brownian distance covariance (DeepBDC) approach, which applies the BDC metric to few-shot learning. To learn discriminative feature representations, Afrasiyabi et al. [2] proposed a mixture-based feature space learning (MixtFSL) approach, learning both the feature representations and the mixture model in an online manner. Different from few-shot classification methods that extract a single feature vector from each image, Afrasiyabi et al. [3] held the view that a set-based representation can build a richer and more robust representation of images from base classes. To do so, they proposed a matching feature sets method, which embeds self-attention modules in between convolutional blocks and introduces set-to-set metrics for evaluation. We summarize the metric-based meta-learning approaches introduced above in Table 1.

Fig. 6 Neural Turing machine scheme

3.2 Model-based meta-learning

With the goal of fast learning, model-based methods [63, 104] mainly focus on model architectures, adjusting model parameters based on presented tasks. There are several frequently used architectures in model-based methods, such as convolutional neural networks (CNNs) [71], recurrent neural networks (RNNs) [128, 134] and long short-term memory (LSTM) [54]. According to the model architecture types, these model-based methods are further separated into memory-based, rapid adaptation-based and miscellaneous models.

Memory-augmented neural network (MANN) is a well-known memory-based method proposed by Santoro et al. [131], which aims at improving task adaptation by utilizing the neural Turing machine (NTM) [21, 36, 47]. NTM is a neural network that integrates an external memory component during its learning process, enabling it to retrieve previously stored information. To be specific, NTM consists of a controller that interacts with an external memory module via a number of read and write heads. The NTM scheme is shown in Fig. 6. In MANN, a new addressing mechanism, namely least recently used access (LRUA) [131], is proposed, which writes memories to either the least used memory location or the most recently used memory location. By storing coupled representation-class label information in the external memory, MANN can retrieve it for later classification. Tran et al. [157] proposed a memory-augmented matching network (MAMN), which combines MANN and matching network. In MAMN, to reduce the bias in class prototypes caused by data distribution skew, weighted class prototypes are introduced by incorporating the distances of classwise samples. As another memory-based meta-learning method, memory matching network (MM-Net) [14] incorporates the memory module extracted from key-value memory network [96] into matching network. Different from traditional one-shot learning methods, MM-Net encodes and generalizes the whole support set into memory slots and can generate a unified model regardless of the number of shots and categories.

Fig. 7 Conditional neural processes scheme

Meta-network (MetaNet) [100] is a model designed with a specific architecture and training process for rapid adaptation across tasks. Meta-network contains a base learner, a meta-learner and an external memory. It performs generic knowledge acquisition in a meta-space and shifts its inductive biases via fast parameterization for rapid generalization. Conditionally shifted neurons (CSNs) [101] are a generic neural mechanism designed for fast adaptation, which is able to extract conditional information and generate conditional shifts for prediction during the meta-learning process. Compared with previous works [97, 100, 131], CSNs are computationally more efficient as the number of neurons is usually much smaller than that of weight parameters. Moreover, CSNs can be integrated into various neural architectures, including CNNs and RNNs. Similar to MetaNet, a CSN model contains a base learner, a meta-learner and a memory module. In the description phase, the meta-learner extracts and employs conditional information to generate memory values for samples within a task; in the prediction phase, the meta-learner generates query keys for query images with a key function in order to retrieve the value of the conditional shift.

Table 2 A summary of presented model-based meta-learning approaches

Simple neural attentive learner (SNAIL) [98] is a general model-based meta-learning architecture that incorporates temporal convolution and a soft attention mechanism. The temporal convolution acts as high-bandwidth memory access, and the soft attention enables access to specific pieces of information. This combination enables models to better leverage information from past experiences. Similar to SNAIL, Garnelo et al. [42] proposed conditional neural processes (CNPs), which consist of a meta-learner and a task learner. The meta-learner generates a memory value by aggregating representations of the support set, and the task learner makes predictions by processing the aggregated representations. Figure 7 shows the CNPs scheme. A short summary of these model-based meta-learning approaches is presented in Table 2.

3.3 Optimization-based meta-learning

Optimization-based meta-learning methods are an important branch in the field of few-shot image classification [11, 20, 37, 41, 121]. Basically, this kind of algorithm attempts to obtain a better initialization model or gradient descent direction by leveraging the meta-learning architecture, and optimizes the initialization parameters through episodic training, enabling an optimization procedure to work on a small number of training samples. Optimization-based methods generally contain a task-specific learner trained for a given task and a meta-learner trained on distributions of tasks.

In 2017, Finn et al. [38] proposed model-agnostic meta-learning (MAML), the first algorithm for learning an initialization. The key idea of MAML is to enable a model’s parameters to adapt quickly to new unseen tasks through a gradient-based learning rule. During the meta-training phase, MAML attempts to update the task-specific parameters and the global initialization jointly in an iterative manner. The MAML scheme is presented in Fig. 8. The main contribution of MAML is its compatibility with different application domains, not only classification, but also regression [133, 135, 199] and reinforcement learning [34, 51, 84]. To address the limitation of neural networks that are trained with gradient-based optimization on few-shot learning tasks [26, 143, 186], Ravi and Larochelle [120] proposed an LSTM-based meta-learner to learn both the exact task-specific optimization of a classifier and good initialization values for the parameters of the task-specific learner.
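The interplay between task-specific updates and the shared initialization in MAML can be illustrated on a deliberately small problem. The sketch below is an assumption-laden toy and not the setup of [38]: the model is a single scalar weight fitted to one-dimensional linear regression tasks, one inner gradient step is used, and the gradient through the inner step is written out analytically for this quadratic loss instead of using automatic differentiation. All names and hyperparameters are hypothetical.

```python
import random


def loss_and_grad(w, xs, ys):
    """Mean squared error of the model y = w * x and its derivative with respect to w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
    return loss, grad


def maml_step(w, tasks, inner_lr, outer_lr):
    """One meta-update of the shared initialization w over a batch of tasks."""
    meta_grad = 0.0
    for (xs_s, ys_s), (xs_q, ys_q) in tasks:
        # Inner loop: one gradient step on the support set gives task-specific parameters.
        _, g_support = loss_and_grad(w, xs_s, ys_s)
        w_task = w - inner_lr * g_support
        # Outer loop: differentiate the query loss at w_task with respect to w.
        # For this quadratic loss, d(w_task)/dw = 1 - inner_lr * 2 * mean(x^2).
        _, g_query = loss_and_grad(w_task, xs_q, ys_q)
        curvature = 2 * sum(x * x for x in xs_s) / len(xs_s)
        meta_grad += g_query * (1 - inner_lr * curvature)
    return w - outer_lr * meta_grad / len(tasks)


def sample_task(rng, k=5):
    """A toy task: fit y = slope * x from k support points, with k query points."""
    slope = rng.uniform(1.0, 3.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return xs, [slope * x for x in xs]

    return draw(), draw()


rng = random.Random(0)
w = 0.0  # shared initialization to be meta-learned
for _ in range(500):
    batch = [sample_task(rng) for _ in range(4)]
    w = maml_step(w, batch, inner_lr=0.1, outer_lr=0.05)
```

With task slopes drawn from the interval [1, 3], the learned initialization is expected to settle near the middle of that range, from which a single inner step moves it toward the slope of any individual task.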

Fig. 8 Model-agnostic meta-learning scheme

Fig. 9 Latent embedding optimization scheme

By taking ideas from prototypical network and MAML, Triantafillou et al. [158] proposed Proto-MAML, incorporating the advantages of both the former’s simple inductive bias and the latter’s flexible adaptation mechanism. As an extension to MAML, CAVIA [206] divides the model parameters into task-specific context parameters and parameters that are shared across tasks. Compared with MAML, CAVIA is less prone to meta-overfitting and easier to parallelize. To address the issue that meta-learning models can be too biased toward existing tasks, which leads to poor generalization, Jamal and Qi [60] proposed a task-agnostic meta-learning (TAML) algorithm, where two approaches are exploited to train a model that is unbiased over tasks. In order to improve generalization performance, Baik et al. [8] proposed a novel framework called meta-learning with task-adaptive loss function (MeTAL). In particular, MeTAL learns a task-adaptive loss function through two meta-learners and can be applied to different MAML variants.

Wang et al. [171] introduced a new approach called task-aware feature embeddings for low-shot learning (TAFE-Net), which mainly concentrates on tuning task-specific feature embeddings through the generic embedding of a meta-learner. TAFE-Net is composed of a meta-learner and a prediction network, where the task-aware feature embedding is obtained by utilizing the meta-learner to develop task-specific feature layers of the prediction network. Sun et al. [152] introduced a meta-transfer learning (MTL) method, which focuses on generating task-specific feature extractors by leveraging both meta-learning and transfer learning. In MTL, scaling and shifting operations are introduced on top of the pre-trained feature extractor, which is kept frozen. Besides, fine-tuning steps similar to those in previous work [18] are taken in MTL. This work also proposed a novel hard task meta-batch process that puts more focus on hard tasks by sampling extra instances from the classes on which the classifier failed.

Considering difficulties that exist in optimization on high-dimensional parameter spaces such as those faced by MAML [38], Rusu et al. [130] proposed an innovative algorithm called latent embedding optimization (LEO) that learns a low-dimensional latent representation of model parameters and performs optimization-based meta-learning in this space. Similar to MAML, LEO also consists of an inner loop training where the task-specific values are learned and an outer loop training where global shared initializations are updated. To instantiate low-dimensional latent embedding of model’s parameters, samples pass through a combination of an encoder and a relation network. The encoder is used to generate hidden codes from the support set. Then, these hidden codes are concatenated pairwise and fed into a relation network, leading to a probability distribution over latent codes in a lower dimension. Finally, the decoder produces task-specific initial parameters which are differentiable to backpropagate for adaptation. The LEO scheme is shown in Fig. 9. We present a short summary of optimization-based meta-learning approaches in Table 3.

Table 3 A summary of presented optimization-based meta-learning approaches

3.4 Other methods

Transfer learning involves leveraging knowledge learned from a related task to enhance learning in a new task [52, 125, 126, 187, 189]. In the few-shot image classification scenario, transferring knowledge from another network is a viable option when the original data is too limited to train a deep neural network from scratch. Compared with meta-learning, the learning experience involved in transfer learning is much narrower. To address few-shot hyperspectral image classification problems, Qu et al. [118] applied the transfer learning scheme to extract learned intrinsic representations from the same kind of objects in different domains. Tai et al. [154] proposed a novel few-shot transfer learning approach for synthetic aperture radar image classification, which uses a connection-free attention module to transfer features from a source network to a target network. Sun and Yang [147] proposed trans-transfer learning, a two-phase learning method for few-shot fine-grained visual categorization problems. In some cases, knowledge transfer may fail when the source domain and target domain are not related to each other, and may even cause negative transfer. To address this problem, Liu et al. [83] proposed analogical transfer learning (ATL), which follows an analogy strategy to effectively control the occurrence of negative transfer.

Table 4 Accuracy results on Omniglot dataset reported in original papers, with mean accuracy (%) and 95% confidence interval. i: metric-based; ii: model-based; iii: optimization-based

Table 5 Accuracy results on MiniImageNet and TieredImageNet datasets reported in original papers, with mean accuracy (%) and 95% confidence interval. i: metric-based; ii: model-based; iii: optimization-based

Considering the fundamental problem in few-shot image classification that models are prone to overfitting because of the few training samples, researchers have proposed a number of data augmentation approaches [108, 117, 174] to improve sample diversity and prevent overfitting during training. Goodfellow et al. [46] proposed the well-known Generative Adversarial Nets (GAN), which contain a generator for generating similar images and a discriminator for distinguishing generated images from real ones. Based on GAN, Mehrotra and Dukkipati [95] proposed to generate samples for specific tasks, making these generated samples more suitable for few-shot learning. Zhang et al. [197] proposed MetaGAN. To help the classifier learn a clearer decision boundary, MetaGAN involves GAN and part of the classification network during the training process. Li et al. [87] proposed Adversarial Feature Hallucination Network (AFHN), using a conditional Wasserstein Generative Adversarial Network (cWGAN) to generate samples.

We present experimental results of recent meta-learning methods in Tables 4 and 5. Table 4 shows the performance of different approaches on Omniglot. Omniglot is a handwritten dataset with multiple handwriting styles, languages and stroke types; this diversity makes Omniglot suitable for training deep learning algorithms. Table 4 shows that most meta-learning approaches obtain over 98% accuracy on Omniglot. Table 5 shows experimental results on MiniImageNet and TieredImageNet. These two datasets contain images with different objects, scenes and lighting conditions, which can improve the model’s robustness. However, the limitations in dataset size and image quality may affect the model’s performance. Table 5 shows that DeepBDC [181] and matching feature sets [3] achieved the best results on both datasets.

4 Major challenges and future directions

Although meta-learning methods have achieved promising performance in few-shot image classification, there remain some vital challenges that ought to be dealt with in the future. These existing issues and suggested future research directions are outlined here.

4.1 Limitations and challenges

  • Data availability and computational complexity. In image classification, a large dataset typically has a thousand or more categories. Meta-learning approaches also require a large amount of data and computational resources, but in few-shot scenarios it is quite challenging to collect sufficient data. Thorough testing of meta-learning may require thousands of large datasets, which would be very difficult and slow to process.

  • Model selection. There is no one-size-fits-all model, so selecting an appropriate one is important. Model selection is even more crucial in few-shot image classification because the model is prone to overfitting the training data: it may perform well on the base set yet generalize poorly to new tasks.

  • Transferability. Meta-learning models can transfer learned knowledge between tasks, but the success of this transfer depends on the similarity between the tasks. When new tasks differ significantly from old ones, as in cross-domain settings, it is difficult to transfer learned knowledge effectively.

  • Task dependence. Most meta-learning approaches are designed to work for a specific set of tasks or domains. They may not perform well on new tasks or domains that differ significantly from those used during training. Improving the generalization ability of meta-learning remains a hard problem.

  • Interpretability. Interpretability is a critical aspect of neural approaches that refers to the ability to understand how a model works. Unfortunately, neural approaches can be extremely challenging to interpret, so it is difficult to understand how a meta-learning model learns to learn and how it makes predictions or decisions. This makes it arduous to debug, diagnose and improve model performance.

4.2 Future directions

  • Enhancing generalized feature learning. To address the main challenge in few-shot learning, namely learning from a handful of samples [81, 146, 182], meta-learning applies shared knowledge from previously experienced tasks to unseen tasks. However, most existing meta-learning methods attempt to learn discriminative features via attention mechanisms, multitask learning, data augmentation and so on. One major research direction is to develop new approaches for learning features that generalize better to new domains, along with evaluation measures for assessing and selecting the learned features.

  • Practice of episodic training strategy. To achieve fast adaptation to new tasks with limited samples, episodic training requires each training episode to have the same number of classes and examples as the evaluation episode. However, this setting is prone to catastrophic forgetting [31, 138, 176] and leads to model underfitting on base classes. A number of approaches have been proposed to address this issue, and improving model performance on both base and novel classes remains a vital direction for future work.

  • Improving stability. Despite the continuous improvement of meta-learning in few-shot image classification, some meta-learning methods obtain state-of-the-art performance on specific datasets but do not perform well on other benchmarks. For example, a metric-based meta-learning method named global class representation (GCR) [78] achieved great performance on Omniglot, but cannot compete with non-metric-based methods on miniImageNet. Further exploration of stable models [25, 66] will be very valuable.

  • Cross-domain and multimodal meta-learning. In principle, the base dataset \({D}_{{base}}\) and novel dataset \({D}_{{novel}}\) in few-shot learning can be from different domains [77, 150]. However, the performance of most models declines when the difference between \({D}_{{base}}\) and \({D}_{{novel}}\) is large. Developing meta-learning methods with better cross-domain performance is one future research direction. Multimodal deep learning has also brought great opportunities to few-shot learning [53, 99, 109]. For example, Peng et al. [113] proposed a Knowledge Transfer Network (KTN), which combines semantic features and image features for few-shot image classification tasks. Designing more appropriate multimodal fusion methods is therefore a research trend in few-shot image classification.
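
The episodic setting discussed in the list above, where every training episode has the same number of classes and examples as an evaluation episode, can be illustrated with a minimal sketch of N-way K-shot episode sampling. This is an assumption-laden illustration rather than the procedure of any particular method surveyed here: the function name `sample_episode`, its parameters (`n_way`, `k_shot`, `n_query`) and the toy label list are all hypothetical, and only example indices are sampled, not images.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode as lists of (example_index, episode_label).

    labels[i] is the class of example i. All names here are hypothetical.
    """
    if rng is None:
        rng = random.Random()

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # A class is usable only if it can fill both its support and query slots.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    support, query = [], []
    # Sampled classes are relabeled 0..n_way-1 within the episode.
    for episode_label, cls in enumerate(rng.sample(eligible, n_way)):
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query


# Toy base set (assumption): 20 classes with 30 examples each, labels only.
toy_labels = [c for c in range(20) for _ in range(30)]
support, query = sample_episode(toy_labels, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
print(len(support), len(query))  # 5 75
```

Calling the function repeatedly with the evaluation-time values of `n_way` and `k_shot` yields training episodes that match the evaluation protocol. Because each episode sees only a small subset of base classes, the sketch also makes it easier to see why base-class performance can suffer under this setting.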

5 Conclusions

This paper presents a survey of over 200 papers on recent few-shot learning and meta-learning research for image understanding. Based on this literature, we introduce the general approaches to few-shot learning and then turn to one of the key approaches, meta-learning. We separate existing meta-learning methods into three important categories: metric-based, model-based and optimization-based methods. We introduce both classical and state-of-the-art approaches in each category and summarize the state of the art. We also present the state-of-the-art performance of these approaches on well-known datasets. Based on our study, we conclude with the limitations, challenges and weaknesses of meta-learning and present promising directions for meta-learning from the perspectives of generalization, effectiveness and applicability.