This part reports the findings of our extensive literature review of works dealing with the data-efficiency issue of learners. Different perceptions of the problem lead to different ways of solving it. Based on the study of the related body of research, we distilled four main strategies to alleviate the data hungriness of algorithms. Each strategy spans its own spectrum, and together they shape the advances in this research landscape. Figure 5 categorizes existing techniques into a unified taxonomy and organizes them under the umbrella of each strategy. We devote a section to each strategy. First, we point out research exploring learning algorithms that go beyond the realm of supervised learning. Second, we review relevant techniques to artificially enlarge the training dataset. Third, we overview the different forms that learning from previous experiences can take. Finally, we introduce a new research direction that aims to conceive innovative hybrid systems that combine high predictive performance, explainability, and data efficiency.
Non-supervised learning paradigms
When talking about data hungriness in ML, we are mostly referring to supervised learning algorithms, since it is this type of learning that has the most voracious appetite for data. Supervised methods need labeled data to build classification and regression models, and the performance of these models relies heavily on the size of the labeled training data available. One straightforward strategy to alleviate this data dependency is therefore to use other learning paradigms: paradigms that do not require pre-existing data and can generate their own by interacting with their environment (i.e., reinforcement learning), paradigms that need only a small set of labeled data (i.e., semi-supervised learning), or paradigms that learn from raw unlabeled data (i.e., unsupervised learning). In this section, we scan recent methods in the literature that involve these non-supervised learning paradigms.
Semi‐supervised learning methods
The wide availability of unlabeled data in several real-world scenarios and, at the same time, the lack of labeled data have naturally resulted in the development of semi-supervised learning (SSL). SSL is an extension of supervised learning that uses unlabeled data in conjunction with labeled data for better learning. SSL can also be viewed as unsupervised learning with some additional labeled data. Accordingly, SSL may refer either to semi-supervised classification, where unlabeled data are used for regularization purposes under particular distributional assumptions to enhance supervised classification, or to semi-supervised clustering, where labeled data are used to define constraints that yield better-defined clusters than those obtained from unlabeled data alone. In the literature, most attention has been paid to the methods of these two groups. Relatively fewer studies deal with other supervised/unsupervised problems such as semi-supervised regression and semi-supervised dimensionality reduction. Depending on the nature of the training function, SSL methods are commonly divided into two settings: inductive and transductive. Given a training dataset, inductive SSL attempts to predict the labels of unseen future data, while transductive SSL attempts to predict the labels of the unlabeled instances taken from the training set. A broad variety of SSL methods have been proposed in the two settings. These methods differ in how they make use of unlabeled data and in the way they relate to supervised algorithms. Next, we review the three most dominant families of methods, namely: (i) self-labeled methods, (ii) graph-based methods, and (iii) extended supervised methods.
Self-labeled techniques are used to solve classification tasks; they aim to obtain enlarged labeled data by assigning labels to unlabeled data using their own predictions. As a general pattern, one or more supervised base learners are iteratively trained on the original labeled data together with previously unlabeled data that have been labeled with predictions from earlier iterations of the learners. The latter are commonly referred to as pseudo-labeled data. The main advantage of this iterative SSL approach is that it can be “wrapped” around any supervised learner.
The basic iterative schema for self-labeled techniques is self-training. It consists of a single supervised classifier that is iteratively trained on both labeled data and data that have been pseudo-labeled in previous iterations of the algorithm. Tanha et al. discussed the choice of the base learner and stated that, for self-training to succeed, the most important property of the learner is that it correctly estimates the confidence of its predictions. They experimentally showed that using an ensemble learner as the base learner gives an extra improvement over basic decision tree learners. Livieris et al. proposed an algorithm that dynamically selects the most promising base learner from a pool of classifiers based on the number of most confident predictions on unlabeled data. Li and Zhou addressed the issue of erroneous initial predictions, which can lead to the generation of incorrectly labeled data. They presented the SETRED method, which incorporates data editing into the self-training framework in order to actively learn from the self-labeled examples.
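The self-training loop described above can be illustrated with a minimal sketch. It is not the method of any of the cited works: the nearest-centroid base learner, the softmax-over-distances confidence score, and the names `centroid_fit`, `centroid_predict`, `self_train`, and `threshold` are all assumptions chosen only to keep the example self-contained.

```python
import math


def centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def centroid_predict(centroids, x):
    """Return (label, confidence).

    Confidence is a softmax over negative distances to the centroids
    (an illustrative choice; it assumes moderate distances so that the
    exponentials do not all underflow to zero).
    """
    weights = {c: math.exp(-math.dist(x, m)) for c, m in centroids.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total


def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Iteratively move confidently predicted points into the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        model = centroid_fit(X_lab, y_lab)
        remaining, added = [], False
        for x in pool:
            label, conf = centroid_predict(model, x)
            if conf >= threshold:
                # Pseudo-label the point and treat it as labeled from now on.
                X_lab.append(x)
                y_lab.append(label)
                added = True
            else:
                remaining.append(x)
        pool = remaining
        if not added or not pool:
            break
    # Final model trained on labeled + pseudo-labeled data; pool holds
    # the points that never reached the confidence threshold.
    return centroid_fit(X_lab, y_lab), pool
```

Because the loop only calls a fit function and a predict-with-confidence function, any supervised learner exposing those two operations could be substituted, which is the "wrapper" property noted above. The quality of the confidence estimate directly controls how many wrong pseudo-labels enter the training set.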
Co-training is a variant of the self-training schema that uses multiple supervised classifiers. Considered a special case of multiview learning, the co-training schema assumes that dividing the feature space into two separate views makes it more effective to predict the labels of the unlabeled data at each iteration. In the work of Didaci et al., the relation between the performance of co-training and the size of the labeled training set was examined, and the results showed that high performance was achieved even when the algorithm was provided with very few instances per class. Jiang et al. introduced a hybrid method that combines the predictions of a generative classifier (Naive Bayes) and a discriminative classifier (Support Vector Machine) to take advantage of both. The final prediction is governed by a parameter that weights the two classifiers. Their experimental results showed that the method performs much better when the amount of labeled data is small. Qiao et al. proposed a deep co-training method that trains multiple deep neural networks (DNN) to act as the different views and exploits adversarial examples to encourage view difference, in order to prevent the networks from collapsing into each other. As a result, the co-trained networks provide different and complementary information about the data.
Transductive graph-based methods typically define a graph over all data points, both labeled and unlabeled: the nodes of the graph represent the labeled and unlabeled samples, whereas the edges encode the similarities among them. Common graph-based SSL methods follow a two-stage process: (i) constructing a graph from the samples and then (ii) propagating the partial labels through the graph to infer the unknown labels. Initial research on graph-based methods focused on the inference phase. Pang and Lee approached inference from a min-cut perspective and used the min-cut approach for classification in the context of sentiment analysis. Other works approached the graph-based inference phase from the perspective of Markov random fields and Gaussian random fields. The construction of the graph, in turn, involves two stages: the first builds the graph adjacency matrix, and the second computes the graph weights. Blum and Chawla experimented with graph construction using k-nearest neighbor and ε-nearest neighbor approaches. The latter simply connects each node to all nodes whose distance from it is at most ε. The most used functions for computing graph weights are the Gaussian similarity function and the inverse Euclidean distance. We note that although graph-based methods are typically transductive, inductive graph-based methods also exist in the literature; this line of work encompasses approaches that use the intrinsic relationships among both labeled and unlabeled samples to construct the graph and estimate a function. However, it is generally acknowledged that transductive graph-based methods usually perform better than inductive ones. Another line of work that has recently received much attention is scalable graph-based SSL. A commonly used approach to cope with the scalability issue is anchor graph regularization. This model builds a regularization framework by exploring the underlying structure of the whole dataset with both data points and anchors. Liu et al. provided a complete overview of approaches for making graph-based methods more scalable.
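The two-stage process (graph construction, then label inference) can be sketched as follows. This is an illustrative toy, not an implementation from the cited works: it assumes an ε-neighborhood adjacency, Gaussian similarity weights, and a simple iterative propagation in which labeled nodes are clamped to their known labels. The function names and the parameters `eps`, `sigma`, and `n_iter` are hypothetical.

```python
import math


def build_graph(points, eps, sigma=1.0):
    """Stage 1: connect nodes closer than eps, weighted by Gaussian similarity."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d <= eps:
                W[i][j] = W[j][i] = math.exp(-d ** 2 / (2 * sigma ** 2))
    return W


def propagate_labels(W, labels, n_classes, n_iter=100):
    """Stage 2: spread label scores along edges.

    labels[i] is a class index for labeled nodes and None for unlabeled ones.
    Returns one predicted class per node (None if no label ever reached it).
    """
    n = len(W)
    F = [[0.0] * n_classes for _ in range(n)]
    for i, y in enumerate(labels):
        if y is not None:
            F[i][y] = 1.0
    for _ in range(n_iter):
        new_F = []
        for i in range(n):
            degree = sum(W[i])
            if labels[i] is not None or degree == 0:
                new_F.append(F[i][:])  # clamp labeled nodes; skip isolated ones
                continue
            # Weighted average of the neighbours' current score vectors.
            new_F.append([
                sum(W[i][j] * F[j][c] for j in range(n)) / degree
                for c in range(n_classes)
            ])
        F = new_F
    predictions = []
    for i in range(n):
        if any(F[i]):
            predictions.append(max(range(n_classes), key=lambda c: F[i][c]))
        else:
            predictions.append(None)
    return predictions
```

The dense n × n weight matrix makes the quadratic cost of graph construction visible, which is the scalability concern that anchor-based approaches aim to address.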
Extended supervised methods
These methods are direct extensions of traditional supervised learning methods to the semi-supervised setting. The most prominent examples of this class of methods are: (i) semi-supervised support vector machine and (ii) semi-supervised neural networks.
Mainstream models of semi-supervised SVM include many variants such as S3VM, TSVM, LapSVM, meanSVM, and S3VM based on cluster kernel. The related literature presents S3VM and TSVM as the two most popular variants. The goal of S3VM is to build a classifier using both labeled and unlabeled data. Similar to the standard SVM, S3VM seeks the maximum margin that separates the labeled and unlabeled data, and the new optimal classification boundary must yield the smallest generalization error on the original unlabeled data. TSVM exploits specific iterative algorithms that gradually search for a reliable separating hyperplane (in the kernel space) through a transductive process that incorporates both labeled and unlabeled samples. Since their introduction, semi-supervised SVM models have evolved in different respects, and various approaches have been proposed to improve existing variants or to create new ones.
Recently, numerous research efforts have been made to build effective classification models using semi-supervised neural network (SSNN) methods. The hierarchical nature of representations in DNN makes them a viable candidate for semi-supervised approaches. If deeper layers in the network express increasingly abstract representations of the input sample, one can argue that unlabeled data could be used to guide the network towards more informative abstract representations. A common strategy in this line of research is to train the DNN by simultaneously optimizing a standard supervised classification loss on labeled samples and an additional unsupervised loss term imposed on either the unlabeled data or both the labeled and unlabeled data. The typical structure for such a strategy is the Ladder Network, an autoencoder structure with skip connections from the encoder to the decoder. Proposed by Rasmus et al., this model is trained to minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Prémont-Schwarz et al. extended the Ladder Network architecture to the recurrent setting by adding connections between the encoders and decoders of successive instances of the network. A related group of SSNN methods is known as teacher-student models, where a single teacher model or an ensemble of teacher models is trained to predict on unlabeled data, and the predicted labels are used to supervise the training of a student model. Thus, the teacher guides the student to approximate its performance under perturbations in the form of noise applied to the input and hidden layers of the models. The teacher in the teacher-student structure can be summarized as being generated by an exponential moving average (EMA) of the student model. Various ways of applying the EMA lead to a variety of methods in this category. In the VAT Model and the Π Model, the teacher shares the same weights as the student, which is equivalent to setting the averaging coefficient to zero. The Temporal Model is similar to the Π Model except that it also applies an EMA to accumulate the historical predictions. The Mean Teacher applies an EMA to the student to obtain an ensemble teacher. There are other types of SSNN methods that are based on generative models. The primary goal of these methods is to model the process that generates new data; this technique is reviewed in the “Data Augmentation” section.
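The role of the averaging coefficient can be made concrete with a small sketch. It assumes that model weights are flat lists of floats and that the unsupervised term is a mean squared difference between student and teacher predictions; both are simplifications for illustration, and the names `ema_update`, `consistency_loss`, and `alpha` are hypothetical.

```python
def ema_update(teacher, student, alpha):
    """Teacher weights as an exponential moving average of student weights.

    teacher <- alpha * teacher + (1 - alpha) * student
    With alpha = 0 the teacher simply copies the student (shared weights);
    with alpha close to 1 the teacher changes slowly and acts as a
    temporal ensemble of past students.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between the two prediction vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


# Usage: one EMA step after a (hypothetical) gradient update of the student.
teacher_weights = [1.0, 2.0]
student_weights = [3.0, 4.0]
teacher_weights = ema_update(teacher_weights, student_weights, alpha=0.5)
```

In a full training loop the consistency term would be added to the supervised loss on the labeled samples, so that the student is penalized whenever its predictions on perturbed inputs drift away from the teacher's.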
Unsupervised representation learning methods
The limited performance of data-hungry models when only a small amount of labeled data is available for training has led to an increasing interest in the literature in learning feature representations in an unsupervised fashion to solve learning tasks with insufficient labeled data. Unsupervised representation learning encompasses a group of methods that make use of unlabeled data to learn a representation function f such that replacing a data point x by the feature vector f(x) in new classification tasks reduces the requirement for labeled data. Such learners seek to learn representations that are sufficiently generalizable to adapt to various learning tasks in the future. In this case, the representations learned by unsupervised methods are usually assessed based on the performance of downstream classification tasks built on top of these representations. Thus, the focus here is not on clustering or dimensionality reduction, but rather on learning unsupervised representations. Accordingly, we review in this subsection the recent progress and the most representative efforts on unsupervised representation learning methods. Generally, three groups of research fall under the umbrella of methods for training unsupervised representations, namely: (i) Transformation-Equivariant Representations, (ii) Self-supervised methods, and (iii) Generative Models.
The learning of Transformation-Equivariant Representations (TERs) was introduced by Hinton et al. as the key idea behind training capsule nets and has played a critical role in the success of Convolutional Neural Networks (CNNs). It has been formalized afterward in various ways. Basically, TER learning seeks to model representations that equivary with various transformations of images by encoding their intrinsic visual structures. The subsequent problems of recognizing unseen visual concepts can then be solved on top of the trained TER in an unsupervised fashion. Along this line of research, Group-Equivariant Convolutions (GEC) have been proposed, which directly train feature maps as a function of different transformation groups. The resulting feature maps are proved to equivary exactly with the designated transformations. However, GEC have a restricted form of feature maps as a function of the considered transformation group, which limits the flexibility of the representation in many applications. Recently, Zhang et al. proposed Auto-Encoding Transformations (AET). This form of TER offers more flexibility in enforcing transformation equivariance by maximizing the dependency between the resulting representations and the chosen transformations. Qi et al. later proposed an alternative Auto-encoding Variational Transformation (AVT) model that reveals the connection between the transformations and the representations by maximizing their mutual information.
Self-supervision is a form of unsupervised learning where the data provide the supervision. Broadly speaking, self-supervised learning converts an unsupervised learning problem into a supervised one by creating surrogate labels from the unlabeled dataset, potentially greatly reducing the number of labeled examples required. Currently, there are several techniques to achieve that, including autoregressive models such as PixelRNN, PixelCNN, and Transformer. These methods are trained by predicting context, missing data, or future data. They can generate useful unsupervised representations, since the contexts from which the unseen parts of the data are predicted often depend on the same shared latent representations. Generative models can also be considered self-supervised, but with different goals: generative models focus on creating diverse and realistic data, while self-supervised representation learning cares about producing good features that are generally helpful for many tasks.
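How surrogate labels can be derived from unlabeled data is easiest to see on a toy sequence. The sketch below builds two kinds of supervised training pairs from an unlabeled token list: predicting the next element from its context (the autoregressive case) and predicting a masked element from the rest. The function names, the `context_size` parameter, and the `"<mask>"` placeholder are assumptions made for illustration only.

```python
def next_token_pairs(sequence, context_size):
    """Surrogate task 1: predict element i from the context_size elements before it."""
    return [
        (tuple(sequence[i - context_size:i]), sequence[i])
        for i in range(context_size, len(sequence))
    ]


def masked_pairs(sequence, mask_token="<mask>"):
    """Surrogate task 2: hide one element at a time and predict it from the rest."""
    pairs = []
    for i, target in enumerate(sequence):
        masked = list(sequence)
        masked[i] = mask_token
        pairs.append((tuple(masked), target))
    return pairs


# Usage: every pair is an (input, label) example obtained without human labeling.
unlabeled = ["the", "cat", "sat", "down"]
autoregressive_examples = next_token_pairs(unlabeled, context_size=2)
masked_examples = masked_pairs(unlabeled)
```

Any supervised learner can then be trained on these pairs; the representation it learns along the way is what is reused for downstream tasks.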
As in SSL, Auto-Encoders, Generative Adversarial Nets (GAN), and many other generative models have been widely studied in unsupervised learning problems, where compact representations can be learned to characterize the generative process of unlabeled data. Trained in an unsupervised fashion, such models essentially aim at generating more data; this is why, as mentioned before, generative models are reviewed under the “Data Augmentation” strategy.
Another learning paradigm that has driven impressive advances in recent years without the need for large amounts of real-world data is Reinforcement Learning (RL).
RL is one step more data-efficient than supervised learning. In supervised learning, the learner learns from a labeled dataset with guidance, whereas an RL agent interacts with its environment, performs actions, and learns by a self-guided trial-and-error method. In other words, in the absence of a training dataset, the RL agent is bound to learn from its own experience. Seen from this perspective, RL algorithms can be viewed as a data-optimized alternative to supervised learning algorithms, since the sample complexity does not depend on pre-existing data, but rather on the actions the agent takes and the dynamics of the environment.
One of the remarkable achievements of this learning paradigm is AlphaGo Zero, which was given absolutely no prior data other than the rules of the game. With no other input, simply by playing against itself, AlphaGo Zero learned the game of Go better than any human or machine ever had. Another example is PILCO (Probabilistic Inference for Learning Control), a model-based policy search method that propagates uncertainty through time for long-term planning and learns the parameters of a feedback policy by means of gradient-based policy search. It achieved unprecedented data efficiency for learning control policies from scratch (it requires only about 20 trials, i.e., about 30 s of experience) and is directly applicable to physical systems, e.g., robots.
Following the taxonomy of Arulkumaran et al., two main RL approaches can be distinguished: (i) methods based on value functions, which estimate the value (expected return) of being in a given state. This approach forms the foundation of the state-action-reward-state-action (SARSA) algorithm and of Q-learning, the most commonly used RL algorithms; and (ii) methods based on policy search, which do not need to maintain a value function model but directly search for an optimal policy. There is also a hybrid, actor-critic approach, which employs both value functions and policy search. Between the two approaches, policy-based methods are known to be significantly more sample-efficient because they reuse data more effectively. For instance, Guided Policy Search is very data-efficient, as it uses trajectory optimization to direct policy learning and avoid poor local optima.
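For the value-function family, the difference between SARSA and Q-learning lies in a single term of the tabular update rule, as the following sketch shows. The dictionary-based Q-table, the default value of 0.0 for unseen state-action pairs, and the parameter names `alpha` (learning rate) and `gamma` (discount factor) are conventional choices assumed here for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in s_next."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action a_next actually taken in s_next."""
    old = Q.get((s, a), 0.0)
    next_value = Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * next_value - old)


# Usage: the same transition (s0, right, reward 1, s1) under both rules.
Q_off = {("s1", "left"): 1.0, ("s1", "right"): 2.0}
Q_on = dict(Q_off)
q_learning_update(Q_off, "s0", "right", 1.0, "s1", ["left", "right"], alpha=0.5, gamma=0.5)
sarsa_update(Q_on, "s0", "right", 1.0, "s1", "left", alpha=0.5, gamma=0.5)
```

In both cases the training signal comes entirely from the transitions the agent experiences, which is why sample complexity is governed by the number of interactions rather than by a pre-existing dataset.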
From the model perspective, RL algorithms can be categorized as (i) model-based or (ii) model-free, depending on whether the agent has access to, or learns, a model of the environment. Having a model at hand allows the agent to plan ahead by predicting state transitions and future rewards. Thus, if the model is correct, learning benefits greatly in terms of sample efficiency compared to model-free methods. Hence, model-based algorithms are taking the lead in terms of data efficiency, as they try to derive a model of the environment and use that model to train the policy instead of data from real interactions (e.g., PILCO).
Contemporary deep reinforcement learning (DRL) has led to tremendous advancements, but it has also inherited shortcomings from the current generation of deep learning techniques, which have turned the paradigm of trial-and-error learning into a data-hungry model. Indeed, the combination requires a huge amount of experience before becoming useful; it is even claimed that the hunger of DRL for data is greater than that of supervised learning. This is why, although DRL can potentially produce very complex and rich models, simpler and more data-efficient methods are sometimes preferable.
In fact, DRL excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, such as game settings. However, training a DRL model in environments with limited interaction, such as production-scale, healthcare, or recommender systems, is challenging because interaction is expensive and the budget at deployment is limited. The recent wave of DRL research has tried to address this issue. For instance, Botvinick et al. suggested in their recent work two key DRL methods to mitigate the sample efficiency problem: episodic deep RL and meta-RL. Buckman et al. proposed stochastic ensemble value expansion (STEVE), which combines deep model-free and deep model-based approaches in RL in order to achieve the high performance of model-free algorithms with the low sample complexity of model-based algorithms. To reduce the number of system interactions while simultaneously handling constraints, Kamthe et al. introduced a model-based DRL framework based on probabilistic Model Predictive Control (MPC) with transition models learned using Gaussian processes. The proposed approach requires on average only six trials (18 s). Popov et al. introduced two extensions to the Deep Deterministic Policy Gradient (DDPG) algorithm for data-efficient DRL. They showed that decoupling the frequency of network updates from the environment interaction substantially improves data efficiency. In a recent work, Schwarzer et al. proposed Self-Predictive Representations (SPR), a method that makes use of self-supervised techniques along with data augmentation to train DRL in environments with limited interaction. The model achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents, according to the authors, a 55% relative improvement over the previous state of the art.
Ultimately, unlabeled data are expected to be a game-changer for AI to move beyond supervised, data-hungry models. While introducing his most recent research, “SimCLR”, a framework for contrastive learning of visual representations that has achieved a tremendous performance leap in image recognition using unsupervised learning, AI pioneer Geoff Hinton stated at the AAAI 2020 Conference that “unsupervised learning is the right thing to do”. Appearing on the same AAAI stage, Turing Award winner Yann LeCun agreed that unsupervised learning, semi-supervised learning, or any model training that does not require manual data labeling are vital tools for the progress of ML and its applications. The literature is flourishing with a broad variety of semi-supervised and unsupervised algorithms (Fig. 6; Table 1 summarizes the key discussed methods). As a matter of fact, both lines of research have recently focused strongly on DNN, particularly deep generative models, which have been extensively used for self-supervision and have also been extended to the semi-supervised setting. However, despite the success of these methods, a considerable number of empirical studies reveal that exploiting unlabeled data might deteriorate learning performance. The potential performance degradation caused by the introduction of unlabeled data is one of the most important issues to be resolved, especially in SSL. Furthermore, we noted that the evaluation aspect has received relatively little attention in the literature. Pragmatic baselines for empirically evaluating the performance of non-supervised learning methods, in order to choose an approach that is well suited to a given situation, are relatively rare. Recently, Oliver et al. established a set of guidelines for the realistic evaluation of SSL algorithms. In turn, Palacio-Niño et al. have proposed evaluation metrics for unsupervised learning algorithms.
In recent works, there has been a notable shift towards the automatic selection and configuration of learning algorithms for a given problem. However, while the automation of the ML pipeline has been successfully applied to supervised learning, this technique is yet to be extended to non-supervised settings.
To fight the data scarcity problem and to increase generalization, the literature suggests the use of Data Augmentation (DA) techniques. DA entails a set of methods that apply mutations to the original training data and synthetically create new samples. It is routinely used in classification problems to reduce the “overfitting” caused by limited training data. Indeed, when a model is trained with a small training set, it tends to fit the training samples too closely, which results in poor generalization. DA acts as a regularizer to combat this. Considered more and more as a vital and ubiquitous data processing step in modern ML pipelines, DA has become a subject of great interest in both academic and industrial settings. Contributions in this field are actively growing, and new DA techniques emerge on a regular basis. Being unable to cover all existing techniques, we instead propose, based on the studied literature, a classification of existing augmentation strategies hinging on four aspects: (i) Whether the mutation/transformation is handcrafted or smart (learning-based); accordingly, we distinguish between basic and generative augmentations. (ii) Whether the augmentation is performed in the data space or the feature space; accordingly, we distinguish between data-space and feature-space augmentations. (iii) Whether the data to be augmented are acquired or come from another similar dataset; accordingly, we distinguish between in situ and borrowed augmentations. (iv) Whether the data to be augmented are labeled or unlabeled; accordingly, we distinguish between supervised and unsupervised augmentations. In the following, we briefly introduce the main methods and review the works that made the biggest impact in each class of augmentation.
Basic vs generative augmentations
The most popular and basic augmentation schema consists of traditional transformations. The aim of this class of methods is to preserve the label of the data through simple transformations that can occur in realistic data. For image augmentation, for example, this can be achieved by performing geometric transformations (such as random flipping, cropping, translation, rotation…) or by changing color, brightness, or contrast (Fig. 7). Intuitively, a human observer can still recognize the semantic information in the transformed image, while the learner perceives it as new data. The manipulations applied to ImageNet remain the standard for this class of techniques. The corresponding model has been used extensively for various purposes since its development, and a vast amount of research has used it either as a benchmark or as a base model for testing new transformations. On the other hand, the MNIST (handwritten digit) dataset is commonly augmented using elastic distortions, another transformation technique that mimics the variations in pen stroke caused by uncontrollable hand muscle oscillations. Yaeger et al. also used the same technique for balancing class frequencies by producing augmentations for under-represented classes. Mixing paired samples, proposed by Inoue et al., is another basic augmentation technique for image classification tasks, which creates a new image from an original one by overlaying another image randomly picked from the training set. Zhong et al. introduced random erasing as a means to make models more robust to occlusion by randomly erasing rectangular regions of the input image. Generally, basic augmentations have proven to be fast, reproducible, reliable, and easy to implement [122, 129]. However, they rely on simple transformation functions, which in some specific cases could result in further overfitting. This has prompted further investigation into more advanced and powerful DA techniques that include learning algorithms in the augmentation process.
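Three of the basic transformations mentioned above (flipping, random erasing, and overlaying a second image) can be sketched on tiny grayscale images stored as nested lists. This is a simplified illustration rather than the published procedures: the erased rectangle has a fixed size and a constant fill value, the overlay is a plain pixel-wise average, and all function and parameter names are hypothetical.

```python
import random


def horizontal_flip(img):
    """Mirror each row; the label of the image is assumed to be unchanged."""
    return [row[::-1] for row in img]


def random_erase(img, height, width, rng, fill=0):
    """Overwrite a randomly placed height x width rectangle with a constant."""
    rows, cols = len(img), len(img[0])
    top = rng.randrange(rows - height + 1)
    left = rng.randrange(cols - width + 1)
    out = [row[:] for row in img]  # copy so the original image is untouched
    for r in range(top, top + height):
        for c in range(left, left + width):
            out[r][c] = fill
    return out


def pair_samples(img_a, img_b):
    """Overlay two same-sized images by averaging their pixel intensities."""
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


# Usage: one original image yields several new training samples.
rng = random.Random(0)  # seeded generator so the augmentation is reproducible
image = [[1, 2, 3], [4, 5, 6]]
augmented = [
    horizontal_flip(image),
    random_erase(image, height=1, width=2, rng=rng),
    pair_samples(image, horizontal_flip(image)),
]
```

Passing an explicit seeded generator is one way to obtain the reproducibility that this class of augmentations is credited with.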
Motivated by the recent advances in generative models, especially adversarial learning, Generative Adversarial Networks (GAN) have been increasingly used for generating synthetic data. In a nutshell, in GAN-based augmentation, two networks, the Generator and the Discriminator, are trained to compete with each other: the first creates new data instances (typically images), while the second evaluates them for authenticity (real or fake). This co-optimized process results in realistic synthesized data (Fig. 7). The result obtained using generative models differs from the one obtained by basic transformations. The latter modify real data with predefined transformation functions, while the former create new synthetic data. The synthetic data need to be different enough from the original ones for these variations to lead to a better generalization capacity, in contrast to basic augmentation techniques, which are limited to minor changes to the data so as not to damage the semantic content. This makes generative augmentation similar to imagination or dreaming; it has a creative effect that makes it known for its artistic applications, but this schema also serves as a great tool for DA.
Basic GAN architectures are unable to create high-quality new samples. This is why the main contributions in GAN-based augmentation are new architectures that modify the standard GAN framework through different network architectures, loss functions, evolutionary methods, and other means to produce higher-quality additional data. One of these variants is the conditional GAN, introduced by Odena et al. in 2016 to generate data by controlling the random noise generation. Many extensions of conditional GANs have been proposed afterward. ACGAN (Auxiliary Classifier GAN) changed the GAN energy function to add the discrimination class error of the generated sample and the real sample. This variant demonstrates that a complex latent coder can boost the resolution of the generated samples. Antoniou et al. proposed DAGAN (Data Augmentation GAN), which generates synthetic data using a lower-dimensional representation of a real image. The authors train a conditional GAN on unlabeled data to generate alternative versions of a given real image. Mariani et al. proposed BAGAN (Balancing GAN) as an augmentation tool to restore balance in imbalanced datasets. The use of non-conditional GANs to augment data directly has only very recently been explored. Karras et al. used PGGAN (Progressive Growing of GAN), a stable approach to training GAN models to generate large high-quality images that involves incrementally increasing the size of the model during training. This approach has proven effective at generating high-quality synthetic faces that are startlingly realistic. The DCGAN (Deep Convolutional GAN) is one of the successful network architectures for GANs. The main contribution of the DCGAN is the use of convolutional layers in the GAN framework, which provides stable training in most cases and produces higher-resolution images.
Rather than generating additional samples, another class of innovative GAN variants attempts to translate data across domains. This consists of learning a mapping between data from a source domain (typically with many samples) and data from a similar target domain (with few samples), such as dogs to wolves. This helps to compensate for the domain with few samples using data from other related domains. In pix2pix, a conditional GAN was used to learn a mapping from an input image to an output image; pix2pix learns a conditional generative model using paired images from the source and target domains. CycleGAN (Cycle-Consistent Adversarial Networks) was proposed by Zhu et al. for image-to-image translation tasks in the absence of paired examples, through the introduction of a cycle consistency constraint. Similarly, DiscoGAN and DualGAN used an unsupervised learning approach for image-to-image translation based on unpaired data, but with different loss functions. CoGAN is a model that also works on unpaired images, using two shared-weight generators to generate images of two domains from one random noise input.
Another generative technique to synthesize data using neural networks is the so-called variational autoencoder (VAE). Originally proposed in , the VAE can be seen as a generative model that learns a parametric latent space of the input domain from which new samples can be generated. This has been mostly exploited for image generation . More recently, however, VAEs have also been used for speech enhancement  and for music sound synthesis .
As reported by many scholars [122, 145], the primary problem with generative augmentations is that it is hard to generate data other than images, and even within the image setting it is very difficult to produce high-resolution outputs. Moreover, like any ANN, GANs and VAEs require a large amount of data to train, and the resulting models can be unstable or can overfit. Thus, depending on how limited the initial dataset is, generative augmentation may not be a practical solution .
Data‐space vs feature‐space augmentation
The basic augmentations discussed above are applied to data in the input space. They are called “data warping” methods  as they generate additional samples through transformations applied in the data-space. The main challenge with such augmentation schemes is that they are often tuned manually by human experts. Hence, they are “application-dependent” (transformations are domain-specific), and they require domain expertise to validate label integrity and to ensure that the newly generated data respect valid transformations (those that would occur naturally in that domain).
On the other end of the spectrum, we have “synthetic over-sampling” methods, which create additional samples in feature-space. This class of techniques thus has the advantage of being domain-agnostic: it requires no specialized knowledge and can, therefore, be applied to many different types of problems [146, 147]. The Synthetic Minority Over-sampling Technique (SMOTE)  is a well-known feature augmentation method that handles imbalanced datasets by interpolating between a minority sample and its k nearest neighbors to form new instances. Adaptive Synthetic (ADASYN)  sampling functions in much the same way as SMOTE. The difference is that ADASYN adds a small random bias to the points after creating the samples so that they are not linearly correlated with their parents, which increases the variance in the synthetic data. The fact that image datasets are often imbalanced poses a serious challenge for DA. In the same spirit as SMOTE and ADASYN, a lot of work has emerged that focuses on restoring the balance in imbalanced image datasets while creating new samples. Milidiu et al.  proposed Seismo Flow, a flow-based generative model that creates synthetic samples to address class imbalance. Shamsolmoali et al.  introduced a GAN variation called CapsAN that handles the class imbalance problem by coalescing two concurrent methods, GANs and capsule networks. Lee et al.  showed that pre-training DNNs with semi-balanced data generated through augmentation-based over-sampling improves minority group performance.
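To make the interpolation idea behind SMOTE concrete, the following is a minimal NumPy sketch rather than a reference implementation. The function name `smote_like`, its parameters, and the toy data are assumptions introduced here for illustration; production code would normally rely on a maintained library.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))    # random minority sample
        nb = rng.choice(neighbours[base])     # one of its k nearest neighbours
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Toy minority class with four 2-D samples (hypothetical data).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, it always stays inside the region already covered by the minority class.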
Furthermore, by manipulating the vector representation of data within a learned feature space, a dataset can be augmented in a number of ways. DeVries and Taylor  discussed adding noise, interpolating, and extrapolating as useful forms of feature space augmentation, while Kumar et al.  studied six feature space DA methods to improve classification: Upsampling, Random Perturbation, Conditional Variational Autoencoder, Linear Delta, Extrapolation, and Delta-Encoder.
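The three feature-space operations mentioned above (noise, interpolation, extrapolation) can be sketched in a few lines. This is an illustrative sketch only: it assumes the feature vectors have already been produced by some encoder that is not shown, and the function names and the default factor `lam` are hypothetical.

```python
import numpy as np

def add_noise(z, scale=0.1, seed=0):
    """Perturb a feature vector with Gaussian noise."""
    rng = np.random.default_rng(seed)
    return z + rng.normal(0.0, scale, size=z.shape)

def interpolate(z, z_neighbour, lam=0.5):
    """Move z towards a neighbouring feature vector of the same class."""
    return z + lam * (z_neighbour - z)

def extrapolate(z, z_neighbour, lam=0.5):
    """Move z away from a neighbouring feature vector of the same class."""
    return z + lam * (z - z_neighbour)

z = np.array([0.0, 0.0])
z_nb = np.array([2.0, 2.0])
z_interp = interpolate(z, z_nb)    # halfway between z and its neighbour
z_extrap = extrapolate(z, z_nb)    # same distance, opposite direction
z_noisy = add_noise(z)
```

The augmented vectors would then be fed to the downstream classifier in place of, or in addition to, the original features.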
In situ augmentations vs borrowed augmentations
The common augmentation techniques described so far are self-sufficient, that is, they make use of the available small data to generate a larger dataset without the need for any external data. For this reason, we can consider them “in situ augmentations”. However, they only work under the assumption that some initial data are available in the first place. In scenarios where no primary data are available, the previously discussed techniques are not applicable. A very human-like way to tackle this issue is to ask someone to lend you what you are missing (such as borrowing salt or pepper from a neighbor or a dress from a friend). Similarly, instead of being limited to the available training data, a “borrowed augmentations” schema, if we may call it so, augments data by aggregating and adapting input-output pairs from similar but larger datasets. A typical application of this method is autonomous driving, where training data can be translated from night to day, winter to summer, or rainy to sunny conditions (Fig. 7). Basically, transforming samples from one dataset to another aims at learning the joint distribution of the two domains and finding transformations between them. This line of research addresses the problem of domain shift , also known as the dataset bias problem, i.e. a mismatch of the joint distribution of inputs between the source and target domains. An early work  that addressed the problem proposed to learn a regularized transformation, using information-theoretic metric learning, that maps data in the source domain to the target domain. This is considered one of the first studies of domain adaptation  in the context of object recognition. However, this approach requires labeled data from the target domain, as the input consists of paired similar and dissimilar points between the source and the target domain. In contrast, Gopalan et al.  proposed a domain adaptation technique for an unsupervised setting, where data from the target domain is unlabeled.
The domain shift, in this case, is captured by generating intermediate subspaces between the source and target domains and projecting both the source and target domain data onto these subspaces for recognition. Unsupervised domain adaptation has been largely investigated afterward [155,156,157,158]. Recently, it was shown that a GAN objective function can be used to learn target features indistinguishable from the source ones. Hence, most recent works on transporting data across domains are based on generative models. For instance, the aforementioned technique of image-to-image translation based on GANs is a successful example of such a schema. Other similar techniques include neural style transfer (translating images from one style to another) , text-to-image translation , audio-to-image generation , and text-to-speech synthesis . Other recent works relied on GANs to boost performance. Wang et al.  proposed Transferring GANs (TGANs), which incorporate a fine-tuning technique into GANs so that they can be trained with low-volume target data. Yamaguchi et al.  import data contained in an outer dataset to a target model by using a multi-domain learning GAN. Huang et al.  proposed AugGAN, a cross-domain adaptation network that directly benefits object detection by translating existing detection RGB data from its original domain to other scenarios. As one may note, while most works address transferring data across domains for image generation, the challenge is still only modestly explored in other domains .
Supervised vs unsupervised data augmentation
Augmentation schemas are class-preserving transformations; they rely on labeled data (supervised augmentation). However, if getting more data is hard, getting more labeled data is harder. While collecting unlabeled data is easier and cheaper, as no human effort is needed for labeling, a major issue is how to augment data without labels. Typically, the SSL and unsupervised methods discussed previously are the best candidates to address the issue. Remarkably, tackling the challenge of using unlabeled data has been the subject of relatively few works in the literature in comparison with supervised augmentation methods. In recent work, Xie et al.  showed that data augmentation can be performed on unlabeled data to significantly improve semi-supervised learning. Their model relies on a small number of labeled examples to make correct predictions for some unlabeled data, from which the label information is propagated to augmented counterparts through the consistency loss. Aside from consistency regularization, the commonly used approach for augmenting smaller labeled datasets using larger unlabeled datasets is self-training or, more generally, co-training . As discussed in the previous strategy, this type of training relies on an iterative process that uses pseudo-labels on unlabeled data to augment supervised training. With the same goal of leveraging a large amount of unlabeled data and a much smaller amount of labeled data for training, other methods have been proposed in the literature, such as Temporal Ensembling , Mean Teacher , self-paced learning , and data programming .
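A consistency loss of the kind described above can be sketched as follows. This is a simplified illustration, not the exact objective of the cited work: it assumes a classifier that outputs logits, uses the KL divergence between predictions on an unlabeled example and on its augmented version, and the function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels, eps=1e-12):
    """Supervised loss on the small labeled batch."""
    probs = softmax(logits)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + eps)))

def consistency_loss(logits_clean, logits_augmented, eps=1e-12):
    """Mean KL(p_clean || p_augmented) over a batch of unlabeled examples."""
    p = softmax(logits_clean)        # prediction on the original example
    q = softmax(logits_augmented)    # prediction on its augmented version
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def semi_supervised_loss(lab_logits, labels, unl_logits, unl_aug_logits, weight=1.0):
    """Supervised term plus weighted consistency term."""
    return cross_entropy(lab_logits, labels) + weight * consistency_loss(unl_logits, unl_aug_logits)

# Toy logits standing in for model outputs (hypothetical values).
lab_logits = np.array([[2.0, 0.1], [0.2, 1.5]])
labels = np.array([0, 1])
unl_logits = np.array([[1.0, 0.0], [0.0, 2.0]])
unl_aug_logits = np.array([[0.8, 0.3], [0.1, 1.7]])
total = semi_supervised_loss(lab_logits, labels, unl_logits, unl_aug_logits)
```

The consistency term is zero when the model predicts the same distribution for an example and its augmentation, so minimizing it pushes the label information from confident predictions onto the augmented counterparts.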
To sum up, there is no best augmentation schema; the choice of technique depends on the application scenario. When no data is available, borrowed augmentations should be considered. When a large amount of unlabeled samples exists, unsupervised augmentations are the best choice. Fig. 7 and Table 2 depict the main reviewed DA techniques. However, it is noteworthy that very few studies in the literature empirically compare the performance of the different augmentations. Wong et al.  compared data-space and feature-space augmentation and found that it was better to perform data augmentation in data-space, as long as label-preserving transforms are known. Shijie et al.  compared generative methods with some basic transformations and found that combinations of the two types of augmentation yield better performance. Indeed, combining augmentation techniques can result in massively inflated dataset sizes. However, this is not guaranteed to be advantageous: especially in very limited data settings, it could result in further overfitting . Furthermore, the classes of techniques described in this section are neither mutually exclusive nor exhaustive. That is, depending on the complexity, the space, the domain, and the data annotability on which the augmentation occurs, techniques can belong to different classes. For example, generative augmentations like CycleGAN are used to implement image-to-image translation, which is a type of borrowed augmentation. GANs have also been exploited in the context of unsupervised augmentation. For instance, Wang et al.  proposed a variant of CycleGAN (DicycleGAN) that performs an unsupervised borrowed augmentation based on a generative model.
Regardless of their number and capacities, current DA implementations remain manually designed. A key research question is then how to automatically find an effective DA schema for a given dataset by searching in a large space of candidate transformations. State-of-the-art approaches to address this problem include TANDA, a framework proposed by Ratner et al.  to learn augmentations based on a GAN architecture. AutoAugment  demonstrated state-of-the-art performance using a reinforcement learning algorithm to search for an optimal augmentation policy amongst a constrained set of transformations with miscellaneous levels of distortion. Several subsequent works, including RandAugment  and Adversarial AutoAugment , have been proposed to reduce the computational cost of AutoAugment, establishing new state-of-the-art performance on image classification benchmarks.
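One way to see why a method such as RandAugment is cheaper than a learned policy search is that it replaces the search with uniform random sampling controlled by two hyperparameters: the number of operations and a shared magnitude. The sketch below illustrates that idea on NumPy arrays. The four toy operations and all names are assumptions made here for illustration; they are not the operation set of any published method.

```python
import numpy as np

def flip(img, magnitude, rng):
    return img[:, ::-1]                                  # horizontal flip

def shift(img, magnitude, rng):
    return np.roll(img, shift=magnitude, axis=1)         # circular shift in x

def brighten(img, magnitude, rng):
    return np.clip(img + 0.05 * magnitude, 0.0, 1.0)     # brightness offset

def jitter(img, magnitude, rng):
    noise = rng.normal(0.0, 0.01 * magnitude, img.shape)
    return np.clip(img + noise, 0.0, 1.0)                # additive noise

OPS = [flip, shift, brighten, jitter]

def random_policy_augment(img, n_ops=2, magnitude=3, seed=None):
    """Apply n_ops operations drawn uniformly at random, all at one magnitude."""
    rng = np.random.default_rng(seed)
    out = img
    for idx in rng.integers(len(OPS), size=n_ops):
        out = OPS[idx](out, magnitude, rng)
    return out

image = np.linspace(0.0, 1.0, 16).reshape(4, 4)          # toy grayscale image
augmented = random_policy_augment(image, n_ops=2, magnitude=3, seed=0)
```

With only `n_ops` and `magnitude` to tune, a simple grid search over these two values can stand in for the expensive policy search.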
As noted several times before, DA has essentially been used to achieve nearly all state-of-the-art results for image data, particularly in medical imaging analysis. In this domain, where high-quality supervised samples are generally scarce and fraught with legal concerns regarding patient privacy, image augmentation is considered a de facto technique [176,177,178]. Medical data also suffer from the so-called “p large, n small” problem (where p is the number of features and n is the number of samples). Hence, some works  attempted to fight the curse of dimensionality along with the curse of data scarcity by proposing a dimensionality reduction-based method that can be used for data augmentation. Unfortunately, dataset augmentation is not as straightforward to apply in other domains as it is for images. Current efforts to explore DA in non-image domains mainly include sound, speech, and text augmentation. In this vein, Schluter and Grill  investigated a variety of DA techniques for singing voice detection. Wei et al.  proposed a text augmentation technique for improving the performance of NLP applications.
A common assumption in most ML algorithms states that the training and future (unknown) data must be drawn from the same data space and follow the same distribution  (as stressed before, following the PAC-learnability criteria, the distribution D must be stationary; see the "Background" section). This implies that when the task to be learned or its domain changes, the model needs to be rebuilt from scratch using newly collected training data. This paradigm is called single-task learning or isolated learning. The fundamental problem with this way of learning is that it does not consider any other related information or previously learned knowledge to alleviate the need for training data for a given task. This is in sharp contrast to how we humans learn. As discussed in the "Background" section, human learning is very knowledge-driven: we accumulate and maintain the knowledge learned from previous tasks and use it seamlessly in learning new tasks and solving new problems with little data and effort. Towards the ultimate goal of building machines that learn like humans, some research areas have attempted to break the exclusive dependency on training data by exploring the idea of using prior knowledge as additional input for ML models apart from standard training data. We characterize this family of approaches as the knowledge sharing strategy. Depending on how, when, and to what extent knowledge is shared, the research is conducted under different guises; however, all approaches share the same spirit: reusing knowledge instead of relying solely on the tasks’ training data. Next, we investigate the four main ways of sharing knowledge found in the literature, namely (A) Transfer Learning, (B) Multi-Task Learning, (C) Lifelong Learning, and (D) Meta-Learning.
Inspired by human beings’ capabilities to transfer knowledge across tasks, Transfer Learning (TL) aims to improve learning and minimize the amount of labeled samples required in a target task by leveraging knowledge from the source task. Following the Pan et al.  definition: given a source domain DS and a learning task TS, a target domain DT and a learning task TT, TL aims to help improve the learning of the target predictive function fT(.) in DT using the knowledge in DS and TS, where DS ≠ DT or TS ≠ TT. Accordingly, TL allows the tasks and distributions used in training (source) and testing (target) to be different. When the target and source domains are the same, i.e., DS = DT, and their learning tasks are the same, i.e., TS = TT, the learning problem becomes a traditional ML problem.
Surveys  and  proposed and discussed a taxonomy of TL which has been widely accepted and used. Depending on the availability of labeled data in the source and/or target domains, they distinguished between (i) inductive TL, (ii) transductive TL, and (iii) unsupervised TL, which correspond respectively to the case of having labeled target domain data available, the case of having labeled source and no labeled target domain data, and the case of having neither labeled source nor labeled target domain data. Domain adaptation, the DA technique discussed before, is a type of transductive TL in which the source task and the target task are the same but their domains are different. Furthermore, regardless of the availability of labeled and unlabeled data, TL problems can generally be categorized into two main classes : homogeneous transfer learning and heterogeneous transfer learning. The former category focuses on generalization performance across the same domain representations, meaning that the samples in the source domain and those in the target domain share the same representation structure but follow different probability distributions. The majority of TL approaches belong to this category. In the latter category, the feature spaces of the source and target are nonequivalent and generally non-overlapping. This case is more challenging, as knowledge is available from the source data but is represented differently from that of the target. Heterogeneous TL thus requires feature and/or label space transformations to bridge the gap for knowledge transfer, as well as handling of the cross-domain differences in data distribution.
The effectiveness of any transfer method depends on the source task and how it is related to the target task. A transfer method would produce positive transfer between appropriately related tasks, while negative transfer occurs when the source task is not sufficiently related to the target task or if the relationship is not well leveraged by the transfer method . Increasing positive transfer, and avoiding negative transfer is one of the major challenges in developing transfer methods.
TL methods in the literature share the same function: leveraging the knowledge in the source domain. Three classes of TL methods can be defined based on the type of shared knowledge: instance, feature, or model (parameter). Accordingly, we can distinguish between (i) instance-based TL approaches, which reuse labeled data from the source domain by re-weighting or resampling instances to help train a more precise model for a target learning task ; (ii) feature-based TL approaches, in which the transfer operates in an abstracted “feature space” instead of the raw input space, with the aim of minimizing domain divergence and reducing error rates by identifying good feature representations that can be carried over from the source to the target domain ; and (iii) model-based TL approaches, also known as parameter-based TL, in which the transferred knowledge is encoded into model parameters, priors, or model architectures. The goal of this last class of approaches is therefore to discover which part of the model learned in the source domain can help the learning of the model for the target domain . Model-based TL is arguably the most frequently used method. Additionally, we identified relational-based TL, which covers data that are not independent and identically distributed. The three main TL approaches implicitly assume that data instances are independent and identically distributed. However, real-world scenarios often contain some structure among the data instances, leading to relational structures in these domains, as in the social network domain. The family of relational-based TL approaches attempts to handle this issue by building a mapping of the relational knowledge between the source relational domain and the target relational domain .
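As an illustration of instance-based transfer, the sketch below re-weights labeled source instances by an estimated density ratio between target and source inputs, then fits a weighted model. It is a toy example under strong assumptions stated here: inputs are one-dimensional, each domain is approximated by a single Gaussian, and all function names and data are hypothetical. Real methods estimate the weights in more robust ways.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_weights(x_source, x_target):
    """Estimate w(x) = p_target(x) / p_source(x) from 1-D Gaussian fits."""
    p_s = gaussian_pdf(x_source, x_source.mean(), x_source.std())
    p_t = gaussian_pdf(x_source, x_target.mean(), x_target.std())
    return p_t / p_s

def weighted_linear_fit(x, y, w):
    """Weighted least squares for y ~ a * x + b; returns [a, b]."""
    design = np.column_stack([x, np.ones_like(x)])
    sqrt_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef

rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, 500)                    # labeled source inputs
y_src = np.sin(x_src) + rng.normal(0.0, 0.05, 500)   # source labels
x_tgt = rng.normal(1.5, 0.5, 200)                    # unlabeled target inputs

w = importance_weights(x_src, x_tgt)
coef_plain = weighted_linear_fit(x_src, y_src, np.ones_like(x_src))
coef_weighted = weighted_linear_fit(x_src, y_src, w)
```

Source instances that resemble the target domain receive larger weights, so the weighted fit concentrates on the region of input space where the target data actually lie.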
In the studied literature, TL methods are used in classic learning tasks including classification, regression, and clustering; relatively fewer but impactful works have also handled TL for reinforcement learning . Successful applications of TL include computer vision , NLP , and urban computing . Emerging and promising research lines in TL include (i) hybrid-based approaches, i.e. TL solutions that focus on transferring knowledge through the combination of different TL methods, for instance by using both instances and shared parameters. This is a relatively new approach and a lot of interesting research is emerging . (ii) Deep transfer learning: as deep learning becomes a ubiquitous technique, researchers have begun to endow deep models with TL capabilities. The powerful expressive ability of deep learning has also been leveraged to extract and transfer knowledge such as the relationships among categories. Fine-tuning  is a prominent example of a popular and effective technique for transferring knowledge in the form of model parameters from pre-trained models. The knowledge distillation technique , which involves a teacher network and a student network, is also a good example of this line of work. (iii) Transitive TL , a new type of TL problem where the source and target domains have very few common factors, making most TL solutions invalid. Following the human learning model, which can conduct transitive inference and learning, novel TL solutions have been proposed to connect the source and target domains through one or more intermediate domains that share some factors with them. (iv) AutoTL, which addresses the issue of learning to transfer automatically . Wei et al.  proposed a transfer learning framework, L2T, that automatically explores the space of TL method candidates to discover and apply the optimal TL method that maximally improves the learning performance in the target domain.
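The teacher-student idea behind knowledge distillation can be illustrated with its loss function. The sketch below follows the commonly used formulation that mixes a hard-label cross-entropy with a temperature-softened KL term; the function names, default temperature, and mixing weight `alpha` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5, eps=1e-12):
    """alpha * CE(student, labels) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    hard = softmax(student_logits)
    ce = -np.mean(np.log(hard[np.arange(len(labels)), labels] + eps))
    p_teacher = softmax(teacher_logits, temperature)       # softened teacher output
    p_student = softmax(student_logits, temperature)       # softened student output
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=1))
    return float(alpha * ce + (1.0 - alpha) * temperature ** 2 * kl)

# Toy logits (hypothetical values): one student agrees with the teacher, one does not.
teacher = np.array([[4.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
labels = np.array([0, 1])
student_close = teacher.copy()
student_far = np.array([[0.0, 1.0, 4.0], [3.0, 0.0, 1.0]])
loss_close = distillation_loss(student_close, teacher, labels)
loss_far = distillation_loss(student_far, teacher, labels)
```

The softened teacher distribution carries information about how classes relate to each other, which is the knowledge the student inherits in addition to the hard labels.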
If a TL method aims to improve the performance of the source task and the target task simultaneously, we are dealing with a multi-task learning (MTL) problem . MTL shares the general goal of leveraging knowledge across different tasks. However, unlike in TL, there is no distinction between source and target tasks: multiple related tasks, each of which has insufficient labeled data to train a model independently, are learned jointly using a shared representation. The training data from the extra tasks then serve as an inductive bias, acting in effect as constraints for the other tasks and improving overall accuracy and the speed of learning. As a result, the performance of all tasks is improved at the same time with no task prioritized. MTL is clearly close to TL; in some literature it is even considered a type of inductive TL . This is why it is generally acknowledged that an MTL problem can be approached with TL methods, whereas the reverse is not possible . Some works investigated hybrid scenarios where a new task arrives when multiple tasks have already been learned jointly by some MTL method. This can be seen as an MTL problem for the old tasks and a TL problem for leveraging knowledge from the old tasks for the new task. Such a setting is called asymmetric multi-task learning .
A variety of methods has been used for MTL; basically, each type of learning task corresponds to a different MTL setting . Accordingly, (i) multi-task supervised learning is based on labeled training data for each task. As for TL, research in this area has been conducted in three categories: (a) feature-based multi-task supervised learning, specifically the problems of feature selection  and feature transformation ; (b) model-based multi-task supervised learning, notably the low-rank approach , the task clustering approach , and the task relation learning approach ; and (c) instance-based multi-task supervised learning , on which only very modest contributions have been made. (ii) In multi-task unsupervised learning, each task deals with discovering useful patterns in data. (iii) In multi-task semi-supervised learning, tasks base their predictions on labeled as well as unlabeled data. (iv) In multi-task active learning, each task selects representative unlabeled data to query an oracle in the hope of reducing the labeling cost as much as possible. (v) In multi-task reinforcement learning, each task aims to maximize the cumulative reward by choosing actions. (vi) In multi-task multi-view learning, each task exploits multi-view data. Recent years have witnessed extensive studies on streaming data, known as online multi-task learning . This class of methods is used when the training data of multiple tasks arrive sequentially; hence, (vii) in multi-task online learning, each task processes sequential data.
In settings where MTL consists of tasks of different types, including supervised learning, unsupervised learning, reinforcement learning, etc., the MTL is characterized as heterogeneous, in contrast to homogeneous MTL, which consists of tasks of the same type. Unless explicitly stated otherwise, the default MTL setting is the homogeneous one .
Given the nature of its process, MTL has been studied in decentralized settings where each machine learns a separate, but related, task. In this vein, multiple parallel and distributed MTL models have been introduced in the recent literature [209,210,211]. Recently, research on MTL using DNNs has produced a wide spectrum of approaches that have yielded impressive results on tasks and applications such as image processing , NLP , and biomedicine . Conversely, there have been exciting results using MTL methods in DNNs. Generally, there are two commonly used approaches to carrying out MTL in deep learning: hard and soft parameter sharing . Hard parameter sharing implies that the hidden layers are shared between all tasks while the output layers are task-specific. Soft parameter sharing gives each task its own model with its own parameters, and the distance between these model parameters is regularized to facilitate the sharing of learning.
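The difference between the two sharing schemes can be sketched with a tiny NumPy example. The layer sizes, task names, and the squared-distance penalty are assumptions chosen for illustration, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hard sharing: one hidden layer shared by all tasks, one output head per task.
shared_W = rng.normal(size=(4, 8))
heads = {"task_a": rng.normal(size=(8, 1)), "task_b": rng.normal(size=(8, 3))}

def hard_sharing_forward(x, task):
    hidden = np.tanh(x @ shared_W)      # representation common to every task
    return hidden @ heads[task]         # task-specific output layer

# Soft sharing: each task keeps its own hidden layer; a penalty ties them together.
task_W = {"task_a": rng.normal(size=(4, 8)), "task_b": rng.normal(size=(4, 8))}

def soft_sharing_penalty(weights, strength=0.1):
    """Sum of squared distances between the parameters of every pair of tasks."""
    names = list(weights)
    total = 0.0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            total += np.sum((weights[names[i]] - weights[names[j]]) ** 2)
    return strength * total

x = rng.normal(size=(5, 4))             # batch of 5 inputs with 4 features
out_a = hard_sharing_forward(x, "task_a")
out_b = hard_sharing_forward(x, "task_b")
penalty = soft_sharing_penalty(task_W)
```

In the soft scheme, the penalty would be added to the sum of the per-task losses, so each task's parameters can still diverge when the data require it.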
One of the long-standing challenges for both biological systems and computational models (especially ANNs) is the stability-plasticity dilemma . The basic idea is that a learner requires plasticity for the integration of new knowledge, but also stability in order to prevent the forgetting of previous knowledge. The dilemma is that while both are desirable properties, the requirements of stability and plasticity are in conflict. Stability depends on preserving the structure of representations, whereas plasticity depends on altering it. An appropriate balance is difficult to achieve. Generally, ANN models often tend to have excessive plasticity, a problem that is dramatically referred to as “catastrophic forgetting” (or “catastrophic interference”) , which basically means the loss or disruption of previously learned knowledge when a new task is learned. Recently, a number of approaches have been proposed to mitigate catastrophic forgetting. They aim to design models that are sensitive to, but not disrupted by, new data. These approaches are categorized as lifelong/continual learning (LL) approaches. LL embodies a knowledge sharing process as it makes use of prior knowledge from past observed tasks to help continuously learn new/future tasks. Hence, LL studies scenarios where a large number of tasks arrive over time. Thus, to deal with the continuous stream of information, LL approaches essentially include two elements: (a) a retention strategy to sequentially retain previously learned knowledge and (b) a transfer mechanism to selectively transfer that knowledge when learning a new task. Most of the research effort in LL has focused primarily on how to retain knowledge; in doing so, the focus has shifted to countering catastrophic forgetting. Various approaches have been proposed to this end, including (i) architectural methods, (ii) regularization methods, and (iii) rehearsal methods .
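Of the three families, rehearsal methods are the simplest to illustrate: a small memory of past examples is replayed alongside new data so that earlier tasks keep contributing to training. The sketch below is one possible design, assuming a fixed-size buffer filled by reservoir sampling; the class and function names are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep each example seen so far with equal probability.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, batch_size):
        return self.rng.sample(self.items, min(batch_size, len(self.items)))

def mixed_batch(new_examples, buffer, replay_size):
    """Combine new-task examples with replayed ones, then store the new ones."""
    batch = list(new_examples) + buffer.sample(replay_size)
    for example in new_examples:
        buffer.add(example)
    return batch

buffer = ReplayBuffer(capacity=5)
for i in range(100):                       # stream of examples from an old task
    buffer.add(("task1", i))
batch = mixed_batch([("task2", i) for i in range(3)], buffer, replay_size=2)
```

Training on such mixed batches trades a small amount of memory for stability, which is the retention side of the retention-transfer pair described above.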
A high-level analysis of the LL literature shows that, since its introduction 25 years ago in , the LL concept has mainly evolved with respect to four learning paradigms:
Lifelong supervised learning: Early contributions in this area were based on memory systems and neural networks. Thrun  proposed two memory-based learning methods: k-nearest neighbors and Shepard’s method. Although they are still used today, memory-based systems suffer from the drawback of large working memory requirements, as they require explicit storage of old information . At the neural network level, Thrun and Mitchell  initially worked on an LL approach called explanation-based neural networks (EBNN). Since then, Silver et al. have worked extensively on extending and improving neural network approaches [221,222,223]. Furthermore, a lifelong naive Bayesian classification technique was proposed by Chen et al.  and applied to a sentiment analysis task. Ruvolo and Eaton  proposed the Efficient Lifelong Learning Algorithm (ELLA), which improves an MTL method to make it an LL method. Clingerman and Eaton  proposed GP-ELLA to support Gaussian processes in ELLA.
Lifelong unsupervised learning: Works in this area are mainly about lifelong topic modeling and lifelong information extraction. Lifelong topic modeling approaches extract knowledge from the topic modeling results of many previous tasks and utilize this knowledge to generate coherent topics in the new task; related works in this vein include [227, 228]. As the process of information extraction is by nature continuous and cumulative, information extraction represents an evident area for applying LL. Significant works in this line of research include [229, 230].
Lifelong semi-supervised learning: The most well-known and impactful work in this area is NELL, which stands for Never-Ending Language Learner [231 − 230]. NELL is a lifelong semi-supervised learning system that has been reading the Web continuously for information extraction since January 2010, and it has accumulated millions of entities and relations.
Lifelong reinforcement learning: Thrun and Mitchell  first studied lifelong reinforcement learning for robot learning. Recently, many works have been proposed in this area due to the recent surge in research in RL after being successfully used in the computer program. Bou Ammar et al.  presented a policy gradient efficient lifelong reinforcement learning algorithm. Tessler et al.  proposed a lifelong learning system that transfers reusable skills to solve tasks in a video game. Rolnick et al.  introduced CLEAR, a replay-based method that greatly reduces catastrophic forgetting in multi-task reinforcement learning.
By analyzing the LL literature, we note that despite the first pioneering attempts and early speculations, research in this field was not carried out extensively until recent years. In their book, Chen et al.  emphasized some reasons behind the slow advancement. The main reason, according to them, is that ML research in the past 20 years focused only on statistical and algorithmic approaches. Moreover, much of the past ML research and applications focused on supervised learning using structured data, which is not easy for LL because there is little to be shared across tasks or domains. They also underlined the fact that many effective ML methods such as SVM and deep learning cannot easily use prior knowledge even if such knowledge exists. Recently, however, as most of the limits caused by these factors have been overcome, LL is increasingly becoming a rich area of scientific contributions, and new approaches have emerged, notably continual learning in DNNs  and lifelong interactive knowledge learning for chatbots . Still, we believe that the existing LL literature does not sufficiently cover the evaluation aspect, that is, what makes an LL system successful, how to compare existing LL algorithms, and what metrics are most useful to report. Hence, much more effort is expected in this research area in the years to come.
Meta-learning, or learning to learn (LTL), improves the learning of a new task by using meta-knowledge extracted across tasks . In a nutshell, LTL treats learning tasks as learning examples. It aims to improve the learning algorithm itself, given the experience of multiple learning episodes. Basically, in a meta-learning system, we distinguish the meta-learner, which is the model that learns across episodes, from the inner-learner, which is instantiated and trained inside an episode by the meta-learner. More specifically, the inner-learner model, typically a CNN classifier, is initialized and then trained on the support set (e.g., the base training set). The algorithm used to train the inner-learner is defined by the meta-learner model. The latter updates the inner-learner so that it can improve while solving a task in the classic way (base learning) with only a very small set of training examples. At the end of the episode, the meta-learner’s parameters are updated based on the loss resulting from the task learning error . Thus, meta-learning is tightly linked to the process of collecting and exploiting meta-knowledge. Meta-knowledge is collected by extracting algorithm configurations such as hyperparameter settings, pipeline compositions and/or network architectures, the resulting model evaluations, the learned model parameters, as well as measurable properties of the task itself, also known as meta-features. The meta-knowledge is then transferred to guide the search for optimal models for new tasks .
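The episodic structure described above can be illustrated with a deliberately small example. The sketch below uses a Reptile-style first-order update as a stand-in for the meta-learner: the meta-knowledge is just a shared initialization, and after each episode it is moved towards the weights the inner-learner reached, rather than being updated from a separate episode loss. The task family (1-D linear regression), the learning rates, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task is a 1-D regression problem y = slope * x (hypothetical family)."""
    slope = rng.uniform(1.0, 3.0)
    x = rng.uniform(-1.0, 1.0, size=10)        # small support set
    return x, slope * x

def inner_train(w_init, x, y, lr=0.1, steps=5):
    """Inner-learner: a few gradient steps on the support set, starting at w_init."""
    w = w_init
    for _ in range(steps):
        grad = 2.0 * np.mean((w * x - y) * x)  # gradient of the mean squared error
        w = w - lr * grad
    return w

meta_w = 0.0      # meta-knowledge: the initialization shared across tasks
meta_lr = 0.1
for episode in range(2000):
    x, y = sample_task()
    adapted_w = inner_train(meta_w, x, y)      # learning inside the episode
    meta_w += meta_lr * (adapted_w - meta_w)   # meta-update across episodes

# Adapting to an unseen task from the learned initialization vs. from scratch.
x_new = np.linspace(-1.0, 1.0, 10)
y_new = 2.5 * x_new
err_meta = abs(inner_train(meta_w, x_new, y_new) - 2.5)
err_scratch = abs(inner_train(0.0, x_new, y_new) - 2.5)
```

After meta-training, the shared initialization sits near the centre of the task family, so the same five inner steps bring the inner-learner closer to a new task than they would from an uninformed start.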
From our perspective, we consider LTL a tool for knowledge sharing more than an approach to reusing knowledge per se. Indeed, in the surveyed literature, LTL is usually introduced as a method to solve other knowledge-sharing scenarios. In particular, LTL is commonly described as the de facto method to solve few-shot learning (FSL) problems, a regime where only a few experiences are available. Therefore, in the following we review LTL methods with respect to the three discussed approaches, namely TL, MTL, and LL, while shedding light on FSL, the most popular instantiation of LTL in the field of supervised learning.
Meta-learning-based methods for FSL
As the name implies, FSL refers to the problem of learning a new concept or task with only a few training examples, or with no pre-labeled learning example at all. FSL is not a knowledge sharing approach itself, but an umbrella term encompassing techniques that make use of prior knowledge to deal with data scarcity scenarios. There are three main variants of FSL: (i) zero-shot learning, which deals with learning a task that has no associated labeled training samples, (ii) one-shot learning, where tasks are learned from a single example, and (iii) low-shot learning, which assumes that a handful (typically 2–5) of labeled examples exist for target/novel classes. Recently, FSL has seen several successful applications in the literature, including few-shot classification, few-shot object detection, semantic segmentation, and landmark prediction. Generally, existing FSL models fall into two main groups. (i) Hallucination-based methods (in practice, data augmentation) deal directly with data scarcity by “learning to augment”; however, DA can alleviate the issue but does not solve it. In this section, we focus on the second group, (ii) meta-learning-based methods, which tackle the FSL problem by “learning to learn”. The majority of methods in this class can be labeled as either metric learning algorithms or gradient-based meta-learners.
Metric learning algorithms: These methods address the FSL problem by “learning to compare”. The basic idea of metric learning is to learn a distance function between data points (such as images). It has proven very useful for solving the FSL problem in classification tasks: instead of fine-tuning on the support set (the few labeled images), metric learning algorithms classify query images by comparing them to the labeled images. Koch et al. proposed Siamese Neural Networks to solve few-shot image classification. Their model learns a siamese network with metric-learning losses on source data, and reuses the network’s features for the target one-shot learning task. Vinyals et al. proposed Matching Networks, which use an episodic training mechanism. Snell et al. introduced Prototypical Networks, which learn a metric space in which classification can be performed by computing distances to prototype representations of each class. Sung et al. proposed the Relation Network, which uses a CNN-based relation module as a distance metric. Li et al. designed a model named Covariance Metric Networks (CovaMNet) to exploit both the covariance representation and the covariance metric, based on distribution consistency, for few-shot classification tasks. Wertheimer et al. localized objects using a bounding box. Garcia et al. used a Graph Neural Network-based model. Despite the rich contributions in this line of research, the relation measure, that is, how to robustly measure the relationship between a concept and a query image, remains a key issue in this class of FSL methods.
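The “learning to compare” idea can be illustrated with the prototype-based classification rule mentioned above. The sketch below is a simplified illustration, not the implementation of any cited work: it assumes the embeddings are already given (raw 2-D feature vectors stand in for the output of a learned network), uses the squared Euclidean distance, and the helper names `prototypes` and `classify` are hypothetical.

```python
import numpy as np


def prototypes(support_x, support_y):
    """Mean embedding (prototype) of each class present in the support set."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    return classes, protos


def classify(query_x, classes, protos):
    """Assign each query point to the class of its nearest prototype."""
    # Pairwise squared Euclidean distances, shape (n_query, n_classes).
    d2 = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]


# Toy 2-way 2-shot support set; embeddings are assumed to be precomputed.
support_x = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]])
support_y = np.array([0, 0, 1, 1])
classes, protos = prototypes(support_x, support_y)
pred = classify(np.array([[0.1, 0.1], [0.9, 1.0]]), classes, protos)
print(pred)  # [0 1]
```

No fine-tuning on the support set takes place: the few labeled examples are only averaged into prototypes, and each query is labeled by comparison with them.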
Gradient-based Meta-Learning: These methods address the FSL problem by “learning to optimize”. They embed gradient-based optimization into the meta-learner. More specifically, in such models there is an inner-loop optimization process that is partially or fully parameterized with fully differentiable modules. The methods of this class differ according to their choice of meta-model algorithm. The best-known meta-learners in the literature are perhaps (i) Meta-LSTM, introduced by Ravi & Larochelle, a meta-learner that uses a Long Short-Term Memory network to replace the stochastic gradient descent optimizer and the weight-update mechanism, and (ii) Model-Agnostic Meta-Learning (MAML), currently one of the most elegant and promising LTL algorithms. MAML provides a good initialization of a model’s parameters to achieve fast learning on a new task with only a small number of gradient steps. This method is compatible with any model trained with gradient descent (hence model-agnostic), and has been shown to be effective in many classification and reinforcement learning applications. Following this line of work, many recent studies [315,316,317,318] focused on learning better initializations by adaptively learning task-dependent modifications. In these works, the inner-loop optimization is generally based on first-order optimizers such as SGD and Adam. A few recent studies propose optimizer-centric approaches [319,320,321], that is, models that focus not only on adjusting the optimizer algorithm but also on learning the inner optimizer itself.
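The inner-loop/outer-loop structure of a MAML-style learner can be sketched on a deliberately tiny problem. The example below is an assumption-laden illustration rather than the published algorithm’s reference code: the model is a single scalar weight `w` fitting one-dimensional tasks of the form y = a·x, the gradient and the second-order term are written analytically instead of using automatic differentiation, and the step sizes `alpha` and `beta`, the task slopes, and the function names are all hypothetical.

```python
import numpy as np


def loss(w, x, y):
    """Mean squared error of the scalar model y_hat = w * x."""
    return np.mean((w * x - y) ** 2)


def grad(w, x, y):
    """Derivative of the loss with respect to w."""
    return 2.0 * np.mean(x * (w * x - y))


def hessian(x):
    """Second derivative of the loss with respect to w (independent of w)."""
    return 2.0 * np.mean(x ** 2)


def maml_train(slopes, w0=0.0, alpha=0.1, beta=0.05, steps=300, seed=0):
    """Meta-learn an initialization w that adapts well after one inner step."""
    rng = np.random.default_rng(seed)
    w = w0
    for _ in range(steps):
        meta_grad = 0.0
        for a in slopes:
            xs = rng.uniform(-1, 1, 5)  # support inputs of this task
            xq = rng.uniform(-1, 1, 5)  # query inputs of this task
            # Inner loop: one gradient step on the support set.
            w_adapted = w - alpha * grad(w, xs, a * xs)
            # Outer gradient: chain rule through the inner step,
            # d(w_adapted)/dw = 1 - alpha * hessian(xs).
            meta_grad += grad(w_adapted, xq, a * xq) * (1.0 - alpha * hessian(xs))
        w -= beta * meta_grad / len(slopes)
    return w


w_meta = maml_train(slopes=[1.0, 3.0])

# Fast adaptation to an unseen task (slope 2.5) with a single gradient step.
x_new = np.linspace(-1.0, 1.0, 5)
y_new = 2.5 * x_new
w_fast = w_meta - 0.1 * grad(w_meta, x_new, y_new)
print(loss(w_meta, x_new, y_new), loss(w_fast, x_new, y_new))
```

With these two symmetric training tasks the learned initialization settles near the midpoint of the task slopes, from which a single inner step reduces the loss on the new task. Dropping the factor `1.0 - alpha * hessian(xs)` would give a first-order approximation of the same update.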
Meta-learning in TL setting
Many works in the literature have combined TL with LTL. Aiolli proposed an approach to transfer learning based on meta kernel learning. Eshratifar et al. proposed a joint training approach that combines both TL and meta-learning loss functions into a learning algorithm. Sun et al. proposed a novel FSL method called meta-transfer learning, which learns to adapt a DNN for FSL tasks. Later, the authors introduced the hard task meta-batch scheme as a learning curriculum that further boosts the learning efficiency of the proposed meta-transfer learning. Li et al. put forward a novel meta-transfer feature method (MetaTrans) for measuring the transferability among domains. Recent applications of meta-transfer learning include learning to disentangle causal mechanisms, zero-shot super-resolution, code-switched speech recognition, and adaptive vehicle tracking in UAV videos.
Meta-learning in MTL setting
LTL has recently emerged as an important direction for developing MTL algorithms. Indeed, meta-learning can benefit MTL, notably by learning the relatedness between tasks or how to prioritize among multiple tasks. In this vein, Lin et al. proposed adaptive auxiliary task weighting to speed up training for reinforcement learning. Franceschi et al. proposed forward and reverse gradient-based hyperparameter optimization for learning task interactions. Epstein et al. proposed a meta-learning framework for extracting shared features among multiple tasks that are learned simultaneously. Chen et al. used a shared meta-network to capture the meta-knowledge of semantic composition and generate the parameters of the task-specific semantic composition models in the MTL setting. Amit et al. proposed a PAC-Bayes meta-learning method designed for multi-task learning.
Meta-learning in LL setting
LL can also be realized through LTL. Riemer et al. proposed a framework called Meta-Experience Replay (MER) that integrates meta-learning and experience replay for continual learning. Javed et al. proposed OML, a meta-learning objective that directly minimizes catastrophic interference by learning representations that accelerate future learning and are robust to forgetting under online updates in continual learning. He et al. proposed a task-agnostic continual learning framework based on meta-learning, implemented by separating task-specific parameters from task-agnostic parameters, where the latter are optimized in a continual meta-learning fashion without access to multiple tasks at the same time. Munkhdalai et al. introduced a meta-learning model called MetaNet that supports meta-level LL by allowing an ANN to learn and generalize a new task or concept from a single example on the fly. Vuorio et al. proposed a meta-training scheme to optimize an algorithm for mitigating catastrophic forgetting. Xu et al. described an LTL method to improve word embeddings for a lifelong domain without a large corpus.
In this section, by knowledge sharing we referred to all types of learning based on prior experiences with other tasks. When, how, and what knowledge is shared determine the class of methods (Table 3 summarizes the reviewed classes of methods). Nevertheless, throughout the literature, we noted a number of terminological inconsistencies. Phrases such as “transfer learning” and “multi-task learning”, or “few-shot learning” and “meta-learning”, are sometimes used interchangeably. This is often a source of confusion, as the studied concepts are closely related and the boundaries between them are not always clear. Certainly, the reviewed approaches are similar in their common goal of knowledge reuse; however, they differ in the specific ways they handle knowledge transfer (Fig. 8 highlights the transfer mechanism of each approach). TL improves the learning of a target task through the transfer of knowledge from a related source task that has already been learned. MTL considers how to learn multiple tasks in parallel and exploit their intrinsic relationships, such that they help each other to be learned better. LL is sequential learning that continually learns over time by accommodating new knowledge while retaining previously learned experiences. Meta-learning transfers meta-knowledge across tasks; it can thus be considered a meta-solution for transferring knowledge in TL, MTL, and LL. FSL is a problem rather than a solution: it studies learning tasks from only a few experiences. Hence, the reviewed knowledge sharing solutions, particularly meta-learning approaches, can be used to solve this problem. Among the five concepts, TL is probably the broadest one, as all reviewed approaches involve, at some level, transfer-related operations. However, it is important to note that TL is unidirectional: its goal is to improve the learning of the target task only, while the learning of the source task(s) is irrelevant and not considered.
Similarly, LL (in its vanilla version) only transfers knowledge forward to help future learning and does not go back to improve the models of previous tasks. In MTL, by contrast, all tasks and data are provided together, allowing the model to be trained on, and thus to improve, all tasks at the same time, but at a potentially high computational cost. Recently, backward or reverse knowledge transfer has been increasingly studied in the context of LL. Furthermore, TL and MTL typically need only a few similar tasks and do not require the retention of explicit knowledge. LL, on the other hand, needs significantly more previous tasks in order to learn and accumulate a large amount of explicit knowledge, so that a new learning task can select the suitable knowledge to help its learning. Hence, the growth of the number of tasks and knowledge retention are key characteristics of LL, which is why many optimization efforts regarding these two aspects have been observed in the presented literature. On another note, meta-learning trains a meta-model on a large number of tasks so that it can quickly adapt to a new task with only a few examples. It can be useful for better knowledge retention through metric learning, for measuring relatedness between tasks, or for selecting the useful knowledge to be transferred. However, one key assumption made by most meta-learning techniques is that the training tasks and the test/new tasks come from the same distribution, whereas other approaches do not make this assumption. This is a major weakness that can limit the scope of LTL applications and that has to be seriously addressed in future LTL research.
Despite the differences underlined above, knowledge sharing approaches are clearly closely related: they share many challenging issues that are expected to preoccupy the future literature in this field, as well as key characteristics that allow them to work collaboratively and synergistically. For example, if we continuously apply TL in a learning system, we obtain a lifelong machine learning system; conversely, we can view TL as an LL system in the particular case where the number of tasks is two. LL could also be considered as online MTL, where we deal with multiple tasks and data points arrive in sequential order. Another special case of LL worth mentioning at this level is curriculum learning. Similarly to MTL, in this case all tasks and data are made available, but the problem is to identify the optimal order in which to train on the data for the most efficient and effective learning. An intuitive type of curriculum is to learn tasks from “easy” to “hard” (similar to the way humans often learn new concepts). Another common characteristic is the regularization effect: knowledge sharing approaches, especially those dealing with multiple tasks, benefit from the regularization effect due to parameter sharing and from the diversity of the resulting shared representation. They also, to some extent, implicitly augment data (e.g., domain adaptation).
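The “easy to hard” ordering mentioned above can be sketched in a few lines. This is only a schematic illustration: the difficulty measure (here, word length) and the staged schedule in which each stage adds harder examples to those already seen are hypothetical choices, and the reviewed works may define difficulty and pacing quite differently.

```python
def curriculum_stages(examples, difficulty, n_stages):
    """Yield growing training pools, ordered from easy to hard."""
    ordered = sorted(examples, key=difficulty)
    stage_size = -(-len(ordered) // n_stages)  # ceiling division
    for stage in range(1, n_stages + 1):
        # Each stage keeps the easier examples and adds the next harder ones.
        yield ordered[: stage * stage_size]


# Toy data: word length is used as a stand-in difficulty score.
examples = ["transfer", "to", "learn", "meta"]
stages = list(curriculum_stages(examples, difficulty=len, n_stages=2))
print(stages)  # [['to', 'meta'], ['to', 'meta', 'learn', 'transfer']]
```

A learner would then be trained on each pool in turn, so that the hardest examples are only introduced once the easier ones have been seen.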
On the other end of the spectrum, knowledge sharing approaches also share the same concerns. Notably, the effectiveness of all reviewed approaches depends on task relatedness, so defining task similarity is a key overarching challenge. As mentioned before, considerably less attention has been given to rigorous evaluation for comparing methods of the same approach or approaches of different natures. Also, dealing with knowledge implies answering some important questions, such as what forms of knowledge are important, how to represent them, and what kinds of reasoning capabilities are useful, since reasoning allows the system to infer new knowledge from existing knowledge, which can then be used in learning the new task. However, so far, little research has been done to address these questions in the knowledge sharing literature. Hence, we believe that research in knowledgeable systems needs more engagement and wider attention from academic researchers; more effort is needed to bring this field to maturity and make it able to compete with classical learning paradigms.
2.4 Hybrid learners
Data hungriness is mainly related to DNN when they are used in a supervised fashion; these models represent a branch of learning called connectionism. Another potential strategy to cure data hungriness would then be to think outside the box and look for other branches of learning that are more data-efficient. In his recent book, Domingos drew borders between five schools of thought in ML, namely symbolists, connectionists, evolutionaries, Bayesians, and analogizers. Driven by the same goal of building learning machines, each type of learner makes different assumptions about data. Evolutionaries take their roots in evolutionary biology; they use genetic algorithms to deal with the structure discovery problem. Being basically search and optimization algorithms, learners of this family require relatively less data. They are mainly used to optimize other data-hungry learners [277, 278], but they are known to be costly. Bayesians find their origins in statistics; they use probabilistic inference to cope with uncertainty. Algorithms of this family, such as naive Bayes classifiers, are mostly supervised; accordingly, they require a large amount of data. Similarly, analogizers also need data about the solution of a known situation in order to transfer it to a newly faced situation, mainly using kernel machines such as SVM; recommender systems are the most famous application of analogy-based learning. Generally, all three families obey the rule of “more data, better learning”. However, connectionists, represented by ANN, are without a doubt the most data-driven tribe. Inspired by neuroscience, this branch produces learning algorithms that find the connection weights that make it possible for a neural network to accomplish some intelligent task. Connectionism is generally associated with an empiricist position that considers all of the mind to be the result of learning and experience during life. According to connectionists, experiences/data are the only sources of learning: the more data we have, the more we can learn.
On the other end of the spectrum, symbolists are arguably the most data-efficient tribe. Symbolists view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic. They presume that the world can be understood in terms of structured representations and assume that intelligence can be achieved by the manipulation of symbols, through rules and logic operating on those symbols to encode knowledge. “Symbolic” AI is considered the classic AI and is sometimes referred to as GOFAI (Good Old-Fashioned AI). It was largely developed in an era with vastly less data and computational power than we have now. Symbolic AI bases its intelligent conclusions and decisions on memorized facts and rules rather than on massive raw data. However, it suffers from several drawbacks regarding generalization and adaptation to change, which, interestingly, are the strengths of connectionist models. The right move would then be to integrate connectionist models, which excel at perceptual classification, with symbolic systems, which excel at inference and abstraction. This movement is known in the literature as Neural-Symbolic Computing (NSC).
NSC aims at integrating robust connectionist learning and sound symbolic reasoning. The idea is to build a strong hybrid AI model that combines the reasoning power of rule-based software with the learning capabilities of neural networks. In a typical neural-symbolic system, knowledge is represented in symbolic form, whereas learning and reasoning are computed by a neural network. Hence, the symbolic component takes advantage of the neural network’s ability to process and analyze unstructured data. Meanwhile, the neural network benefits from the reasoning power of the rule-based AI system, which enables it to learn new things with much less data. It is believed that this fusion would help to build a new class of hybrid AI systems, conceived as a non-zero-sum game, that are much more powerful than the sum of their parts. It is also claimed that this way of perceiving intelligence is much more analogous to the brain, which uses mechanisms operating in both fashions. In that sense, NSC is expected to bring scientists closer to achieving true artificial human intelligence.
The integration of the symbolic and connectionist paradigms has been pursued by a relatively small research community over the last two decades. Recently, with the strong penetration of DNN and the rise of complaints regarding the explainability and data hungriness of these models, NSC has yielded several significant results that have been shown to offer powerful alternatives to opaque, data-hungry DNN. Yi et al. proposed NS-VQA, a neural-symbolic visual question answering approach that disentangles reasoning from visual perception and language understanding. The model uses DNN for inverse graphics and inverse language modeling, and a symbolic program executor to reason and answer questions. According to the authors, incorporating symbolic structure as prior knowledge offers three advantages: (i) robustness, (ii) interpretability, and (iii) data efficiency. They verified that the system performs well after learning on only a small amount of training data. In the same vein, Vedantam et al. also demonstrated that their neural-symbolic VQA model performs effectively in the low-data regime. Evans et al. proposed a differentiable inductive logic framework, which is a reimplementation of traditional Inductive Logic Programming (ILP) in an end-to-end differentiable architecture. The framework attempts to combine the advantages of ILP with those of neural network-based systems, yielding a data-efficient induction system that is robust to noisy and ambiguous data and that does not deteriorate when applied to small data.
Furthermore, the idea of neural-symbolic integration has also attracted the knowledge transfer community. The idea is to extract symbolic knowledge from a related domain and transfer it to improve learning in another domain, starting from a network that does not necessarily have to be instilled with background knowledge. In this vein, Silver discussed the link between NSC and LL, presenting an integrated framework for neural-symbolic integration and lifelong machine learning in which the symbolic component helps to retain and/or consolidate existing knowledge. Hu et al. proposed a self-transfer approach with symbolic-knowledge distillation. They developed an iterative distillation method that transfers the structured information of logic rules into the weights of neural networks. The transfer is done via a teacher network constructed using the posterior regularization principle. The proposed framework is applicable to various types of neural architectures, including CNN for sentiment analysis and RNN for named entity recognition.
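The teacher-student mechanism described above can be illustrated by one distillation step for a single input. The sketch is written in the spirit of posterior regularization and should be read as an assumption, not as the authors’ implementation: the teacher distribution is taken to be the student’s prediction reweighted by an exponential penalty on rule violations, the student loss is taken to be a convex mix of the label loss and the teacher-imitation loss, and the names `teacher_distribution`, `student_loss`, `C`, and `pi` are hypothetical.

```python
import numpy as np


def teacher_distribution(p_student, rule_values, C=1.0, rule_weights=None):
    """Project student predictions toward rule-consistent ones.

    p_student:   (n_classes,) predicted class probabilities.
    rule_values: (n_rules, n_classes) soft truth value in [0, 1] of each
                 rule if the input were assigned to each class.
    """
    p_student = np.asarray(p_student, dtype=float)
    rule_values = np.atleast_2d(np.asarray(rule_values, dtype=float))
    if rule_weights is None:
        rule_weights = np.ones(rule_values.shape[0])
    # Total weighted rule violation for each candidate class.
    penalty = np.asarray(rule_weights, dtype=float) @ (1.0 - rule_values)
    q = p_student * np.exp(-C * penalty)
    return q / q.sum()


def student_loss(p_student, y_onehot, q_teacher, pi=0.5):
    """Mix of cross-entropy to the true label and to the teacher's output."""
    log_p = np.log(np.asarray(p_student, dtype=float))
    ce_label = -np.sum(np.asarray(y_onehot) * log_p)
    ce_teacher = -np.sum(np.asarray(q_teacher) * log_p)
    return (1.0 - pi) * ce_label + pi * ce_teacher


# Toy case: an undecided student and one rule that is satisfied only by class 0.
p = np.array([0.5, 0.5])
rule = np.array([[1.0, 0.0]])
q = teacher_distribution(p, rule, C=np.log(3.0))
loss_value = student_loss(p, np.array([1.0, 0.0]), q, pi=0.5)
print(q)  # [0.75 0.25]
```

Iterating this step, with the student’s weights updated to reduce `student_loss` and the teacher rebuilt from the updated student, is one way to read the iterative transfer of rule information into network weights.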