2.1 Introduction

When we learn new skills, we rarely – if ever – start from scratch. We start from skills learned earlier in related tasks, reuse approaches that worked well before, and focus on what is likely worth trying based on experience [82]. With every skill learned, learning new skills becomes easier, requiring fewer examples and less trial-and-error. In short, we learn how to learn across tasks. Likewise, when building machine learning models for a specific task, we often build on experience with related tasks, or use our (often implicit) understanding of the behavior of machine learning techniques to help make the right choices.

The challenge in meta-learning is to learn from prior experience in a systematic, data-driven way. First, we need to collect meta-data that describe prior learning tasks and previously learned models. These comprise the exact algorithm configurations used to train the models, including hyperparameter settings, pipeline compositions, and/or network architectures; the resulting model evaluations, such as accuracy and training time; the learned model parameters, such as the trained weights of a neural net; and measurable properties of the task itself, also known as meta-features. Second, we need to learn from this prior meta-data, to extract and transfer knowledge that guides the search for optimal models for new tasks. This chapter presents a concise overview of different meta-learning approaches to do this effectively.

The term meta-learning covers any type of learning based on prior experience with other tasks. The more similar those previous tasks are, the more types of meta-data we can leverage, and defining task similarity will be a key overarching challenge. Perhaps needless to say, there is no free lunch [57, 188]. When a new task represents completely unrelated phenomena, or random noise, leveraging prior experience will not be effective. Luckily, in real-world tasks, there are plenty of opportunities to learn from prior experience.

In the remainder of this chapter, we categorize meta-learning techniques based on the type of meta-data they leverage, from the most general to the most task-specific. First, in Sect. 2.2, we discuss how to learn purely from model evaluations. These techniques can be used to recommend generally useful configurations and configuration search spaces, as well as transfer knowledge from empirically similar tasks. In Sect. 2.3, we discuss how we can characterize tasks to more explicitly express task similarity and build meta-models that learn the relationships between data characteristics and learning performance. Finally, Sect. 2.4 covers how we can transfer trained model parameters between tasks that are inherently similar, e.g. sharing the same input features, which enables transfer learning [111] and few-shot learning [126] among others.

Note that while multi-task learning [25] (learning multiple related tasks simultaneously) and ensemble learning [35] (building multiple models on the same task) can often be meaningfully combined with meta-learning systems, they do not in themselves involve learning from prior experience on other tasks.

This chapter is based on a very recent survey article [176].

2.2 Learning from Model Evaluations

Consider that we have access to prior tasks t j ∈ T, the set of all known tasks, as well as a set of learning algorithms, fully defined by their configurations θ i ∈ Θ; here Θ represents a discrete, continuous, or mixed configuration space which can cover hyperparameter settings, pipeline components and/or network architecture components. P is the set of all prior scalar evaluations P i,j = P(θ i, t j) of configuration θ i on task t j, according to a predefined evaluation measure, e.g. accuracy, and model evaluation technique, e.g. cross-validation. P new is the set of known evaluations P i,new on a new task t new. We now want to train a meta-learner L that predicts recommended configurations \(\Theta ^{*}_{new}\) for a new task t new. The meta-learner is trained on meta-data P ∪P new. P is usually gathered beforehand, or extracted from meta-data repositories [174, 177]. P new is learned by the meta-learning technique itself in an iterative fashion, sometimes warm-started with an initial \({\mathbf {P}}_{new}^{\prime }\) generated by another method.

2.2.1 Task-Independent Recommendations

First, imagine not having access to any evaluations on t new, hence \({\mathbf {P}}_{new} = \varnothing \). We can then still learn a function \(f: \Theta \times T \rightarrow \{\theta ^{*}_{k}\}\), k = 1..K, yielding a set of recommended configurations independent of t new. These \(\theta ^{*}_{k}\) can then be evaluated on t new to select the best one, or to warm-start further optimization approaches, such as those discussed in Sect. 2.2.3.

Such approaches often produce a ranking, i.e. an ordered set \(\theta ^{*}_{k}\). This is typically done by discretizing Θ into a set of candidate configurations θ i, also called a portfolio, evaluated on a large number of tasks t j. We can then build a ranking per task, for instance using success rates, AUC, or significant wins [21, 34, 85]. However, it is often desirable that equally good but faster algorithms are ranked higher, and multiple methods have been proposed to trade off accuracy and training time [21, 134]. Next, we can aggregate these single-task rankings into a global ranking, for instance by computing the average rank [1, 91] across all tasks. When there is insufficient data to build a global ranking, one can recommend subsets of configurations based on the best known configurations for each prior task [70, 173], or return quasi-linear rankings [30].

To find the best θ ∗ for a previously unseen task t new, a simple anytime method is to select the top-K configurations [21], going down the list and evaluating each configuration on t new in turn. This evaluation can be halted after a predefined number of configurations K, after a time budget is exhausted, or when a sufficiently accurate model is found. In time-constrained settings, it has been shown that multi-objective rankings (including training time) converge to near-optimal models much faster [1, 134], and provide a strong baseline for algorithm comparisons [1, 85].
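
To make this concrete, here is a minimal Python sketch of building a global average-rank ordering from a prior evaluation matrix and evaluating the top-K configurations in an anytime fashion. The evaluation matrix P and the `evaluate` function (e.g., returning cross-validated accuracy on t new) are assumptions for illustration.

```python
import numpy as np

def global_average_rank(P):
    """P: (n_tasks, n_configs) array of prior evaluations (higher is better).
    Returns configuration indices ordered by average rank across tasks."""
    ranks = np.argsort(np.argsort(-P, axis=1), axis=1) + 1  # rank 1 = best per task
    return np.argsort(ranks.mean(axis=0))

def top_k_anytime(P, evaluate, k=5):
    """Evaluate the globally top-k configurations on t_new in order, keep the best.
    `evaluate(config_index)` is a hypothetical scoring function for the new task."""
    best_config, best_score = None, -np.inf
    for config in global_average_rank(P)[:k]:
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```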

A very different approach to the one above is to first fit a differentiable function f j(θ i) = P i,j on all prior evaluations of a specific task t j, and then use gradient descent to find an optimized configuration \(\theta ^{*}_{j}\) per prior task [186]. Assuming that some of the tasks t j will be similar to t new, those \(\theta ^{*}_{j}\) will be useful for warm-starting Bayesian optimization approaches.

2.2.2 Configuration Space Design

Prior evaluations can also be used to learn a better configuration space Θ∗. While again independent from t new, this can radically speed up the search for optimal models, since only the more relevant regions of the configuration space are explored. This is critical when computational resources are limited, and has proven to be an important factor in practical comparisons of AutoML systems [33].

First, in the functional ANOVA [67] approach, hyperparameters are deemed important if they explain most of the variance in algorithm performance on a given task. In [136], this was explored using 250,000 OpenML experiments with 3 algorithms across 100 datasets.

An alternative approach is to first learn an optimal hyperparameter default setting, and then define hyperparameter importance as the performance gain that can be achieved by tuning the hyperparameter instead of leaving it at that default value. Indeed, even though a hyperparameter may cause a lot of variance, it may also have one specific setting that always results in good performance. In [120], this was done using about 500,000 OpenML experiments on 6 algorithms and 38 datasets. Default values are learned jointly for all hyperparameters of an algorithm by first training surrogate models for that algorithm for a large number of tasks. Next, many configurations are sampled, and the configuration that minimizes the average risk across all tasks is the recommended default configuration. Finally, the importance (or tunability) of each hyperparameter is estimated by observing how much improvement can still be gained by tuning it.
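
The sketch below illustrates the underlying idea of learning defaults and tunability from per-task surrogates; it is a simplified rendering under stated assumptions (each surrogate maps a configuration dict to a predicted score), not the exact procedure of [120].

```python
import numpy as np

def learn_default(surrogates, candidates):
    """Pick the candidate configuration (a dict of hyperparameter values) with the
    best average predicted performance across all tasks."""
    avg_scores = [np.mean([s(c) for s in surrogates]) for c in candidates]
    return candidates[int(np.argmax(avg_scores))]

def tunability(surrogates, candidates, default, hp):
    """Average gain from tuning only hyperparameter `hp`, keeping all other
    hyperparameters fixed at their learned default values."""
    gains = []
    for s in surrogates:
        baseline = s(default)
        best_tuned = max(s({**default, hp: c[hp]}) for c in candidates)
        gains.append(best_tuned - baseline)
    return float(np.mean(gains))
```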

In [183], defaults are learned independently from other hyperparameters, and defined as the configurations that occur most frequently in the top-K configurations for every task. In the case that the optimal default value depends on meta-features (e.g. the number of training instances or features), simple functions are learned that include these meta-features. Next, a statistical test defines whether a hyperparameter can be safely left at this default, based on the performance loss observed when not tuning a hyperparameter (or a set of hyperparameters), while all other parameters are tuned. This was evaluated using 118,000 OpenML experiments with 2 algorithms (SVMs and Random Forests) across 59 datasets.

2.2.3 Configuration Transfer

If we want to provide recommendations for a specific task t new, we need additional information on how similar t new is to prior tasks t j. One way to do this is to evaluate a number of recommended (or potentially random) configurations on t new, yielding new evidence P new. If we then observe that the evaluations P i,new are similar to P i,j, then t j and t new can be considered intrinsically similar, based on empirical evidence. We can include this knowledge to train a meta-learner that predicts a recommended set of configurations \(\Theta ^{*}_{new}\) for t new. Moreover, every selected \(\theta ^{*}_{new}\) can be evaluated and included in P new, repeating the cycle and collecting more empirical evidence to learn which tasks are similar to each other.

2.2.3.1 Relative Landmarks

A first measure for task similarity considers the relative (pairwise) performance differences, also called relative landmarks, RL a,b,j = P a,j − P b,j, between two configurations θ a and θ b on a particular task t j [53]. Active testing [85] leverages these as follows: it warm-starts with the globally best configuration (see Sect. 2.2.1), calls it θ best, and proceeds in a tournament-style fashion. In each round, it selects the ‘competitor’ θ c that most convincingly outperforms θ best on similar tasks, where tasks are deemed similar if the relative landmarks of all evaluated configurations are similar, i.e., if the configurations show similar performance differences on both t j and t new. Next, it evaluates the competitor θ c, yielding P c,new, updates the task similarities, and repeats. A limitation of this method is that it can only consider configurations θ i that were evaluated on many prior tasks.
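
A simplified illustration of relative landmarks and the resulting empirical task similarity is given below; the sign-agreement criterion used here is one possible instantiation, not necessarily the exact measure used in active testing [85].

```python
import numpy as np

def relative_landmarks(P_task):
    """P_task: dict mapping configuration index -> performance on one task.
    Returns the pairwise differences RL[a, b] = P_a - P_b."""
    return {(a, b): P_task[a] - P_task[b]
            for a in P_task for b in P_task if a != b}

def task_similarity(P_new, P_j):
    """Fraction of configuration pairs, evaluated on both tasks, whose relative
    landmarks have the same sign (i.e., the same winner on both tasks)."""
    shared = sorted(set(P_new) & set(P_j))
    pairs = [(a, b) for a in shared for b in shared if a < b]
    if not pairs:
        return 0.0
    agree = sum(np.sign(P_new[a] - P_new[b]) == np.sign(P_j[a] - P_j[b])
                for a, b in pairs)
    return agree / len(pairs)
```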

2.2.3.2 Surrogate Models

A more flexible way to transfer information is to build surrogate models s j(θ i) = P i,j for all prior tasks t j, trained using all available P. One can then define task similarity in terms of the error between s j(θ i) and P i,new: if the surrogate model for t j can generate accurate predictions for t new, then those tasks are intrinsically similar. This is usually done in combination with Bayesian optimization (see Chap. 1) to determine the next θ i.

Wistuba et al. [187] train surrogate models based on Gaussian Processes (GPs) for every prior task, plus one for t new, and combine them into a weighted, normalized sum, with the (new) predicted mean μ defined as the weighted sum of the individual μ j’s (obtained from prior tasks t j). The weights of the μ j’s are computed using the Nadaraya-Watson kernel-weighted average, where each task is represented as a vector of relative landmarks, and the Epanechnikov quadratic kernel [104] is used to measure the similarity between the relative landmark vectors of t j and t new. The more similar t j is to t new, the larger the weight assigned to the surrogate model for t j, increasing its influence on the combined prediction.
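
A sketch of this kernel-weighted combination of per-task surrogates is shown below, assuming scikit-learn-style GP surrogates with a predict method and precomputed relative-landmark vectors per task; the bandwidth is a free parameter, and the surrogate for t new itself would enter the same weighted sum (omitted here for brevity).

```python
import numpy as np

def epanechnikov(distance, bandwidth=1.0):
    """Epanechnikov quadratic kernel on a distance."""
    t = distance / bandwidth
    return 0.75 * (1.0 - t ** 2) if t < 1.0 else 0.0

def combined_mean(theta, surrogates, rl_vectors, rl_new, bandwidth=1.0):
    """Weighted, normalized sum of the per-task surrogate means mu_j(theta).
    surrogates[j] is a fitted GP with a .predict method; rl_vectors[j] is the
    relative-landmark vector representing prior task t_j."""
    weights, means = [], []
    for gp, rl in zip(surrogates, rl_vectors):
        weights.append(epanechnikov(np.linalg.norm(rl - rl_new), bandwidth))
        means.append(float(gp.predict(np.atleast_2d(theta))[0]))
    weights = np.asarray(weights)
    if weights.sum() == 0.0:
        return float(np.mean(means))
    return float(np.dot(weights, means) / weights.sum())
```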

Feurer et al. [45] propose to combine the predictive distributions of the individual Gaussian processes, which makes the combined model a Gaussian process again. The weights are computed following the agnostic Bayesian ensemble of Lacoste et al. [81], which weights predictors according to an estimate of their generalization performance.

Meta-data can also be transferred in the acquisition function rather than the surrogate model [187]. The surrogate model is only trained on P i,new, but the next θ i to evaluate is provided by an acquisition function which is the weighted average of the expected improvement [69] on P i,new and the predicted improvements on all prior P i,j. The weights of the prior tasks can again be defined via the accuracy of the surrogate model or via relative landmarks. The weight of the expected improvement component is gradually increased with every iteration as more evidence P i,new is collected.

2.2.3.3 Warm-Started Multi-task Learning

Another approach to relate prior tasks t j is to learn a joint task representation using the prior evaluations P. In [114], task-specific Bayesian linear regression [20] surrogate models \(s_{j}(\theta _{i}^{z})\) are trained on a transformed configuration representation θ z, produced by a feedforward neural network NN(θ i) that learns a suitable basis expansion of the original configuration θ in which linear surrogate models can accurately predict P i,new. The surrogate models are pre-trained on OpenML meta-data to provide a warm-start for optimizing NN(θ i) in a multi-task learning setting. Earlier work on multi-task learning [166] assumed that we already have a set of ‘similar’ source tasks t j. It transfers information between these t j and t new by building a joint GP model for Bayesian optimization that learns and exploits the exact relationship between the tasks. Learning a joint GP tends to be less scalable than building one GP per task, though. Springenberg et al. [161] also assume that the tasks are related and similar, but learn the relationship between tasks during the optimization process using Bayesian Neural Networks. As such, their method is somewhat of a hybrid of the previous two approaches. Golovin et al. [58] assume a sequence order (e.g., time) across tasks. They build a stack of GP regressors, one per task, training each GP on the residuals relative to the regressor below it. Hence, each task uses the tasks before it to define its priors.

2.2.3.4 Other Techniques

Multi-armed bandits [139] provide yet another approach to find the source tasks t j most related to t new [125]. In this analogy, each t j is one arm, and the (stochastic) reward for selecting (pulling) a particular prior task (arm) is defined in terms of the error in the predictions of a GP-based Bayesian optimizer that models the prior evaluations of t j as noisy measurements and combines them with the existing evaluations on t new. The cubic scaling of the GP makes this approach less scalable, though.

Another way to define task similarity is to take the existing evaluations P i,j, use Thompson Sampling [167] to obtain the optima distribution \(\rho ^{j}_{max}\), and then measure the KL-divergence [80] between \(\rho ^{j}_{max}\) and \(\rho ^{new}_{max}\) [124]. These distributions are then merged into a mixture distribution based on the similarities and used to build an acquisition function that predicts the next most promising configuration to evaluate. So far, this approach has only been evaluated on tuning 2 SVM hyperparameters using 5 tasks.

Finally, a complementary way to leverage P is to recommend which configurations should not be used. After training surrogate models per task, we can look up which t j are most similar to t new, and then use s j(θ i) to discover regions of Θ where performance is predicted to be poor. Excluding these regions can speed up the search for better-performing ones. Wistuba et al. [185] do this using a task similarity measure based on the Kendall tau rank correlation coefficient [73] between the ranks obtained by ranking configurations θ i using P i,j and P i,new, respectively.
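
As an illustration, such a rank-correlation-based task similarity could be computed as follows (a sketch assuming dictionaries mapping configuration indices to performances); the surrogates of the most similar tasks can then be queried to flag low-performance regions of Θ to skip.

```python
from scipy.stats import kendalltau

def kendall_task_similarity(P_new, P_j):
    """Kendall tau rank correlation between the performances of the
    configurations that were evaluated on both t_new and t_j."""
    shared = sorted(set(P_new) & set(P_j))
    if len(shared) < 2:
        return 0.0
    tau, _ = kendalltau([P_new[c] for c in shared], [P_j[c] for c in shared])
    return float(tau)
```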

2.2.4 Learning Curves

We can also extract meta-data about the training process itself, such as how fast model performance improves as more training data is added. If we divide the training into steps s t, usually adding a fixed number of training examples at every step, we can measure the performance P(θ i, t j, s t) = P i,j,t of configuration θ i on task t j after step s t, yielding a learning curve across the time steps s t. As discussed in Chap. 1, learning curves are also used to speed up hyperparameter optimization on a given task. In meta-learning, learning curve information is transferred across tasks.

While evaluating a configuration on a new task t new, we can halt the training after a certain number of iterations r < t, and use the partially observed learning curve to predict how well the configuration will perform on the full dataset based on prior experience with other tasks, and decide whether to continue the training or not. This can significantly speed up the search for good configurations.

One approach is to assume that similar tasks yield similar learning curves. First, define a distance between tasks based on how similar the partial learning curves are: dist(t a, t b) = f(P i,a,t, P i,b,t) with t = 1, …, r. Next, find the k most similar tasks t 1…k and use their complete learning curves to predict how well the configuration will perform on the new complete dataset. Task similarity can be measured by comparing the shapes of the partial curves across all configurations tried, and the prediction is made by adapting the ‘nearest’ complete curve(s) to the new partial curve [83, 84]. This approach was also successful in combination with active testing [86], and can be sped up further by using multi-objective evaluation measures that include training time [134].
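
The sketch below illustrates the idea, assuming each task stores, per configuration, the list of performances across steps; the curve adaptation shown (rescaling the nearest complete curve to the last observed point) is a deliberately crude stand-in for the adaptation used in [83, 84].

```python
import numpy as np

def task_distance(curves_a, curves_b, r):
    """Distance between two tasks based on the first r steps of the learning
    curves of all configurations tried on both tasks."""
    shared = set(curves_a) & set(curves_b)
    return float(np.mean([np.linalg.norm(np.asarray(curves_a[c][:r]) -
                                         np.asarray(curves_b[c][:r]))
                          for c in shared]))

def predict_final(partial_curve, full_curve_nearest):
    """Crude adaptation: rescale the nearest task's complete curve so that it
    matches the last observed point of the new partial curve."""
    full = np.asarray(full_curve_nearest, dtype=float)
    r = len(partial_curve)
    scale = partial_curve[-1] / full[r - 1] if full[r - 1] else 1.0
    return float(full[-1] * scale)
```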

Interestingly, while several methods aim to predict learning curves during neural architecture search (see Chap. 3), as of yet none of this work leverages learning curves previously observed on other tasks.

2.3 Learning from Task Properties

Another rich source of meta-data is the set of characterizations (meta-features) of the task at hand. Each task t j ∈ T is described with a vector m(t j) = (m j,1, …, m j,K) of K meta-features m j,k ∈ M, the set of all known meta-features. These meta-features can be used to define a task similarity measure based on, for instance, the Euclidean distance between m(t i) and m(t j), so that we can transfer information from the most similar tasks to the new task t new. Moreover, together with prior evaluations P, we can train a meta-learner L to predict the performance P i,new of configurations θ i on a new task t new.

2.3.1 Meta-Features

Table 2.1 provides a concise overview of the most commonly used meta-features, together with a short rationale for why they are indicative of model performance. Where possible, we also show the formulas to compute them. More complete surveys can be found in the literature [26, 98, 130, 138, 175].

Table 2.1 Overview of commonly used meta-features. Groups from top to bottom: simple, statistical, information-theoretic, complexity, model-based, and landmarkers. Continuous features X and target Y have mean \(\mu_{X}\), standard deviation \(\sigma_{X}\), and variance \(\sigma^{2}_{X}\). Categorical features X and class C have categorical values \(\pi_{i}\), conditional probabilities \(\pi_{i|j}\), joint probabilities \(\pi_{i,j}\), marginal probabilities \(\pi_{i+} = \sum_{j} \pi_{ij}\), and entropy \(H(X) = -\sum_{i} \pi_{i+} \log_{2}(\pi_{i+})\).

To build a meta-feature vector m(t j), one needs to select and further process these meta-features. Studies on OpenML meta-data have shown that the optimal set of meta-features depends on the application [17]. Many meta-features are computed on single features, or combinations of features, and need to be aggregated by summary statistics (min, max, μ, σ, the quartiles \(q_{1 \ldots 4}\), …) or histograms [72]. One needs to systematically extract and aggregate them [117]. When computing task similarity, it is also important to normalize all meta-features [9], perform feature selection [172], or employ dimensionality reduction techniques (e.g. PCA) [17]. When learning meta-models, one can also use relational meta-learners [173] or case-based reasoning methods [63, 71, 92].
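
As a minimal example, the sketch below computes a handful of simple and statistical meta-features for a tabular task and retrieves the most similar prior tasks after normalization; the chosen meta-features are purely illustrative, not a recommended set.

```python
import numpy as np

def simple_meta_features(X, y):
    """A few illustrative simple and statistical meta-features of a task (X, y)."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    return np.array([
        np.log(n),                                      # log number of instances
        np.log(p),                                      # log number of features
        len(np.unique(y)),                              # number of classes
        np.mean(np.std(X, axis=0)),                     # mean feature standard deviation
        np.mean(np.abs(corr[~np.eye(p, dtype=bool)])),  # mean absolute feature correlation
    ])

def most_similar_tasks(m_new, M_prior, k=3):
    """Indices of the k prior tasks closest to t_new, after normalizing every
    meta-feature to zero mean and unit variance."""
    M = np.vstack([M_prior, m_new])
    M = (M - M.mean(axis=0)) / (M.std(axis=0) + 1e-12)
    distances = np.linalg.norm(M[:-1] - M[-1], axis=1)
    return np.argsort(distances)[:k]
```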

Beyond these general-purpose meta-features, many more specific ones were formulated. For streaming data one can use streaming landmarks [135, 137], for time series data one can compute autocorrelation coefficients or the slope of regression models [7, 121, 147], and for unsupervised problems one can cluster the data in different ways and extract properties of these clusters [159]. In many applications, domain-specific information can be leveraged as well [109, 156].

2.3.2 Learning Meta-Features

Instead of manually defining meta-features, we can also learn a joint representation for groups of tasks. One approach is to build meta-models that generate a landmark-like meta-feature representation M′ given other task meta-features M and trained on performance meta-data P, or f : M↦M′. Sun and Pfahringer [165] do this by evaluating a predefined set of configurations θ i on all prior tasks t j, and generating a binary metafeature m j,a,b ∈ M′ for every pairwise combination of configurations θ a and θ b, indicating whether θ a outperformed θ b or not, thus m′(t j) = (m j,a,b, m j,a,c, m j,b,c, …). To compute m new,a,b, meta-rules are learned for every pairwise combination (a,b), each predicting whether θ a will outperform θ b on task t j, given its other meta-features m(t j).

We can also learn a joint representation based entirely on the available P meta-data, i.e. f :P × Θ↦M′. We previously discussed how to do this with feed-forward neural nets [114] in Sect. 2.2.3. If the tasks share the same input space, e.g., they are images of the same resolution, one can also use deep metric learning to learn a meta-feature representation, for instance, using Siamese networks [75]. These are trained by feeding the data of two different tasks to two twin networks, and using the differences between the predicted and observed performance P i,new as the error signal. Since the model parameters between both networks are tied in a Siamese network, two very similar tasks are mapped to the same regions in the latent meta-feature space. They can be used for warm starting Bayesian hyperparameter optimization [75] and neural architecture search [2].

2.3.3 Warm-Starting Optimization from Similar Tasks

Meta-features are a very natural way to estimate task similarity and initialize optimization procedures based on promising configurations on similar tasks. This is akin to how human experts start a manual search for good models, given experience on related tasks.

First, starting a population-based search algorithm in regions of the search space with promising solutions can significantly speed up convergence to a good solution. Gomes et al. [59] recommend initial configurations by finding the k most similar prior tasks t j based on the L1 distance between vectors m(t j) and m(t new), where each m(t j) includes 17 simple and statistical meta-features. For each of the k most similar tasks, the best configuration is evaluated on t new, and used to initialize Particle Swarm Optimization, as well as Tabu Search. Reif et al. [129] follow a very similar approach, using 15 simple, statistical, and landmarking meta-features. They use a forward selection technique to find the most useful meta-features, and warm-start a standard genetic algorithm (GAlib) with a modified Gaussian mutation operation. Variants of active testing (see Sect. 2.2.3) that use meta-features were also tried [85, 100], but did not perform better than the approaches based on relative landmarks.
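
A sketch of such meta-feature-based warm-starting is given below, assuming configurations are encoded as numeric vectors and that the best known configuration per prior task is available; the Gaussian perturbation used to fill the rest of the population is an arbitrary choice for illustration.

```python
import numpy as np

def warm_start_population(m_new, M_prior, best_configs, k=3, pop_size=20, seed=0):
    """Seed a population-based search with the best known configurations of the
    k prior tasks closest to t_new (L1 distance between meta-feature vectors).
    The remaining individuals are random perturbations of the selected seeds."""
    rng = np.random.default_rng(seed)
    distances = np.abs(M_prior - m_new).sum(axis=1)      # L1 distance to t_new
    seeds = [np.asarray(best_configs[j], dtype=float)
             for j in np.argsort(distances)[:k]]
    population = list(seeds)
    while len(population) < pop_size:
        parent = seeds[rng.integers(len(seeds))]
        population.append(parent + rng.normal(0.0, 0.1, size=parent.shape))
    return population
```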

Model-based optimization approaches can also benefit greatly from an initial set of promising configurations. SCoT [9] trains a single surrogate ranking model f : M × Θ→ R, predicting the rank of θ i on task t j. M contains 4 meta-features (3 simple ones and one based on PCA). The surrogate model is trained on all the rankings, including those on t new. Ranking is used because the scale of evaluation values can differ greatly between tasks. A GP regression converts the ranks to probabilities to do Bayesian optimization, and each new P i,new is used to retrain the surrogate model after every step.

Schilling et al. [148] use a modified multilayer perceptron as a surrogate model, of the form s j(θ i, m(t j), b(t j)) = P i,j, where m(t j) are the meta-features and b(t j) is a vector of binary indicators that are 1 if the meta-instance is from t j and 0 otherwise. The multi-layer perceptron uses a modified activation function based on factorization machines [132] in the first layer, aimed at learning a latent representation for each task in order to model task similarities. Since this model cannot represent uncertainties, an ensemble of 100 multilayer perceptrons is trained to obtain predictive means and approximate variances.

Training a single surrogate model on all prior meta-data is often less scalable. Yogatama and Mann [190] also build a single Bayesian surrogate model, but only include tasks similar to t new, where task similarity is defined as the Euclidean distance between meta-feature vectors consisting of 3 simple meta-features. The P i,j values are standardized to overcome the problem of different scales for each t j. The surrogate model is a Gaussian process with a specific kernel combination, trained on all these instances.

Feurer et al. [48] offer a simpler, more scalable method that warm-starts Bayesian optimization by sorting all prior tasks t j by similarity to t new, as in [59], but including 46 simple, statistical, and landmarking meta-features, as well as H(C). The t best configurations on the d most similar tasks are used to warm-start the surrogate model. They search over many more hyperparameters than earlier work, including preprocessing steps. This warm-starting approach was also used in later work [46], which is discussed in detail in Chap. 6.

Finally, one can also use collaborative filtering to recommend promising configurations [162]. By analogy, the tasks t j (users) provide ratings (P i,j) for the configurations θ i (items), and matrix factorization techniques are used to predict unknown P i,j values and recommend the best configurations for any task. An important issue here is the cold start problem, since the matrix factorization requires at least some evaluations on t new. Yang et al. [189] use a D-optimal experiment design to sample an initial set of evaluations P i,new. They predict both the predictive performance and runtime, to recommend a set of warm-start configurations that are both accurate and fast. Misir and Sebag [102, 103] leverage meta-features to solve the cold start problem. Fusi et al. [54] also use meta-features, following the same procedure as [46], and use a probabilistic matrix factorization approach that allows them to perform Bayesian optimization to further optimize their pipeline configurations θ i. This approach yields useful latent embeddings of both the tasks and configurations, in which Bayesian optimization can be performed more efficiently.
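
The collaborative-filtering view can be illustrated with a basic factorization of a sparse task × configuration performance matrix; note that [54] uses probabilistic matrix factorization combined with Bayesian optimization, so the plain SGD factorization below is only a sketch of the underlying idea.

```python
import numpy as np

def factorize_performance_matrix(P, rank=5, steps=20000, lr=0.01, reg=0.05, seed=0):
    """Factorize a sparse task x configuration performance matrix P (np.nan marks
    unobserved entries) so that missing performances can be predicted and the
    best-predicted configurations recommended per task."""
    rng = np.random.default_rng(seed)
    n_tasks, n_configs = P.shape
    U = rng.normal(scale=0.1, size=(n_tasks, rank))     # latent task embeddings
    V = rng.normal(scale=0.1, size=(n_configs, rank))   # latent configuration embeddings
    observed = np.argwhere(~np.isnan(P))
    for _ in range(steps):
        i, j = observed[rng.integers(len(observed))]
        u_i = U[i].copy()
        err = P[i, j] - u_i @ V[j]
        U[i] += lr * (err * V[j] - reg * u_i)
        V[j] += lr * (err * u_i - reg * V[j])
    return U @ V.T   # predicted performance for every (task, configuration) pair
```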

2.3.4 Meta-Models

We can also learn the complex relationship between a task’s meta-features and the utility of specific configurations by building a meta-model L that recommends the most useful configurations \(\Theta ^{*}_{new}\) given the meta-features M of the new task t new. There exists a rich body of earlier work [22, 56, 87, 94] on building meta-models for algorithm selection [15, 19, 70, 115] and hyperparameter recommendation [4, 79, 108, 158]. Experiments showed that boosted and bagged trees often yielded the best predictions, although much depends on the exact meta-features used [72, 76].

2.3.4.1 Ranking

Meta-models can also generate a ranking of the top-K most promising configurations. One approach is to build a k-nearest neighbor (kNN) meta-model to predict which tasks are similar, and then rank the best configurations on these similar tasks [23, 147]. This is similar to the work discussed in Sect. 2.3.3, but without ties to a follow-up optimization approach. Meta-models specifically meant for ranking, such as predictive clustering trees [171] and label ranking trees [29], were also shown to work well. Approximate Ranking Tree Forests (ART Forests) [165], ensembles of fast ranking trees, prove to be especially effective, since they have ‘built-in’ meta-feature selection, work well even if few prior tasks are available, and the ensembling makes the method more robust. autoBagging [116] ranks Bagging workflows including four different Bagging hyperparameters, using an XGBoost-based ranker trained on 140 OpenML datasets and 146 meta-features. Lorena et al. [93] recommend SVM configurations for regression problems using a kNN meta-model and a new set of meta-features based on data complexity.

2.3.4.2 Performance Prediction

Meta-models can also directly predict the performance, e.g. accuracy or training time, of a configuration on a given task, given its meta-features. This allows us to estimate whether a configuration will be interesting enough to evaluate in any optimization procedure. Early work used linear regression or rule-based regressors to predict the performance of a discrete set of configurations and then rank them accordingly [14, 77]. Guerra et al. [61] train an SVM meta-regressor per classification algorithm to predict its accuracy, under default settings, on a new task t new given its meta-features. Reif et al. [130] train a similar meta-regressor on more meta-data to predict its optimized performance. Davis et al. [32] use a MultiLayer Perceptron based meta-learner instead, predicting the performance of a specific algorithm configuration.
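
A minimal sketch of such a performance-predicting meta-regressor, using a random forest on concatenated meta-feature and configuration vectors (all variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_meta_regressor(meta_features, configs, performances):
    """Fit a meta-regressor on rows [m(t_j), theta_i] -> P_ij.
    meta_features and configs are aligned 2-D arrays, one row per prior evaluation."""
    X_meta = np.hstack([meta_features, configs])
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_meta, performances)
    return model

def predict_on_new_task(model, m_new, candidate_configs):
    """Predicted performance of each candidate configuration on t_new."""
    X = np.hstack([np.tile(m_new, (len(candidate_configs), 1)),
                   np.asarray(candidate_configs)])
    return model.predict(X)
```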

Instead of predictive performance, a meta-regressor can also be trained to predict algorithm training/prediction time, for instance, using an SVM regressor trained on meta-features [128], itself tuned via genetic algorithms [119]. Yang et al. [189] predict configuration runtime using polynomial regression, based only on the number of instances and features. Hutter et al. [68] provide a general treatise on predicting algorithm runtime in various domains.

Most of these meta-models generate promising configurations, but don’t actually tune these configurations to t new themselves. Instead, the predictions can be used to warm-start or guide any other optimization technique, which allows for all kinds of combinations of meta-models and optimization techniques. Indeed, some of the work discussed in Sect. 2.3.3 can be seen as using a distance-based meta-model to warm-start Bayesian optimization [48, 54] or evolutionary algorithms [59, 129]. In principle, other meta-models could be used here as well.

Instead of learning the relationship between a task’s meta-features and configuration performance, one can also build surrogate models predicting the performance of configurations on specific tasks [40]. One can then learn how to combine these per-task predictions to warm-start or guide optimization techniques on a new task t new [45, 114, 161, 187], as discussed in Sect. 2.2.3. While meta-features could also be used to combine per-task predictions based on task similarity, it is ultimately more effective to gather new observations P i,new, since these allow us to refine the task similarity estimates with every new observation [47, 85, 187].

2.3.5 Pipeline Synthesis

When creating entire machine learning pipelines [153], the number of configuration options grows dramatically, making it even more important to leverage prior experience. One can control the search space by imposing a fixed structure on the pipeline, fully described by a set of hyperparameters. One can then use the most promising pipelines on similar tasks to warm-start a Bayesian optimization [46, 54].

Other approaches give recommendations for certain pipeline steps [118, 163], and can be leveraged in larger pipeline construction approaches, such as planning [55, 74, 105, 184] or evolutionary techniques [110, 164]. Nguyen et al. [105] construct new pipelines using a beam search focused on components recommended by a meta-learner, which is itself trained on examples of successful prior pipelines. Bilalli et al. [18] predict which pre-processing techniques are recommended for a given classification algorithm. They build a meta-model per target classification algorithm that, given the t new meta-features, predicts which preprocessing technique should be included in the pipeline. Similarly, Schoenfeld et al. [152] build meta-models predicting when a preprocessing algorithm will improve a particular classifier’s accuracy or runtime.

AlphaD3M [38] uses a self-play reinforcement learning approach in which the current state is represented by the current pipeline, and actions include the addition, deletion, or replacement of pipeline components. A Monte Carlo Tree Search (MCTS) generates pipelines, which are evaluated to train a recurrent neural network (LSTM) that can predict pipeline performance, in turn producing the action probabilities for the MCTS in the next round. The state description also includes meta-features of the current task, allowing the neural network to learn across tasks. Mosaic [123] also generates pipelines using MCTS, but instead uses a bandits-based approach to select promising pipelines.

2.3.6 To Tune or Not to Tune?

To reduce the number of configuration parameters to be optimized, and to save valuable optimization time in time-constrained settings, meta-models have also been proposed to predict whether or not it is worth tuning a given algorithm, given the meta-features of the task at hand [133], and how much improvement we can expect from tuning a specific algorithm versus the additional time investment [144]. More focused studies on specific learning algorithms yielded meta-models predicting when it is necessary to tune SVMs [96], what good default hyperparameters are for SVMs given the task (including interpretable meta-models) [97], and how to tune decision trees [95].

2.4 Learning from Prior Models

The final type of meta-data we can learn from are prior machine learning models themselves, i.e., their structure and learned model parameters. In short, we want to train a meta-learner L that learns how to train a (base-) learner l new for a new task t new, given similar tasks t j ∈ T and the corresponding optimized models \(l_{j} \in \mathcal {L}\), where \(\mathcal {L}\) is the space of all possible models. The learner l j is typically defined by its model parameters W = {w k}, k = 1…K and/or its configuration θ i ∈ Θ.

2.4.1 Transfer Learning

In transfer learning [170], we take models trained on one or more source tasks t j, and use them as starting points for creating a model on a similar target task t new. This can be done by forcing the target model to be structurally or otherwise similar to the source model(s). This is a generally applicable idea, and transfer learning approaches have been proposed for kernel methods [41, 42], parametric Bayesian models [8, 122, 140], Bayesian networks [107], clustering [168] and reinforcement learning [36, 62]. Neural networks, however, are exceptionally suitable for transfer learning because both the structure and the model parameters of the source models can be used as a good initialization for the target model, yielding a pre-trained model which can then be further fine-tuned using the available training data on t new [11, 13, 24, 169]. In some cases, the source network may need to be modified before transferring it [155]. We will focus on neural networks in the remainder of this section.

Especially large image datasets, such as ImageNet [78], have been shown to yield pre-trained models that transfer exceptionally well to other tasks [37, 154]. However, it has also been shown that this approach doesn’t work well when the target task is not so similar [191]. Rather than hoping that a pre-trained model ‘accidentally’ transfers well to a new problem, we can purposefully imbue meta-learners with an inductive bias (learned from many similar tasks) that allows them to learn new tasks much faster, as we will discuss below.

2.4.2 Meta-Learning in Neural Networks

An early meta-learning approach is to create recurrent neural networks (RNNs) able to modify their own weights [149, 150]. During training, they use their own weights as additional input data and observe their own errors to learn how to modify these weights in response to the new task at hand. The updating of the weights is defined in a parametric form that is differentiable end-to-end and can jointly optimize both the network and training algorithm using gradient descent, yet is also very difficult to train. Later work used reinforcement learning across tasks to adapt the search strategy [151] or the learning rate for gradient descent [31] to the task at hand.

Inspired by the observation that backpropagation is an unlikely learning mechanism for biological brains, Bengio et al. [12] replace backpropagation with simple biologically-inspired parametric rules (or evolved rules [27]) to update the synaptic weights. The parameters are optimized, e.g. using gradient descent or evolution, across a set of input tasks. Runarsson and Jonsson [142] replaced these parametric rules with a single-layer neural network. Santoro et al. [146] instead use a memory-augmented neural network to learn how to store and retrieve ‘memories’ of prior classification tasks. Hochreiter et al. [65] use LSTMs [66] as a meta-learner to train multi-layer perceptrons.

Andrychowicz et al. [6] also replace the optimizer, e.g. stochastic gradient descent, with an LSTM trained on multiple prior tasks. The loss of the meta-learner (optimizer) is defined as the sum of the losses of the base-learners (optimizees), and optimized using gradient descent. At every step, the meta-learner chooses the weight update estimated to reduce the optimizee’s loss the most, based on the learned model weights {w k} of the previous step as well as the current performance gradient. Later work generalizes this approach by training an optimizer on synthetic functions, using gradient descent [28]. This allows meta-learners to optimize optimizees even if these do not have access to gradients.

In parallel, Li and Malik [89] proposed a framework for learning optimization algorithms from a reinforcement learning perspective. It represents any particular optimization algorithm as a policy, and then learns this policy via guided policy search. Follow-up work [90] shows how to leverage this approach to learn optimization algorithms for (shallow) neural networks.

The field of neural architecture search includes many other methods that build a model of neural network performance for a specific task, for instance using Bayesian optimization or reinforcement learning. See Chap. 3 for an in-depth discussion. However, most of these methods do not (yet) generalize across tasks and are therefore not discussed here.

2.4.3 Few-Shot Learning

A particularly challenging meta-learning problem is to train an accurate deep learning model using only a few training examples, given prior experience with very similar tasks for which we have large training sets available. This is called few-shot learning. Humans have an innate ability to do this, and we wish to build machine learning agents that can do the same [82]. A particular example of this is ‘K-shot N-way’ classification, in which we are given many examples (e.g., images) of certain classes (e.g., objects), and want to learn a classifier l new able to classify N new classes using only K examples of each.

Using prior experience, we can, for instance, learn a common feature representation for all tasks, start training l new from a better model parameter initialization W init, and acquire an inductive bias that helps guide the optimization of the model parameters, so that l new can be trained much faster than otherwise possible.

Earlier work on one-shot learning is largely based on hand-engineered features [10, 43, 44, 50]. With meta-learning, however, we hope to learn a common feature representation for all tasks in an end-to-end fashion.

Vinyals et al. [181] state that, to learn from very little data, one should look to non-parametric models (such as k-nearest neighbors), which use a memory component rather than learning many model parameters. Their meta-learner is a Matching Network that applies the idea of a memory component in a neural net. It learns a common representation for the labelled examples, and matches each new test instance to the memorized examples using cosine similarity. The network is trained on minibatches with only a few examples of a specific task each.

Snell et al. [157] propose Prototypical Networks, which map examples to a p-dimensional vector space such that examples of a given output class are close together. The network then calculates a prototype (mean vector) for every class. New test instances are mapped to the same vector space and a distance metric is used to create a softmax over all possible classes. Ren et al. [131] extend this approach to semi-supervised learning.
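
The episodic classification step of Prototypical Networks reduces to computing class means and a softmax over (negative) distances, sketched below under the assumption that an embedding network has already mapped the support and query examples to vectors.

```python
import numpy as np

def class_prototypes(support_embeddings, support_labels):
    """Prototype = mean embedding of the support examples of each class."""
    classes = np.unique(support_labels)
    protos = np.stack([support_embeddings[support_labels == c].mean(axis=0)
                       for c in classes])
    return classes, protos

def classify_query(query_embedding, classes, protos):
    """Softmax over negative squared distances to the class prototypes."""
    logits = -np.sum((protos - query_embedding) ** 2, axis=1)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return classes, probs
```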

Ravi and Larochelle [126] use an LSTM-based meta-learner to learn an update rule for training a neural network learner. With every new example, the learner returns the current gradient and loss to the LSTM meta-learner, which then updates the model parameters {w k} of the learner. The meta-learner is trained across all prior tasks.

Model-Agnostic Meta-Learning (MAML) [51], on the other hand, does not try to learn an update rule, but instead learns a model parameter initialization W init that generalizes better to similar tasks. Starting from a random {w k}, it iteratively selects a batch of prior tasks, and for each it trains the learner on K examples to compute the gradient and loss (on a test set). It then backpropagates the meta-gradient to update the weights {w k} in the direction in which they would have been easier to update. In other words, after each iteration, the weights {w k} become a better W init to start finetuning any of the tasks. Finn and Levine [52] also argue that MAML is able to approximate any learning algorithm when using a sufficiently deep fully connected ReLU network and certain losses. They also conclude that the MAML initializations are more resilient to overfitting on small samples, and generalize more widely than meta-learning approaches based on LSTMs.

REPTILE [106] is an approximation of MAML that executes stochastic gradient descent for K iterations on a given task, and then gradually moves the initialization weights in the direction of the weights obtained after the K iterations. The intuition is that every task likely has more than one set of optimal weights \(\{w^{*}_{i}\}\), and the goal is to find a W init that is close to at least one of those \(\{w^{*}_{i}\}\) for every task.
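
The Reptile update itself is simple enough to sketch directly; the example below uses toy linear-regression tasks with an explicit gradient in place of a deep network, so it only illustrates the outer-loop logic of moving W init towards the task-adapted weights.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of the mean squared error of a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def reptile(tasks, dim, inner_steps=5, inner_lr=0.01, meta_lr=0.1,
            meta_iters=1000, seed=0):
    """Outer loop of Reptile on toy linear-regression tasks, each given as (X, y):
    adapt a copy of the initialization with a few gradient steps on one sampled
    task, then move the initialization towards the adapted weights."""
    rng = np.random.default_rng(seed)
    w_init = np.zeros(dim)
    for _ in range(meta_iters):
        X, y = tasks[rng.integers(len(tasks))]
        w = w_init.copy()
        for _ in range(inner_steps):
            w -= inner_lr * mse_grad(w, X, y)   # inner-loop adaptation
        w_init += meta_lr * (w - w_init)        # move the initialization
    return w_init
```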

Finally, we can also derive a meta-learner from a black-box neural network. Santoro et al. [145] propose Memory-Augmented Neural Networks (MANNs), which train a Neural Turing Machine (NTM) [60], a neural network with augmented memory capabilities, as a meta-learner. This meta-learner can then memorize information about previous tasks and leverage that to learn a learner l new. SNAIL [101] is a generic meta-learner architecture consisting of interleaved temporal convolution and causal attention layers. The convolutional networks learn a common feature vector for the training instances (images) to aggregate information from past experiences. The causal attention layers learn which pieces of information to pick out from the gathered experience to generalize to new tasks.

Overall, the intersection of deep learning and meta-learning proves to be particularly fertile ground for groundbreaking new ideas, and we expect this field to become more important over time.

2.4.4 Beyond Supervised Learning

Meta-learning is certainly not limited to (semi-)supervised tasks, and has been successfully applied to solve tasks as varied as reinforcement learning, active learning, density estimation and item recommendation. The base-learner may be unsupervised while the meta-learner is supervised, but other combinations are certainly possible as well.

Duan et al. [39] propose an end-to-end reinforcement learning (RL) approach consisting of a task-specific fast RL algorithm which is guided by a general-purpose slow meta-RL algorithm. The tasks are interrelated Markov Decision Processes (MDPs). The meta-RL algorithm is modeled as an RNN, which receives the observations, actions, rewards and termination flags. The activations of the RNN store the state of the fast RL learner, and the RNN’s weights are learned by observing the performance of fast learners across tasks.

In parallel, Wang et al. [182] also proposed to use a deep RL algorithm to train an RNN, receiving the actions and rewards of the previous interval in order to learn a base-level RL algorithm for specific tasks. Rather than using relatively unstructured tasks such as random MDPs, they focus on structured task distributions (e.g., dependent bandits) in which the meta-RL algorithm can exploit the inherent task structure.

Pang et al. [112] offer a meta-learning approach to active learning (AL). The base-learner can be any binary classifier, and the meta-learner is a deep RL network consisting of a deep neural network that learns a representation of the AL problem across tasks, and a policy network that learns the optimal policy, parameterized as weights in the network. The meta-learner receives the current state (the unlabeled point set and base classifier state) and reward (the performance of the base classifier), and emits a query probability, i.e. which points in the unlabeled set to query next.

Reed et al. [127] propose a few-shot approach for density estimation (DE). The goal is to learn a probability distribution over a small number of images of a certain concept (e.g., a handwritten letter) that can be used to generate images of that concept, or compute the probability that an image shows that concept. The approach uses autoregressive image models which factorize the joint distribution into per-pixel factors. Usually these are conditioned on (many) examples of the target concept. Instead, a MAML-based few-shot learner is used, trained on examples of many other (similar) concepts.

Finally, Vartak et al. [178] address the cold-start problem in matrix factorization. They propose a deep neural network architecture that learns a (base) neural network whose biases are adjusted based on task information. While the structure and weights of the neural net recommenders remain fixed, the meta-learner learns how to adjust the biases based on each user’s item history.

All these recent new developments illustrate that it is often fruitful to look at problems through a meta-learning lens and find new, data-driven approaches to replace hand-engineered base-learners.

2.5 Conclusion

Meta-learning opportunities present themselves in many different ways, and can be embraced using a wide spectrum of learning techniques. Every time we try to learn a certain task, whether successful or not, we gain useful experience that we can leverage to learn new tasks. We should never have to start entirely from scratch. Instead, we should systematically collect our ‘learning experiences’ and learn from them to build AutoML systems that continuously improve over time, helping us tackle new learning problems ever more efficiently. The more new tasks we encounter, and the more similar those new tasks are, the more we can tap into prior experience, to the point that most of the required learning has already been done beforehand. The ability of computer systems to store virtually infinite amounts of prior learning experiences (in the form of meta-data) opens up a wide range of opportunities to use that experience in completely new ways, and we are only starting to learn how to learn from prior experience effectively. Yet, this is a worthy goal: learning how to learn any task empowers us far beyond knowing how to learn any specific task.