Stateless Neural Meta-Learning using Second-Order Gradients

Deep learning typically requires large data sets and much compute power for each new problem that is learned. Meta-learning can be used to learn a good prior that facilitates quick learning, thereby relaxing these requirements so that new tasks can be learned quicker; two popular approaches are MAML and the meta-learner LSTM. In this work, we compare the two and formally show that the meta-learner LSTM subsumes MAML. Combining this insight with recent empirical findings, we construct a new algorithm (dubbed TURTLE) which is simpler than the meta-learner LSTM yet more expressive than MAML. TURTLE outperforms both techniques at few-shot sine wave regression and image classification on miniImageNet and CUB without any additional hyperparameter tuning, at a computational cost that is comparable with second-order MAML. The key to TURTLE's success lies in the use of second-order gradients, which also significantly increases the performance of the meta-learner LSTM by 1-6% accuracy.


Introduction
Humans learn new tasks quickly. While deep neural networks have demonstrated human or even super-human performance on various tasks such as image recognition (Krizhevsky et al., 2012; He et al., 2015) and game-playing (Mnih et al., 2015; Silver et al., 2016), learning a new task is generally slow and requires large amounts of data (LeCun et al., 2015). This limits their applicability in real-world domains where little data and limited computational resources are available.
Meta-learning (Schmidhuber, 1987; Schaul & Schmidhuber, 2010) is one approach to address this issue. The idea is to learn at two different levels of abstraction: at the outer-level (across tasks), we learn a prior that facilitates faster learning at the inner-level (single task) (Vilalta & Drissi, 2002; Vanschoren, 2018; Hospedales et al., 2020; Huisman et al., 2021). The prior that we learn at the outer-level can take on many different forms, such as the learning rule (Andrychowicz et al., 2016; Ravi & Larochelle, 2017) and the weight initialization (Nichol et al., 2018; Finn et al., 2017).
MAML (Finn et al., 2017) and the meta-learner LSTM (Ravi & Larochelle, 2017) are two well-known techniques that focus on these two types of priors. More specifically, MAML aims to learn a good weight initialization from which it can learn new tasks quickly using regular gradient descent. In addition to learning a good weight initialization, the meta-learner LSTM (Ravi & Larochelle, 2017) attempts to learn the optimization procedure in the form of a separate LSTM network. The meta-learner LSTM is more general than MAML in the sense that the LSTM can learn to perform gradient descent (see Section 4) or something better. This suggests that the performance of MAML can be mimicked by the meta-learner LSTM on few-shot image classification. However, our experimental results and those by Finn et al. (2017) show that this is not necessarily the case. The meta-learner LSTM fails to find a solution in the meta-landscape that learns as well as gradient descent.
To improve upon the meta-learner LSTM, we introduce TURTLE, which uses a fully-connected feed-forward network as an optimizer, a meta-network that is simpler than an LSTM. And, because it uses a meta-network, TURTLE is more expressive than MAML, as the meta-network can learn to perform gradient descent. We empirically demonstrate that TURTLE outperforms both of these techniques at few-shot sine wave regression and, without additional hyperparameter tuning, exceeds their performance in various settings involving the commonly used miniImageNet (Vinyals et al., 2016) and CUB (Wah et al., 2011) benchmarks. Our contributions are:
• We formally show that the meta-learner LSTM subsumes MAML.
• We formulate a new meta-learning algorithm called TURTLE which uses a simpler meta-network than the meta-learner LSTM and is more expressive than MAML.
• We demonstrate that TURTLE outperforms both MAML and the meta-learner LSTM on sine wave regression, and various settings involving miniImageNet and CUB by at least 1% accuracy without any additional hyperparameter tuning. TURTLE requires roughly the same amount of computation time as second-order MAML.
• Based on the results of TURTLE, we enhance the meta-learner LSTM by using raw gradients as meta-learner input and second-order information and show these changes result in a performance boost of 1-6% accuracy.

Related work
The success of deep learning techniques has been largely limited to domains where abundant data and large compute resources are available (LeCun et al., 2015). The reason for this is that learning a new task requires large amounts of resources. Meta-learning is an approach that holds the promise of relaxing these requirements by learning to learn. The field has attracted much attention in recent years.
One popular technique is MAML (Finn et al., 2017), which aims to find a good weight initialization from which new tasks can be learned quickly within several gradient update steps. Many works build upon the key idea of MAML, for example, to decrease the computational costs (Nichol et al., 2018; Rajeswaran et al., 2019), increase the applicability to online and active learning settings (Grant et al., 2018; Finn et al., 2018), or increase the expressivity of the algorithm (Li et al., 2017; Park & Oliva, 2019; Lee & Choi, 2018). Despite its popularity, MAML no longer yields state-of-the-art performance on few-shot learning benchmarks (Lu et al., 2020), as it is surpassed by, for example, latent embedding optimization (LEO) (Rusu et al., 2019), which optimizes the initial weights in a lower-dimensional latent space, and MetaOptNet (Lee et al., 2019), which stacks a convex model on top of the meta-learned initialization of a high-dimensional feature extractor. However, although these approaches achieve state-of-the-art performance on few-shot benchmarks, MAML is more elegant and more generally applicable as it can also be used in reinforcement learning settings (Finn et al., 2017).
While the meta-learner LSTM (Ravi & Larochelle, 2017) learns both an initialization and an optimization procedure, it is generally hard to properly train the optimizer (Metz et al., 2019). As a result, techniques that use hand-crafted learning rules instead of trainable optimizers may yield better performance. It is perhaps for this reason that most meta-learning algorithms use simple, hand-crafted optimization procedures to learn new tasks, such as regular gradient descent (Bottou, 2004), Adam (Kingma & Ba, 2015), or RMSprop (Tieleman & Hinton, 2017). Andrychowicz et al. (2016), however, show that learned optimizers may learn faster and yield better performance than gradient descent.
The goal of our work is to show that, despite the practical difficulties of the meta-learner LSTM, MAML can be outperformed by learning a separate meta-network. Our technique, dubbed TURTLE, replaces the LSTM module from the meta-learner LSTM with a feed-forward neural network.
Note that Metz et al. (2019) also used a regular feed-forward network as an optimizer. However, they were mainly concerned with understanding and correcting the difficulties that arise from training an optimizer and do not learn a weight initialization for the base-learner network as we do. Baik et al. (2020) also use a feed-forward network on top of MAML, but its goal is to generate per-step learning rates and weight decay coefficients. The feed-forward network in TURTLE, in contrast, generates direct weight updates.

Preliminaries
In this section, we explain the notation and the concepts of the works that we build upon.

Few-shot learning
In the context of supervised learning, the few-shot setup is commonly used as a testbed for meta-learning algorithms (Vinyals et al., 2016; Finn et al., 2017; Nichol et al., 2018; Ravi & Larochelle, 2017). One reason for this is the fact that tasks $\mathcal{T}_j$ are small, which makes learning a prior across tasks not overly expensive.
Every task $\mathcal{T}_j$ consists of a support (training) set $D^{tr}_{\mathcal{T}_j}$ and query (test) set $D^{te}_{\mathcal{T}_j}$ (Vinyals et al., 2016; Lu et al., 2020; Ravi & Larochelle, 2017). When a model is presented with a new task, it tries to learn the associated concepts from the support set. The success of this learning process is then evaluated on the query set. Naturally, this means that the query set contains concepts that were present in the support set.
In classification settings, a commonly used instantiation of the few-shot setup is called $N$-way $k$-shot learning (Finn et al., 2017; Vinyals et al., 2016). Here, given a task $\mathcal{T}_j$, every support set contains $k$ examples for each of the $N$ distinct classes. Moreover, every example in the query set must belong to one of these $N$ classes.
Suppose we have a dataset $D$ from which we can extract $J$ tasks. For meta-learning purposes, we split these tasks into three non-overlapping partitions: (i) meta-training, (ii) meta-validation, and (iii) meta-test tasks (Ravi & Larochelle, 2017; Sun et al., 2019). These partitions are used for training the meta-learning algorithm, hyperparameter tuning, and evaluation, respectively. Note that non-overlapping means that every partition is assigned some class labels which are unique to that partition.
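The N-way k-shot protocol and the class-disjoint partitions can be illustrated with a minimal sketch. The helper names `split_classes` and `sample_task`, the toy dataset, and the partition sizes are hypothetical and chosen only for illustration; they are not taken from the benchmarks used in this work.

```python
import random
from collections import defaultdict


def split_classes(labels, n_train, n_val, seed=0):
    """Partition the class labels into disjoint meta-train/val/test sets."""
    classes = sorted(set(labels))
    random.Random(seed).shuffle(classes)
    return (classes[:n_train],
            classes[n_train:n_train + n_val],
            classes[n_train + n_val:])


def sample_task(examples, labels, classes, n_way, k_shot, n_query, rng):
    """Sample one N-way k-shot task as a (support set, query set) pair."""
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    support, query = [], []
    # Classes are relabeled 0..N-1 within the task.
    for task_label, cls in enumerate(rng.sample(classes, n_way)):
        chosen = rng.sample(by_class[cls], k_shot + n_query)
        support += [(x, task_label) for x in chosen[:k_shot]]
        query += [(x, task_label) for x in chosen[k_shot:]]
    return support, query


# Toy dataset (assumption): 10 classes with 30 examples each.
labels = [c for c in range(10) for _ in range(30)]
examples = [f"img_{i}" for i in range(len(labels))]

train_cls, val_cls, test_cls = split_classes(labels, n_train=6, n_val=2)
support, query = sample_task(examples, labels, train_cls,
                             n_way=5, k_shot=1, n_query=16,
                             rng=random.Random(1))
```

Because the query examples are drawn from the same N classes as the support examples, every query label is guaranteed to have been seen in the support set.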

MAML
As mentioned before, MAML (Finn et al., 2017) attempts to learn a set of initial neural network parameters $\theta$ from which we can quickly learn new tasks within $T$ steps of gradient descent, for a small value of $T$. Thus, given a task $\mathcal{T}_j = (D^{tr}_{\mathcal{T}_j}, D^{te}_{\mathcal{T}_j})$, MAML will produce a sequence of weights $(\theta_j^{(1)}, \theta_j^{(2)}, \ldots, \theta_j^{(T)})$, where

$$\theta_j^{(t+1)} = \theta_j^{(t)} - \alpha \nabla_{\theta_j^{(t)}} \mathcal{L}_{D^{tr}_{\mathcal{T}_j}}(\theta_j^{(t)}). \quad (1)$$

Here, $\alpha$ is the inner learning rate and $\mathcal{L}_D(\varphi)$ the loss of the network with weights $\varphi$ on dataset $D$. Note that the first set of weights in the sequence is equal to the initialization, i.e., $\theta_j^{(0)} = \theta$. Given a distribution of tasks $p(\mathcal{T})$, we can formalize the objective of MAML as finding the initial parameters

$$\arg\min_{\theta} \; \mathbb{E}_{\mathcal{T}_j \sim p(\mathcal{T})} \left[ \mathcal{L}_{D^{te}_{\mathcal{T}_j}}(\theta_j^{(T)}) \right]. \quad (2)$$

Note that the loss is taken with respect to the query set, whereas $\theta_j^{(T)}$ is computed on the support set $D^{tr}_{\mathcal{T}_j}$. The initialization parameters $\theta$ are updated by optimizing the objective in Equation 2, where the expectation over tasks is approximated by sampling a batch of tasks. Importantly, updating these initial parameters requires backpropagation through the optimization trajectories on the tasks from the batch. This implies the computation of second-order derivatives, which is computationally expensive. However, Finn et al. (2017) have shown that first-order MAML, which ignores these higher-order derivatives and is computationally less demanding, performs comparably to the complete, second-order MAML version.
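To make the difference between the first-order and second-order meta-gradient concrete, the following sketch applies the inner loop of Equation 1 to a one-dimensional toy problem. It assumes a quadratic task loss 0.5 * (θ - a)^2 that is shared by the support and query set; this loss, the function names, and the numeric values are illustrative assumptions and not part of the experimental setup.

```python
def inner_loop(theta, a, alpha, steps):
    """T steps of gradient descent (Equation 1) on the loss 0.5 * (theta - a)^2."""
    for _ in range(steps):
        grad = theta - a              # derivative of the toy support loss
        theta = theta - alpha * grad
    return theta


def query_loss(theta, a, alpha, steps):
    """Query loss after adaptation, as a function of the initialization."""
    theta_T = inner_loop(theta, a, alpha, steps)
    return 0.5 * (theta_T - a) ** 2


def meta_gradients(theta, a, alpha, steps):
    """Return the (first-order, second-order) meta-gradient w.r.t. theta."""
    theta_T = inner_loop(theta, a, alpha, steps)
    first_order = theta_T - a         # query gradient at theta_T, trajectory ignored
    second_order = first_order
    for _ in range(steps):
        # Backpropagate through one inner step: d theta_{t+1} / d theta_t
        # equals 1 - alpha * Hessian, and the Hessian of the toy loss is 1.
        second_order *= 1.0 - alpha * 1.0
    return first_order, second_order


theta, a, alpha, steps = 2.0, 0.5, 0.1, 5
fo, so = meta_gradients(theta, a, alpha, steps)

# Central finite difference of the adapted query loss as a sanity check.
eps = 1e-5
fd = (query_loss(theta + eps, a, alpha, steps)
      - query_loss(theta - eps, a, alpha, steps)) / (2 * eps)
```

In this toy setting the second-order meta-gradient is the first-order one scaled by (1 - α)^T, so both point in the same direction and differ only in magnitude.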

Meta-learner LSTM
The meta-learner LSTM by Ravi & Larochelle (2017) can be seen as an extension of MAML as it learns not only the initial parameters $\theta$ but also the optimization procedure which is used to learn a given task. Note that MAML only uses a single base-learner network, while the meta-learner LSTM uses a separate meta-network to update the base-learner parameters. Thus, instead of computing $(\theta_j^{(1)}, \ldots, \theta_j^{(T)})$ using regular gradient descent as done by MAML, the meta-learner LSTM learns a procedure that can produce such a sequence of updates, using a separate meta-network. This trainable optimizer takes the form of a special LSTM module, which is applied to every weight in the base-learner network after the gradients and loss are computed on the support set. The idea is to embed the base-learner weights into the cell state $c$ of the LSTM module. Thus, for a given task $\mathcal{T}_j$, we start with cell state $c_j^{(0)} = \theta$. After this initialization phase, the base-learner parameters (which are now inside the cell state) are updated as

$$c_j^{(t+1)} = \sigma(W_f \cdot x_{f,j}^{(t)} + b_f) \odot c_j^{(t)} + \sigma(W_i \cdot x_{i,j}^{(t)} + b_i) \odot \bar{c}_j^{(t)}, \quad (3)$$

where $\odot$ is the element-wise product, the two sigmoid factors $\sigma$ are the parameterized forget gate $f_j^{(t)}$ and learning rate $i_j^{(t)}$ vectors that steer the learning process, $x_{f,j}^{(t)}$ and $x_{i,j}^{(t)}$ denote the inputs to these gates (in Ravi & Larochelle (2017): the current gradient, loss, and parameters, together with the previous value of the respective gate), and $\bar{c}_j^{(t)} = -\nabla_{\theta_j^{(t)}} \mathcal{L}_{D^{tr}_{\mathcal{T}_j}}(\theta_j^{(t)})$. Both the learning rate and forget gate vectors are parameterized by weight matrices $W_f$, $W_i$ and bias vectors $b_f$ and $b_i$, respectively. These parameters steer the inner learning on tasks and are updated using regular, hand-crafted optimizers after every meta-training task. As noted by Ravi & Larochelle (2017), the update in Equation 3 is equivalent to gradient descent when $c_j^{(t)} = \theta_j^{(t)}$, and the sigmoidal factors are equal to $1$ and $\alpha$, respectively.
Although the same LSTM module is applied to every weight individually to produce updates, it maintains a separate hidden state for each of them. In a similar fashion to MAML, updating the initialization parameters (and LSTM parameters) would require propagating backwards through the optimization trajectory for each task. To circumvent the computational costs associated with this expensive operation, the meta-learner LSTM assumes that input gradients and losses are independent of the parameters in the LSTM.

Towards stateless neural meta-learning
In this section, we study the theoretical relationship between MAML and the meta-learner LSTM. Based on the resulting insight, we formulate a new meta-learning algorithm called TURTLE (stateless neural meta-learning) which is simpler than the meta-learner LSTM and more expressive than MAML.

Theoretical relationship
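There is an obvious relationship between MAML and the meta-learner LSTM. Instead of being restricted to using gradient descent, the meta-learner LSTM uses a meta-network that can learn the optimization procedure to learn new tasks. This observation leads to the following theorem.

Theorem 1. The meta-learner LSTM subsumes MAML.

Proof. We prove this theorem through mathematical induction on the updates made by MAML and the meta-learner LSTM.

Suppose that both MAML and the meta-learner LSTM attempt to learn an arbitrary task $\mathcal{T}_j$. Without loss of generality, let us furthermore assume that both techniques are allowed to make $T \in \mathbb{N}$ updates on the support set $D^{tr}_{\mathcal{T}_j}$ and that MAML does this using gradient descent with a learning rate denoted by $\alpha$.

To satisfy the base case, we can simply assume that MAML and the meta-learner LSTM start with the same initial parameters $\theta$, i.e., $c_j^{(0)} = \theta_j^{(0)} = \theta$. To show that the meta-learner LSTM could learn gradient descent as inner learning procedure, and thereby complete the proof, we have to show for $t = 0, \ldots, T-1$ that $c_j^{(t)} = \theta_j^{(t)}$ implies $c_j^{(t+1)} = \theta_j^{(t+1)}$. Substituting both the right- and left-hand sides using Equation 1 and Equation 3, this means that we have to show the possibility that

$$f_j^{(t)} \odot c_j^{(t)} + i_j^{(t)} \odot \bar{c}_j^{(t)} = \theta_j^{(t)} - \alpha \nabla_{\theta_j^{(t)}} \mathcal{L}_{D^{tr}_{\mathcal{T}_j}}(\theta_j^{(t)}), \quad (4)$$

where we used the shorthands

$$f_j^{(t)} = \sigma(W_f \cdot x_{f,j}^{(t)} + b_f), \quad (5)$$
$$i_j^{(t)} = \sigma(W_i \cdot x_{i,j}^{(t)} + b_i), \quad (6)$$

with $x_{f,j}^{(t)}$ and $x_{i,j}^{(t)}$ denoting the inputs to the forget gate and the learning rate gate. By design, we have that $\bar{c}_j^{(t)} = -\nabla_{\theta_j^{(t)}} \mathcal{L}_{D^{tr}_{\mathcal{T}_j}}(\theta_j^{(t)})$. Moreover, by our inductive hypothesis, we also know that $c_j^{(t)} = \theta_j^{(t)}$. Thus, it remains to be shown that it is possible to have $f_j^{(t)} = \mathbf{1}$ and $i_j^{(t)} = \boldsymbol{\alpha}$, where $\mathbf{1}$ and $\boldsymbol{\alpha}$ are vectors whose entries all equal $1$ and $\alpha$, respectively. Starting with the former, we have that

$$f_j^{(t)} = \mathbf{1} \quad (7)$$
$$\sigma(W_f \cdot x_{f,j}^{(t)} + b_f) = \mathbf{1} \quad (8)$$
$$\sigma(b_f) = \mathbf{1} \quad (9)$$
$$\left[ \frac{1}{1 + e^{-b_f^{(m)}}} \right]_{m=1}^{n} = \mathbf{1}. \quad (10)$$

Here, Equation 8 is due to Equation 5 and Equation 9 by assuming that $W_f$ is a matrix of zeros, which we can do as our only task is to show the mere possibility of a parameterization of $W_f$ and $b_f$ that yields similar behavior as gradient descent. Equation 10 follows from the definition of the sigmoidal function. In the last expression, $n$ denotes the number of base-learner parameters. It is easy to see that this equation holds, to arbitrary precision, for sufficiently large $b_f^{(m)}$ for $m \in \{1, \ldots, n\}$.

Lastly, we show the possibility that $i_j^{(t)} = \boldsymbol{\alpha}$. Leveraging the symmetry between the definitions of $i_j^{(t)}$ and $f_j^{(t)}$, we can use the above derivation and conclude that we have to show that it is possible to have $\alpha = \sigma(b_i^{(m)})$. It is straightforward to show that this equation is satisfied when $b_i^{(m)} = \ln(\alpha / (1 - \alpha))$ for $m \in \{1, \ldots, n\}$. Since $0 < \alpha < 1$, this is indeed possible and thus the proof is complete.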
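The parameterization used in the proof can be checked numerically for a single base-learner parameter. The sketch below is an illustration under stated assumptions: the scalar loss (θ - 3)^2, the choice of gate inputs, and the function names are hypothetical, and the gate weights are set to zero exactly as in the proof.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cell_update(c, grad, gate_inputs, w_f, b_f, w_i, b_i):
    """Gated cell-state update (Equation 3) for a single parameter."""
    f = sigmoid(sum(w * x for w, x in zip(w_f, gate_inputs)) + b_f)
    i = sigmoid(sum(w * x for w, x in zip(w_i, gate_inputs)) + b_i)
    return f * c + i * (-grad)


def loss_and_grad(theta):
    """Toy support loss (assumption): (theta - 3)^2 and its derivative."""
    return (theta - 3.0) ** 2, 2.0 * (theta - 3.0)


alpha = 0.05
w_f = w_i = [0.0, 0.0, 0.0]              # zero gate weights, as in the proof
b_f = 40.0                               # large bias: forget gate saturates at 1
b_i = math.log(alpha / (1.0 - alpha))    # inverse sigmoid of alpha

theta = c = -1.0                         # shared initialization (base case)
for _ in range(10):
    _, grad = loss_and_grad(theta)
    theta = theta - alpha * grad         # gradient descent (Equation 1)
    loss_c, grad_c = loss_and_grad(c)
    c = cell_update(c, grad_c, [grad_c, loss_c, c], w_f, b_f, w_i, b_i)
```

After ten updates, `c` and `theta` agree up to floating-point error, which mirrors the inductive argument: with these biases the gated update is indistinguishable from a gradient descent step.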

Potential problems of the meta-learner LSTM
The theoretical insight that the meta-learner LSTM subsumes MAML is not congruent with empirical findings which show that MAML outperforms the meta-learner LSTM on the miniImageNet image classification benchmark (Finn et al., 2017; Ravi & Larochelle, 2017), indicating that the LSTM is unable to successfully navigate the error landscape to find a solution at least as good as the one found by MAML.
A potential cause is that the meta-learner LSTM attempts to learn a stateful optimization procedure. That is, the LSTM module has to learn state dynamics that allow for quick learning of new tasks. Learning such a stateful procedure requires additional parameters which may increase the complexity of the meta-landscape. We conjecture that removing the stateful nature of the trainable optimizer may smoothen the meta-landscape and allow for finding better solutions. For this reason, we replace the LSTM module with a regular fully-connected, feed-forward network.
Another potential cause of the underperformance could be the first-order assumption made by the meta-learner LSTM, which we briefly mentioned in Section 3.3. Effectively, this disconnects the computational graph by stating that weight updates made at time step $t$ by the meta-network do not influence the inputs that this network receives at future time steps $t'$ with $t < t' < T$. It has to be noted that this assumption and its consequences are of a substantially different nature than the assumption made by first-order MAML, which was shown to leave the performance mostly unaffected (Finn et al., 2017). More specifically, first-order MAML ignores the optimization trajectory and simply updates the base-learner parameters $\theta$ in the opposite direction of the gradients of the query set loss, i.e., $-\nabla_{\theta_j^{(T)}} \mathcal{L}_{D^{te}_{\mathcal{T}_j}}(\theta_j^{(T)})$. When the base-level landscape is reasonably locally smooth, these query set update directions point towards the locally optimal parameters of the task. In turn, first-order MAML moves the initialization parameters closer to these locally optimal parameters. It is thus intuitive that this first-order assumption is mostly harmless. For the meta-learner LSTM, however, we fail to identify such an intuition that justifies the first-order assumption. Moreover, Ravi & Larochelle (2017) have not shown an empirical justification for this assumption.
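The term that a first-order assumption removes from the meta-gradient of a learned optimizer can be written out explicitly. The derivation below is a sketch under simplifying assumptions: a generic meta-network $g_\phi$ whose only input is the support-set gradient $u^{(t)}$, an additive update without extra scaling, and the task index omitted. The symbols $u^{(t)}$ and $g_\phi$ are introduced here for illustration.

```latex
% Inner update with a learned optimizer whose input is the current gradient
\theta^{(t+1)} = \theta^{(t)} + g_\phi\big(u^{(t)}\big),
\qquad
u^{(t)} = \nabla_{\theta} \mathcal{L}_{D^{tr}}\big(\theta^{(t)}\big)

% Total derivative of the updated weights with respect to the meta-parameters
\frac{d\theta^{(t+1)}}{d\phi}
  = \frac{d\theta^{(t)}}{d\phi}
  + \frac{\partial g_\phi}{\partial \phi}\big(u^{(t)}\big)
  + \underbrace{
      \frac{\partial g_\phi}{\partial u}\big(u^{(t)}\big)\,
      \nabla^{2}_{\theta} \mathcal{L}_{D^{tr}}\big(\theta^{(t)}\big)\,
      \frac{d\theta^{(t)}}{d\phi}
    }_{\text{dropped when inputs are treated as constants}}
```

The last term is the only place where the Hessian of the support loss appears. It captures how earlier updates change the gradients that the meta-network receives later, which is exactly the dependency that is severed when the inputs are assumed to be independent of the meta-network parameters.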

TURTLE
In an attempt to make the meta-landscape easier to navigate, we introduce a new algorithm, TURTLE, which trains a feed-forward meta-network to update the base-learner parameters. TURTLE is simpler than the meta-learner LSTM as it uses a stateless feed-forward neural network as a trainable optimizer, yet more expressive than MAML as its meta-network can learn to perform gradient descent.
The trainable optimizer in TURTLE is thus a fully-connected feed-forward neural network. We denote the batch of inputs that this network receives at time step $t$ in the inner loop for task $\mathcal{T}_j$ as $I_j^{(t)} \in \mathbb{R}^{n \times d}$, where $n$ and $d$ are the number of base-learner parameters and the dimensionality of the inputs, respectively. The exact inputs that this network receives will be determined empirically, but two choices, inspired by the meta-learner LSTM, are: (i) the gradients with respect to all parameters and (ii) the current loss (repeated $n$ times for each parameter in the base-network).
Moreover, we could mitigate the absence of a state in the meta-network by including a time step $t \in \{0, 1, \ldots, T-1\}$ and/or historical information such as a moving average of previous gradients or updates made by the meta-network. We denote the latter by $h_j^{(t)}$, which is updated by

$$h_j^{(t+1)} = \beta h_j^{(t)} + (1 - \beta) v_j^{(t)},$$

where $0 \le \beta \le 1$ is a constant that determines the time span over which previous inputs affect the new state $h_j^{(t+1)}$, and $v_j^{(t)} \in \mathbb{R}^n$ is the new information (either the updates or gradients at time step $t$). When using previous updates, we initialize $h_j^{(0)}$ by a vector of zeros. Weight updates are then computed as follows:

$$\theta_j^{(t+1)} = \theta_j^{(t)} + \alpha \odot g_\phi(I_j^{(t)}),$$

where $\alpha \in \mathbb{R}^n$ is a vector of learning rates per parameter. Note that this weight update equation is simpler than the one used by the meta-learner LSTM (see Equation 3) as our meta-network $g_\phi$ is stateless. Therefore, we do not have parameterized forget and input gates. Moreover, the learning rates per parameter in $\alpha$ are not constrained to be within the interval $[0, 1]$ as is the case for the meta-learner LSTM due to the use of the sigmoid function.
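The claim that a stateless feed-forward meta-network can express gradient descent can be illustrated with a hand-set network. In the sketch below, the meta-network receives only the gradient of each parameter (d = 1), has a single hidden layer with two ReLU units, and adds its output to the parameters with α set to a vector of ones. The weights in `phi_gd`, the toy loss, and all function names are assumptions made for illustration; they are not the trained meta-network from this work.

```python
def relu(z):
    return max(0.0, z)


def meta_network(x, phi):
    """Stateless feed-forward network g_phi applied to one scalar input."""
    (w1, b1), (w2, b2) = phi
    hidden = [relu(w * x + b) for w, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def turtle_step(theta, grads, phi, alpha):
    """Additive update: theta + alpha * g_phi(gradient), per parameter."""
    return [th + a * meta_network(g, phi)
            for th, g, a in zip(theta, grads, alpha)]


def gradients(theta, targets):
    """Gradient of the toy loss sum_m (theta_m - target_m)^2 (assumption)."""
    return [2.0 * (th - t) for th, t in zip(theta, targets)]


lr = 0.1
# Since relu(x) - relu(-x) = x, this network computes g(x) = -lr * x.
phi_gd = (([1.0, -1.0], [0.0, 0.0]), ([-lr, lr], 0.0))

targets = [1.0, -2.0, 0.5]
theta_turtle = [0.0, 0.0, 0.0]
theta_gd = [0.0, 0.0, 0.0]
alpha = [1.0, 1.0, 1.0]

for _ in range(5):
    theta_turtle = turtle_step(theta_turtle,
                               gradients(theta_turtle, targets),
                               phi_gd, alpha)
    theta_gd = [th - lr * g
                for th, g in zip(theta_gd, gradients(theta_gd, targets))]
```

With these weights the two trajectories coincide, so gradient descent is one point in the space of update rules that the meta-network can represent; meta-training is free to move away from it.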
In Algorithm 1 we show, in different colors, the code for MAML (red), the meta-learner LSTM (blue), and TURTLE (green). Although the code structure of the three meta-learners is similar, the update rules are quite different. Both the base- and meta-learner parameters $\theta$ and $\phi$ are updated by backpropagation through the optimization trajectories (line 11).

Experiments
In this section, we describe our experimental setup and the results that we obtained.

Problem descriptions
For our experimental evaluation, we use two problems: sine wave regression and image classification.

SINE WAVE REGRESSION
The sine wave regression problem was originally proposed by Finn et al. (2017). In this setup, every task $\mathcal{T}_j$ corresponds to a different sine wave function $s_j(x) = a \cdot \sin(x - p)$, where $a \in [0.1, 5.0]$ is the amplitude, and $p \in [0, \pi]$ the phase. Both of these parameters are selected uniformly at random from their corresponding ranges.
Given a task $\mathcal{T}_j = (D^{tr}_{\mathcal{T}_j}, D^{te}_{\mathcal{T}_j})$, the goal for the base-learner network is to infer the sine wave that gives rise to observations $(x_i, y_i)$ in the support set of a task. The quality of this inference is determined with the mean squared error (MSE) on the observations from the query set $D^{te}_{\mathcal{T}_j}$. We use the same base-learner network as Finn et al. (2017), i.e., a fully-connected feed-forward network consisting of a single input node followed by two hidden layers with 40 ReLU nodes each and a final single-node output layer. We use the MSE loss function to train the model.
Every query set $D^{te}_{\mathcal{T}_j}$ contains 50 examples to ensure proper evaluation of the inner learning process. We perform 30 runs with different random weight initializations for every experiment and perform meta-validation every 2.5K tasks. The best validation performances are averaged and reported as the validation performance of an algorithm. We also include 95% confidence intervals. The best validation model per run is also evaluated on the meta-test tasks, for which we adopt a similar evaluation protocol.
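A sine wave task as described above could be generated along the following lines. The amplitude and phase ranges and the query set size come from the text; the input range [-5, 5] and the support set size of 5 (matching the 5-shot experiments) are assumptions of this sketch, as are the function names.

```python
import math
import random


def sample_sine_task(rng, k_shot=5, n_query=50, x_range=(-5.0, 5.0)):
    """Sample one regression task s(x) = a * sin(x - p) with its data sets."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, math.pi)

    def make_set(size):
        # x_range is an assumption; the text does not state the input domain.
        xs = [rng.uniform(*x_range) for _ in range(size)]
        return [(x, amplitude * math.sin(x - phase)) for x in xs]

    return make_set(k_shot), make_set(n_query), (amplitude, phase)


def mse(predict, dataset):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in dataset) / len(dataset)


rng = random.Random(0)
support, query, (amplitude, phase) = sample_sine_task(rng)

# An oracle that knows the amplitude and phase attains zero query error,
# whereas a constant-zero prediction does not.
oracle_error = mse(lambda x: amplitude * math.sin(x - phase), query)
zero_error = mse(lambda x: 0.0, query)
```

A base-learner is adapted on `support` and scored with `mse` on `query`; the oracle error serves as the lower bound for that score.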

IMAGE CLASSIFICATION
We also use the popular miniImageNet benchmark (Vinyals et al., 2016), with the class splits proposed by Ravi & Larochelle (2017), which were also used by Finn et al. (2017). In addition, we use the CUB benchmark (Wah et al., 2011). Both benchmarks adhere to the $N$-way $k$-shot paradigm described in Section 3.1. Following Chen et al. (2019), we use 16 examples per class in every query set.
Moreover, we use the same base-learner network as used by Snell et al. (2017) and Chen et al. (2019). This network is a stack of four identical convolutional blocks. Each block consists of 64 convolutions of size 3 × 3, batch normalization, a ReLU nonlinearity, and a 2D max-pooling layer with a kernel size of 2. The resulting embeddings of the 84 × 84 × 3 input images are flattened and fed into a dense layer with $N$ nodes (one for every class in a task). The base-learner is trained to minimize the cross-entropy loss on the query set, conditioned on the support set.
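The size of the flattened embedding follows from the block structure. The short calculation below assumes that the 3 × 3 convolutions are padded so that they preserve the spatial size and that max-pooling uses a stride of 2 with floor rounding; neither detail is stated in the text, so the result should be read as holding under those assumptions.

```python
def embedding_size(input_size=84, channels=64, blocks=4, pool=2):
    """Spatial side length and flattened size after the convolutional blocks."""
    side = input_size
    for _ in range(blocks):
        # Assumption: the padded 3x3 convolution keeps the spatial size,
        # so only the max-pooling layer shrinks it (floor division).
        side = side // pool
    return side, channels * side * side


side, flat = embedding_size()           # 84 -> 42 -> 21 -> 10 -> 5
n_way = 5
dense_parameters = flat * n_way + n_way  # weights plus biases of the output layer
```

Under these assumptions the dense layer receives 64 · 5 · 5 = 1600 features per image.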
Following Chen et al. (2019), we use 600 meta-validation tasks and 600 meta-test tasks. Furthermore, when the number of examples per class is $k = 1$, we use 60K meta-training tasks. For $k = 5$, we train on 40K meta-training tasks. We perform 5 runs with different random weight initializations for every experiment and, as with sine wave regression, we perform meta-validation every 2.5K tasks. The same evaluation protocol as used for sine wave regression applies, with the exception that we are now interested in the accuracy instead of the MSE loss.
Importantly, we investigate two main scenarios. In the first scenario, the within-distribution setting, the techniques are evaluated on tasks from the same data set that was used for meta-training (e.g., train on miniImageNet and evaluate on miniImageNet). In the second scenario, the out-of-distribution setting proposed by Chen et al. (2019), the techniques are evaluated on tasks from a different data set (CUB) than the one used for meta-training (miniImageNet).

Hyperparameter optimization
First, we investigate the effect of the order of information (first- versus second-order), the number of updates $T$ per task, and further increasing the number of layers of the meta-network on the performance of TURTLE on sine wave regression. The results are displayed in Figure 1. Note that in this experiment, we fixed the learning rate vector $\alpha$ to be a vector of ones, which means that the updates proposed by the meta-network are directly added to the base-learner parameters without any scaling. Moreover, the only input that the meta-network receives is the gradient of the loss on the support set with respect to a base-learner parameter, and every hidden layer of the meta-network consists of 20 nodes followed by ReLU nonlinearities.
As we can see, the difference between first- and second-order MAML is relatively small, which was also found by Finn et al. (2017). In contrast, this is not the case for TURTLE, where the first-order variant fails to achieve a similar performance as second-order TURTLE. Furthermore, we see that the stability of TURTLE decreases as $T$ increases. Lastly, we find that 5 or 6 hidden layers yield the best performance across different values of $T$. For this reason, all further TURTLE experiments will be conducted with a meta-network of 5 hidden layers.

Table 1. Median meta-test accuracy scores and 95% confidence intervals over 5 runs of 5-way image classification on miniImageNet (left) and CUB (right). The best performance is displayed in bold font. Note that a higher accuracy indicates better performance.
In an attempt to stabilize TURTLE and further improve the performance, we experimented with the inclusion of additional input information (Section 4.3) and larger meta-batches. This was done through individual grid searches on top of the meta-network with 5 hidden layers. The grids used are displayed in Table 2. The best settings were: (no, gradients, 0, 1), (yes, gradients, 0.9, 2), and (yes, gradients, 0.3, 4) for $T = 1$, $T = 5$, and $T = 10$ updates per task, respectively.

In Figure 2, we compare the 5-shot meta-validation performances of these tuned TURTLE models with those of MAML and the meta-learner LSTM (for which we used the hyperparameters reported by the original authors). As we can see, TURTLE outperforms both MAML and the meta-learner LSTM. Lastly, we note that 5-step TURTLE achieves the best performance.

Image classification results
Without additional hyperparameter tuning, we now investigate the performance of 5-step TURTLE on image classification tasks. An overview of the hyperparameters that were used for all techniques can be found in the supplementary material. We compare the performance against three simple transfer-learning models, following Chen et al. (2019): train from scratch, finetuning, and baseline++. The meta-test accuracy scores on 5-way miniImageNet and CUB classification are displayed in Table 1. Note that we use the best-reported hyperparameters for MAML and the meta-learner LSTM on miniImageNet, while we use the best hyperparameters found on sine wave regression for TURTLE. Based on our hyperparameter experiments for TURTLE, we also investigate an enhanced version of the meta-learner LSTM which uses raw gradients as meta-learner input and second-order information. On CUB, we use the same hyperparameters as on miniImageNet.
As we can see, the performances of all models are better on 5-shot classification compared with 1-shot classification. Looking at the results for miniImageNet, we see that TURTLE yields the best performance in three settings. On the CUB data set, however, TURTLE is outperformed by MAML in the 5-shot setting, while the reverse holds in the 1-shot setting. Moreover, we see that our enhanced version of the meta-learner LSTM achieves better performance than the original meta-learner LSTM, and is on par with TURTLE in two settings.

Cross-domain performance and time complexity
We also investigate the robustness of the meta-learning algorithms when a task distribution shift occurs. For this, we train the techniques on miniImageNet and evaluate their performance on CUB tasks, following Chen et al. (2019). The results are shown in Table 3. As we can see, TURTLE also achieves the best performance in this challenging scenario. Furthermore, we see that our enhanced meta-learner LSTM achieves better performance than the original, especially in the 5-shot setting.

Lastly, we compare the running times of MAML, the meta-learner LSTM, and TURTLE on miniImageNet and CUB. A run comprises the time it costs to perform meta-training, meta-validation, and meta-testing on miniImageNet, and evaluation on CUB. We measure the average time in full hours across 5 runs on nodes with a Xeon Gold 6126 2.6GHz 12 core CPU and PNY GeForce RTX 2080TI GPU. The results are displayed in Table 4. As we can see, the first-order algorithms (fo-MAML and the meta-learner LSTM) are the fastest, while the second-order algorithms (so-MAML and TURTLE) are slower. Note that the original meta-learner LSTM is slower at 1-shot learning compared with 5-shot learning due to the fact that it makes 12 update steps per task in the former and only 5 in the latter. TURTLE is, despite its name, not much slower than the other second-order approach (so-MAML), indicating that the time complexity is dominated by learning the base-learner initialization parameters.

Discussion and future work
In this work, we have formally shown that the meta-learner LSTM (Ravi & Larochelle, 2017) subsumes MAML (Finn et al., 2017). Experiments of Finn et al. (2017) and our own, however, show that the meta-learner LSTM does not necessarily match the performance of MAML in practice. We empirically demonstrate that TURTLE outperforms both MAML and the meta-learner LSTM on sine wave regression and, without additional hyperparameter tuning, on the frequently used miniImageNet benchmark. This shows that better update rules exist for fast adaptation than regular gradient descent, which is in line with findings by Andrychowicz et al. (2016).
Our hyperparameter analysis on sine wave regression shows that second-order gradients are crucial for achieving good performance with TURTLE. In contrast, first-order MAML is a good approximation to second-order MAML as it yields similar performance (Finn et al., 2017). This finding highlights the distinction between the base- and meta-level goals.
On the base-level, we wish to find an initialization that is close to the optimal parameters for tasks $\mathcal{T}_j \sim p(\mathcal{T})$. Assuming reasonable local smoothness of the base-level landscape, we could ignore our optimization trajectory, as the gradient of the query loss at $\theta_j^{(T)}$ will presumably point towards the direction of the optimal parameters for task $\mathcal{T}_j$, and hence we can move the initialization $\theta$ in that direction. On the meta-level, in contrast, we wish to learn an optimization strategy, which is sequential in nature. This implies that an update at time step $t$ influences the gradient inputs that the meta-network will receive at time steps $t'$ with $t < t' < T$. The consequence of ignoring second-order gradients, as done by the meta-learner LSTM, is that the computation graph becomes disconnected, which makes the meta-network unaware that an update at time step $t$ influences the inputs at these future time steps.

Moreover, we enhanced the meta-learner LSTM by using raw gradients as meta-learner input and second-order information, as they were found to be important for TURTLE. Our results indicate that this enhanced version of the meta-learner LSTM systematically outperforms the original technique by 1-6% accuracy. A promising direction for future work may be to investigate why TURTLE has a slight edge compared with the enhanced meta-learner LSTM. Another, somewhat related, direction would be to further enhance the meta-learner LSTM, for example, by using meta-batches, as that also increased the training stability of TURTLE. As a side note, we believe that the performance of TURTLE could also be improved on miniImageNet and CUB by performing hyperparameter tuning on these specific data sets.
While TURTLE and the enhanced meta-learner LSTM were shown to yield good performance, it has to be noted that this comes at the cost of increased computational expenses compared with first-order algorithms. That is, these second-order algorithms perform backpropagation through the entire optimization trajectory, which requires storing intermediate updates and the computation of second-order gradients. While this is also the case for MAML, it has been shown that first-order MAML achieves a similar performance whilst avoiding this expensive backpropagation process. For TURTLE, however, this is not the case, which means that other approaches should be investigated in order to reduce the computational costs. Future research may draw inspiration from Rajeswaran et al. (2019), who approximated second-order gradients in order to speed up MAML.
Successfully using meta-learning algorithms in scenarios where task distribution shifts occur remains an important open challenge in the field of meta-learning. Our cross-domain experiment demonstrates that the learned optimization procedure by TURTLE generalizes to different tasks than the ones seen at training time, which is in line with findings by Andrychowicz et al. (2016). For this reason, we think that learned optimizers may be an important piece of the puzzle to broaden the applicability of meta-learning techniques to real-world problems. Future work can further investigate this hypothesis.
In short, our findings show the benefit of learning an optimizer in addition to the initialization weights and highlight the importance of second-order gradients.
use the above derivation and conclude that we have to show that it is possible to have α = σ(b i ).It is straightforward to show that this equation is satisfied when b

Algorithm 1. MAML, the meta-learner LSTM, and TURTLE. The three entries on each line refer to these techniques, in that order; only the first two lines of the listing are recoverable.
1: Initialize parameters Θ = {θ} | {θ, φ} | {θ, φ}
2: Initialize g_φ as N.A. | LSTM | feed-forward network

Figure 1. Influence of the order, number of update steps, and number of hidden layers (horizontal axis) on the meta-validation performance of TURTLE on 5-shot sine wave regression. We also plot the performance of first- and second-order MAML for comparison. Note that a lower MSE loss corresponds to better performance.

Figure 2. Meta-validation performance of MAML, the meta-learner LSTM, and TURTLE on 5-shot sine wave regression. Note that a lower error indicates better performance.

Table 2. The hyperparameter grids that we used for tuning TURTLE on sine wave regression.

Table 3. Median meta-test accuracy scores on 5-way CUB after being trained on miniImageNet. The median accuracy and 95% confidence intervals were computed over 5 runs. The meta-learner LSTM * refers to our enhanced version of the meta-learner LSTM, which takes raw gradients as inputs, uses second-order gradients, and makes 8 updates per task.

Table 4. Average running time in hours of 5-step MAML, the meta-learner LSTM, and TURTLE on 5-way miniImageNet and CUB. The standard deviations across the five runs are given by ±x.