Optimizing quantum heuristics with meta-learning

Variational quantum algorithms, a class of quantum heuristics, are promising candidates for the demonstration of useful quantum computation. Finding the best way to amplify the performance of these methods on hardware is an important task. Here, we evaluate the optimization of quantum heuristics with an existing class of techniques called `meta-learners'. We compare the performance of a meta-learner to Bayesian optimization, evolutionary strategies, L-BFGS-B and Nelder-Mead approaches, for two quantum heuristics (quantum alternating operator ansatz and variational quantum eigensolver), on three problems, in three simulation environments. We show that the meta-learner comes near to the global optima more frequently than all other optimizers we tested in a noisy parameter setting environment. We also find that the meta-learner is generally more resistant to noise, for example seeing a smaller reduction in performance in Noisy and Sampling environments and performs better on average by a `gain' metric than its closest comparable competitor L-BFGS-B. These results are an important indication that meta-learning and associated machine learning methods will be integral to the useful application of noisy near-term quantum computers.


I. INTRODUCTION
Machine learning is a powerful tool for tackling challenging computational problems [1][2][3]. A recent explosion in the number of machine learning applications is driven by the availability of data, improved computational resources and deep learning innovations [4][5][6]. Interestingly, machine learning has also been applied to the problem of improving machine learning models, in a field known as meta-learning [7,8].
In general, meta-learning is the study of models which 'learn to learn'. A prominent example of a meta-learner model is one that learns how to optimize parameters of a function [9][10][11][12]. Traditionally, this function might be a neural network [9] or a black-box [12]. Meta-learning and other new methods, including Auto-ML [13], are changing the way we train, use and deploy machine learning models [14][15][16]. Here, we use a meta-learner to find good parameters for quantum heuristics, and compare that approach to other parameter optimization strategies. Figure 1 shows an example of what the implementation of a meta-learner might look like, in the context of optimizing the parameters of a parametrized quantum circuit, illustrated as a quantum processing unit (QPU). In this work, we refer to a QPU and a quantum circuit interchangeably.
Recent progress in quantum computing hardware has encouraged the development of quantum heuristic algorithms that can be simulated on near-term devices [17,18]. One important heuristic approach involves a class of algorithms known as variational quantum algorithms. Variational quantum algorithms are 'hybrid' quantum-classical algorithms in which a quantum circuit FIG. 1. Meta-learner training on a Quantum Processing Unit (QPU -green). This diagram illustrates how the meta-learner used in this work can optimize the parameters of a quantum circuit (see Section III for a full description). Here, we outline a high level description for each time-step, such as T − 2 (shown). A model, in our case a long short-term memory (LSTM) recurrent neural network (blue) (Section II), takes in the gradients of the cost function. The LSTM outputs parameters φ for the QPU to try at the next step. This procedure takes place over several time-steps in a process known as unrolling. The costs from each time-step are summed to compute the loss, L (purple), at time T .
is run multiple times with variable parameters, and a classical outer loop is used to optimize those parameters (see Figure 2). The Variational Quantum Eigensolver (VQE) [19], quantum approximate optimization algorithm and its generalization Quantum Alternating Operator Ansatz (QAOA) [20,21] 2. A single time-step of a general variational quantum algorithm, where the classical processing unit (CPU -blue) outputs parameters φ dependent on some evaluation, in this case the expectation value H by the quantum processing unit (QPU -green). The quantum subroutine is encoded by a quantum circuit U ( φ) (Figure 3) parameterized by φ, and it is responsible for generating a state |ψ( φ) . This state is measured in order to extract relevant information (e.g. expectation value of a Hamiltonian). The classical subroutine suggests parameters φ based on the values provided by a quantum computer, and sends new parameters back to the quantum device. This process is repeated until the given goal is met, i.e. convergence to a problem solution (e.g. the ground state of a Hamiltonian).
gorithms that can be implemented in this variational setting. These algorithms are effective in optimization [22][23][24] and simulation of quantum systems [25][26][27]. The classical subroutine is an optimization of parameters, and is an important part of the algorithm both in terms of the quality of solution found and the speed at which it is found.
Techniques for the classical outer loop optimization are well-studied [19,22,[28][29][30][31] and several standard optimization schemes can be used. However, optimization in this context is difficult, due to technological restrictions (e.g. hardware noise), and to theoretical limitations such as the stochastic nature of quantum measurements [32] or the barren plateaus problem [33]. Therefore, it is imperative to improve not only the quantum part of the hybrid algorithms, but also to provide a better and more robust framework for classical optimization. Here, we focus on the classical optimization subroutine, and suggest metalearning as a viable tool for parameter setting in quantum circuits. Moreover, we demonstrate that these methods, in general, are resistant to noisy data, concluding that these methods may be especially useful for algorithms implemented with noisy quantum hardware.
We compare the performance of optimizers for parameter setting in quantum heuristics, specifically variational quantum algorithms. The optimization methods we compare are L-BFGS-B [34], Nelder-Mead [35], Gaussian process regression (referred to here as Bayesian optimization) [36], evolutionary strategies [37] and a Long Short Term Memory (LSTM) recurrent neural network model [38] -the meta-learner. Whilst in the production of this work, we noticed similar research [39] exploring the potential of gradient-free meta-learning techniques as initializers. Here, we use a gradient-based version of the meta-learner as a standalone optimizer (not an initializer), and a larger set of other optimizers. Though we include a diverse range of techniques, clearly, there are other optimizers that might be used, for example SPSA [40][41][42][43], however our analysis focuses on those described above.
This comparison is performed in three different simulation environments: Wave Function, Sampling and Noisy. The Noisy environment is an exact wave function simulation with parameter setting noise. The simulation environments are defined in detail in Section III.
The first heuristic we explore for this comparison is QAOA [20,21] for the MAX-2-SAT and Graph Bisection constraint satisfaction problems [44]. Second, VQE [19] is used for estimating the ground-state of Fermi-Hubbard models [45]. We show that, broadly speaking, the metalearner performs as well or better than the other optimizers, measured by a 'gain' metric defined in Section IV. Most notably, the meta-learner is observed to be more robust to noise. This is highlighted through showing the number of near-optimal solutions found in each problem by the different optimizers over all simulation environments. The takeaway of this paper is that these methods show promise, specifically the features of robustness and adaptability to hardware, and how meta-learning might be applied to noisy near-term devices.
In Section II we describe the background of the heuristics and optimizers. Then in Section III we outline the general setup including problems, the optimizers, and the simulation environments. Section IV details the meth-ods, including the metrics, optimizer configuration and meta-learner training. In Section V we discuss our results. Finally in Section VI the work is summarized and we suggest paths forward.

A. Quantum Alternating Operator Ansätz
The quantum approximate optimization algorithm [20] and its generalization the quantum alternating operator ansatz [21] (QAOA) form families of parameterized quantum circuits for generating solutions to combinatorial optimization problems. After initializing a suitable quantum state, a QAOA circuit consists of a fixed number p blocks (see Figure 3), where each block is composed of a phase unitary generated from the cost function we seek to optimize, followed by a mixing unitary. The phase unitary typically yields a sequence of multiqubit Pauli-Z rotations each with phase angle γ. In the original proposal of Farhi et al. [20], the mixing unitary is a Pauli-X rotation of angle β on each qubit. However, extending the protocol to more general encodings and problem constraints naturally leads to a variety of more sophisticated families of mixing operators [21,46]. At the end of the circuit a measurement is performed in the computational (Pauli-Z) basis to return a candidate problem solution.
An important open research area is to develop strategies for determining good sets of algorithm parameters (i.e. the γ and β values for each block) which yield good (approximate or exact) solutions with nonnegligible probability. These parameters may be determined a priori through analysis, or searched for as part of a classical-quantum hybrid algorithm using a variational or other approach. Prior work on parameter setting in QAOA includes analytic solutions for special cases [47], comparison of analytical and finite difference methods [30], a method for learning a model for a good schedule [28], and comparison of standard approaches over problem classes [31].
We evaluate parameter setting strategies for QAOA for MAX-2-SAT and Graph Bisection, both NP-hard combinatorial optimization problems [44,48]. We use standard' [20] and generalized [21] QAOA methods, respectively. The latter problem mapping is of particular interest as it utilizes an advanced family of QAOA mixing operators from [21] that has recently been demonstrated to give advantages over the standard mixer [49].

B. Variational Quantum Eigensolver
The VQE [19] is a hybrid optimization scheme built on the variational principle. It aims to estimate the ground state energy of a problem Hamiltonian through iterative improvements of a trial wave function. The trial wave function is prepared as a quantum state using a parame-terized quantum circuit, and the expectation value of the Hamiltonian with respect to this state is measured. This energy value is then passed to a classical device, which uses optimization techniques (SPSA, BFGS, etc.) to update the parameters. The process is repeated for a fixed number of iterations, or until a given accuracy achieved.
The initial demonstration of VQE used Nelder-Mead, a standard derivative-free approach, for parameter setting after observing that gradient descent methods did not converge [19]. Since then, examples in the literature include the use of Simultaneous Perturbation Stochastic Approximation (SPSA) in [42], where the authors argue simultaneous perturbation methods might be particularly useful for fermionic problems, but classical problems (such as MaxCut) may favor more standard techniques (i.e. gradient descent). Other routines used include COBLYA, L-BFGS-B, Nelder-Mead and Powell in [50]. Finally, in [51] the authors explore the use of Bayesian optimization for parameter setting in VQE.

C. Meta-learning
Meta-learning is the study of how to design machine learning models to learn fast, well and with few training examples [52]. One specific case is a model, referred to here as a meta-learner [11], which learns how to optimize other models. A model is a parameterized function. Meta-learners are not limited to training machine learning models; they can be trained to optimize general functions [12]. In the specific area of using models to optimize other models, early research explored Guided Policy Search [10], which has been superceded by LSTMs [9,12,53,54]. An LSTM is a recurrent neural network, developed to mitigate vanishing or exploding gradients prominent in other recurrent neural network architectures [55,56]. It consists of a cell state, a hidden state, and gates, and all three together are called an LSTM cell. At each time-step, changes are made to the cell state dependent on the hidden state, the gates (which are models) and the data input to the LSTM cell. The hidden state is changed dependent on the gates and the input. The cell state and hidden state are then passed to the LSTM cell at the next time-step. A full treatment of an LSTM is given in Reference [38]. An LSTM is good for learning long-term (over many time-steps) dependencies, like those in optimization.
Meta-learners have been used for fast general optimization of models with few training examples [11]: Given random initial parameters we seek to achieve a fast convergence to 'good' (defined by some metric) general parameters. This same problem feature appears for QAOA, where good parameters may follow some common distribution across problems [28]. A meta-learner could be used to find general good parameters, and fine-tuning left to some other optimizer [57], though this approach was not explored here.

A. Simulation Environments
We compare optimization methods in 'Wave Function', 'Sampling' and 'Noisy' simulation environments. The Wave Function case is an exact wave function simulation. For Sampling, the simulation emulates sampling from a hardware-implemented quantum circuit, where the variance of the expectation value evaluations is dependent on the number of samples taken from the device.
In these experiments, we set the number of shots (samples from the device) to 1024.
Lastly, in the Noisy case we have modelled parameter setting noise in an exact wave function simulation. We assume exact, up to numerical precision, computation of the expectation value (via some theoretical quantum computer which can compute the expectation value of a Hamiltonian given a state up to arbitrary precision). Then, for each single-qubit rotation gate, we added normally distributed, standard deviation σ = 0.1, noise to the parameters at each optimization step. In order to determine σ, we evaluated the relationship between the fidelity of an arbitrary rotation (composed of three singlequbit Pauli rotation gates R Z (α)R Y (β)R Z (γ)), around the Bloch sphere and parameter noise; see Figure 4. Assuming industry standard single qubit gate rotations of 99% [58], a value of σ = 0.1 is approximated, see

Local optimizers
Nelder-Mead and L-BFGS-B are gradient-free and gradient-based approaches, respectively, which are standard local optimizers [28][29][30][31]. Local optimizers have a notion of location in the solution space. They search for candidate solutions from this location. They are usually fast, and are susceptible to finding local minima. L-BFGS-B is a local optimizer and has access to the gradients. Out of all optimizers chosen it is the closest to the meta-learner in terms of information available to the optimizer and computational burden (i.e. the cost of computing the gradients). Nelder-Mead was chosen as it appears throughout the literature [19,30,50,57] and provides a widely recognized benchmark.

Bayesian Optimization
Global optimizers are designed to search for a global optima, and are generally more computationally intensive. An important class of global black-box optimizers we consider are Bayesian optimizers.
Bayesian optimization, also known as Gaussian process regression, involves computing updates to a posterior probability distribution over candidate functions, given a prior distribution and training examples [60]. The training examples are function input-output pairs. The runtime of Bayesian optimization scales as O(N 3 ) where N is the number of training points. It is useful for finding a global optima with the minimum number of steps [36].

C. Evolutionary Strategies
Evolutionary strategies are another class of global black-box optimization techniques: A population of candidate solutions (individuals) are maintained, which are evaluated based on some cost function. Genetic algorithms and evolutionary strategies have been used for decades. More recent work has shown these techniques to be competitive in problems of reinforcement learning [37,61].
All implementations of evolutionary strategies are population-based optimizers. In the initial iteration, the process amounts to a random search. In each iteration, solutions with lower costs are more likely to be selected as parents (though all solutions have a nonzero probability of selection). Different methods for selecting parents exist, but we used binary tournament selection, in which two pairs of individuals are selected, and the individual with the lowest cost from each pair is chosen to be a parent.
In more precise terms, parents are the candidate solutions selected to participate in crossover. Crossover takes two parent solutions and produces two children solutions by randomly exchanging the bitstring defining the first parent with the second. Each child replaces its parent in the population of candidate solutions. The process is repeated, so costs for each child are evaluated, and these children are used as parents for the next iteration [62]. In our case, the bitstring is divided into n subsections, where n is the number of parameters passed to the quantum heuristic. Each subsection is converted to an integer using Gray encoding and then interpolated into a real value in the range [−π/2, π/2]. Gray codes are used as they avoid the Hamming walls found in more standard binary encodings [63].
It is the bitstrings that are operated on by the genetic algorithm. When two individuals are selected to reproduce, a random crossover point, b c is selected with probability P c . Two children are generated, one with bits left of b c from the first parent and bits to to the right of b c originating from the second parent. The other child is given the opposite arrangement. Intuitively, if b c is in the region of the bitstring allocated to parameter φ k , the first child will have angles identical to the first parent before φ k and angles identical to the second parent after φ k . Again, the second child has the opposite arrangement. The effect on parameter φ k is more difficult to describe. Finally, after crossover is complete, each bit in each child's bitstring (chromosome) is then flipped (mutated) with probability P m . Mutation is useful for letting the algorithm explore candidate solutions that may not be accessible through crossover alone.
Evolutionary strategies are highly parallelizable, robust and relatively inexpensive [37]. Both Bayesian optimization and evolutionary strategies are good candidates for optimizing quantum heuristics and are used here.

Meta-learning on quantum circuits
The meta-learner used in this work is an LSTM, shown unrolled in time in Figure 1. Unrolling is the process of iteratively updating the inputs, x, cell state and hidden state, referred to together as s, of the LSTM. Inputs to the model were the gradients of the cost function w.r.t. the parameters, preprocessed by methods outlined in the original work [9]. At each time-step they are where r is a scaling parameter, here set to 10, following standard practice [9,11]. The terms ∇ H t are the gradients of the expectation value of the Hamiltonian at timestep t, with respect to the parameters φ t . This preprocessing handles potentially exponentially-large gradient values whilst maintaining sign information. Explicitly, the meta-learner used here is a local optimizer. At some point φ t in the parameter-space, where t is the time-step of the optimization, the gradients x t are computed and passed to the LSTM as input. The LSTM outputs an update ∆ φ t , and the new point in the parameter space is given by φ t+1 = φ t +∆ φ t . It is possible to use these models for derivative-free optimization [12], however, given that the gradient evaluations can be efficiently performed on a quantum computer, scaling linearly with the number of gates, and that the optimizers usually perform better with access to gradients, we use architectures here that exploit this information. In Reference [33] the authors show that the gradients of the cost function of parameterized quantum circuits may be exponentially small as a function of the number of qubits, the result of a phenomena called the concentration of quantum observables. In cases where this concentration is an issue, there may be strategies to mitigate this effect [64], though it is not an issue in the small problem sizes used here. Though only one model (a set of weights and biases) defines the meta-learner, it was applied in a 'coordinatewise' way: For each parameter a different cell state and hidden state of the LSTM are maintained throughout the optimization. Notably, this means that the size of the meta-learning model is only indirectly dependent on the number of parameters in the problem. We used a gradient-based approach, exploiting the parameter-shift rule [65] for computing the gradients of the loss function with respect to the parameters. These were used at both training and test time.
All model training requires some loss function. We chose the summed losses, where E f is the expectation over all training instances f and T is a time-horizon (the number of steps the LSTM is unrolled before losses from the time-steps t < T are accumulated, backpropagated, and the model parameters updated). The hyperparameters ω t are included, though are set to ω t = 1 for all t in these training runs. This can be adjusted to weigh finding optimal solutions later in the optimization more favourably, a practice for balancing exploitation and exploration. In situations where exploration is more important, other loss functions can be used, such as the expected improvement or observed improvement [12]. However, in this instance we chose a loss function to rapidly converge, meaning fewer calls to the QPU. This has the effect of converging to local minima in some cases, though we found that this loss function performed better than the other gradient-based optimizer (L-BFGS-B) for these problems.
D. Problems

Fermi-Hubbard Model
Hubbard Hamiltonians have a simple form, as follows: where a † i,σ , a i,σ are creation and annihilation operators, respectively, of a particle at site i with spin σ. In this model there is a hopping term t, a many body interaction term U and an onsite chemical potential term µ. This model gained importance as being a possible candidate Hamiltonian to describe superconductivity in cuprate materials. However, recent numerical studies have shown that there are some significant differences between the model and what is seen in experiments, such as the periodicity of charged stripes that the model supports [66][67][68]. However, the model is quite interesting itself, with many different phases of interest. The model is also quite difficult to solve, especially when going to large lattice sizes and large values of U/t. This has lead to many studies and much method development on classical computers, and is still widely researched today.
For VQE we look for the ground-state of the simplified spinless three-site Fermi-Hubbard model with unequal coupling strengths t ij ∈ [−2, 2] and U = µ = 0, Figure 6. The Hamiltonian of this model can be mapped through the Jordan-Wigner transformation [69] to the qubit Hamiltonian Based on the results of [70,71], we use a circuit composed of 3 blocks. Each block consists of three single qubit rotations R Z (α)R Y (β)R Z (γ) applied to all qubits, followed by entangling CNOT gates acting on qubits (1,2) and (2,3), where the first entry is the control qubit and the second is the target.

MAX-2-SAT
Given a Boolean formula on n variables in conjunctive normal form (i.e. the AND of a number of disjunctive two-variable OR clauses), MAX-SAT is the NP-hard problem of determining the maximum number of clauses which may be simultaneously satisfied. The best classical efficient algorithm known achieves only a constant factor approximation in the worst case, as deciding whether a solution exists that obtains better than a particular constant factor is NP-complete [44]. For MAX-2-SAT, where each clause consists of two literals, the number of satisfied clauses can be expressed as wherex i in each clause represents the binary variable x i or its negation, and E is the set of clauses. We use an n-qubit problem encoding where the jth qubit logical states |0 j , |1 j encode the possible values of each x j . Transforming to Ising spin variables [72] and substituting with Pauli-Z matrices leads to the cost Hamiltonian which is minimized when the number of satisfied clauses is maximized. The sign factors +1 or −1 in C correspond to whether each clause contains x i or its negation, respectively. Note that C and C are not equivalent; C gives a maximisation problem, while C gives a minimization problem, with the same set of solutions. For our QAOA implementation of MAX-2-SAT we use the original [20] initial state |s = 1 √ 2 n x |x , phase operator U P ( C, γ) = exp −iγ C , and mixing operator . The example instances we consider below have n = 8 qubits, 8 clauses, and QAOA circuit depth p = 3.

Graph Bisection
Given a graph with an even number of nodes, the Graph Bisection problem is to partition the nodes into two sets of equal size such that the number of edges across the two sets is minimized. The best classical efficient algorithm known for this problem provably yields only a log-factor worst-case approximation ratio [73]. Both this problem and its maximization variant are NP-hard [44].
For an n-node graph with edge set E we encode the possible node partitions with n binary variables, where x j encodes the placement of the jth vertex. In this encoding, from the problem constraints the set of feasible solutions is encoded by strings x of Hamming weight n/2. The cost function to minimize can be expressed as under the condition n j=1 x j = n/2. Transforming again to Ising variables gives the cost Hamiltonian A mapping to QAOA for this problem was given in [21, App. A.3.2] from which we derive our construction. We again encode possible partitions x with the n-qubit computational basis states |x . For each problem instance we uniformly at random select a string y of Hamming weight n/2 and use the feasible initial state |y . The phase operator U P ( C, γ) = exp −iγ C is constructed in the usual way from the cost Hamiltonian. For the mixing operator we employ a special case of the XY -mixer proposed in [21]. This class of mixers affect state transitions only between states of the same Hamming weight, which will importantly restrict the quantum state evolution to the feasible subspace. For each node j = 1, . . . , n, we define the XY partial mixer with σ (n+1) := σ (1) . We define the overall mixer to be the ordered product U M (β) = U n (β) . . . U 2 (β)U 1 (β). Observe that as each partial mixer preserves feasibility, so does U M (β), and so QAOA will only output feasible solution samples. We consider problem instances with n = 8 qubits, 8 edges, and QAOA circuit depth p = 3.

A. Metrics
Here, we outline two metrics used to evaluate and compare the optimizers. The first metric used is the gain, G, to the minimum, where E f is the expectation value over all instances f , f F is the converged cost of the optimizer, f I is the initial cost (determined by the initial parameters) and f min is the ground-state energy.
where f max is the maximum possible energy. This metric gives a sense of the closeness to the global minima, as a percentage of the extent.

B. Configuring Optimizers
We evaluated the optimizers on 20 problems from 5 random initializations each, to increase the probability of reaching the ground-state by all optimizers. The initializations were kept the same between the local optimizers (L-BFGF-B, Nelder-Mead and meta-learner). Global optimizers used 5 different random initializations for each problem (evolutionary strategies and Bayesian optimization). L-BFGS-B and Nelder-Mead were implemented using Scipy [74], where the gradients for L-BFGS-B were computed by analytic means and quantum circuit simulation. These optimizers were left to converge at later iterations and not terminated at 100. Bayesian optimization was implemented with GPyOpt [75], and terminated at convergence. We implemented and configured the evolutionary strategies methods in-house. For all tests, a small population size of 20 was used to limit the number of calls to the simulator (sizes on the order of 100 are typical and may improve performance). Both MAX-2-SAT and Graph Bisection problems with QAOA used m = 60 bits to represent parameters. VQE simulations had more parameters to optimize, so m = 297 bits were used for these problems. All tests used a probability of crossover of P c = 0.9, and a probability of mutation of P m = 0.01. Wave Function, Sampling and Noisy simulations, defined in Section III. Optimizers: Evolutionary strategies (blue), Bayesian optimization (orange), Nelder-Mead (green), L-BFGS-B (red), meta-learner (purple). x-axis: Shared within a column, QPU iteration is number of times H has been evaluated. y-axis: Shared within a row, G, the gain, is the value computed by Equation (9), and represents the average progress toward the minimum from the initial evaluation of H . We recognise that this comparison is not apples to apples: L-BFGS-B and the meta-learner have access to the gradient, and make numerous calls to auxiliary quantum circuits (simulated in the same environment as the expectation value evaluation circuits) to compute the gradients. The number of calls to evaluate gradients of parameters is Ng = 2m, where m is the number of parameterized gates in the circuit. This is discussed further in Section V. Error bars are the standard error on the mean, σ f / √ n where n is the number of examples and σ f the standard deviation of the performance of the optimizers. The oscillatory behaviour shown by evolutionary strategies is a feature of the algorithm, and discussed in Section V. Note that negative values of G are observed, corresponding to on average performing worse than the initial evaluation.

C. Training the meta-learner
For the MAX-2-SAT and Graph Bisection problems the model was trained on just 200 problems, whereas in the case of optimizing Fermi-Hubbard models the metalearning model quickly converged and training was truncated at 100 problems. The loss function is given in Equation (2), where values ω t = 1 ∀ t are used. For the preprocessing of the gradients, the hyperparameter r in Equation (1) is set to 10. For all training an Adam optimizer [76] was used with a learning rate of 0.003, β 0 = 0.9, β 1 = 0.999, = 1.0 −8 and zero weight decay. These training schedules were consistent across simulation type (Wave Function, Sampling and Noisy). We included a 'curriculum' method, implemented in [12], whereby the time-horizon of the meta-learner is extended if an optimization ended in a near-optimal solution it was counted, regardless of whether it was found in a previous optimization. We found that if one optimizer performed well in one task, it performed well, relative to the other optimizers, in another (by this metric), so each bubble is not divided into each problem class. The right bar plot represents the summation across optimizers within a simulation type. The bottom bar plot represents the summation within an optimizer across simulation types. (N -Noisy, S -Sampling, W -Wave Function) slowly throughout the training cycle. This was started at 3 iterations and capped at 10, at the end of the training cycle. Optimization was terminated if it converged before the 100 iterations, under standard convergence criteria. Overall, 9 models were trained (3 simulation environments x 3 problem classes). Figure 7 shows the performance of the optimizers measured by the gain metric in the three simulation environments. The gain metric converges in the same sense as an optimizer converging on one problem instance, this is as expected given it is an average over many problem instances. A value close to 1 is desirable, indicating the ability of an optimizer to progress to the global minima from a starting point. Figure 8 shows the total number of near-optimal solutions found by each optimizer. We define near-optimal as finding a solution within 2% of the global optima computed by Equation (10). The closest comparable competitor to the meta-learner in these plots is L-BFGS-B, given both optimizers had access to the gradients. This is reflected in their performance, particularly in Figure 7.  9. This plot shows G, Equation (9), over the long timescale. This graph contains the same data as the subplot Graph Bisection, Noisy, in Figure 9. The only difference is for evolutionary strategies: Only the best individuals from each generation were plotted (i.e. the individuals every 20 iterations). In the cases that the optimization was terminated at QPU iteration < 1000, the final value of G was used to extend the data.

V. DISCUSSION & RESULTS
It is important to recognize that the comparison in Figure 7 has limited scope: The x-axis (QPU iteration) represents evaluations of H . The gradient-based optimizers (meta-learner and L-BFGS-B) evaluate auxiliary quantum circuits many times in order to compute the gradients. In order to highlight this, we have plotted the case of the worst performing meta-learner (Graph Bisection -Noisy) in Figure 9, where evolutionary strategies outperforms over the long timescale, though most optimizers are heavily damped by the parameter setting noise. Recognizing there are always limitations to comparing optimization methods, we draw conservative conclusions.

A. General Performance
Additionally to meta-learning functioning as an optimizer in variational quantum algorithms, we find competitive performance of this meta-learning algorithm, at small instance size, over a range of problem classes, using the gain metric G defined in Equation (9); see Figure 7.
The metric G was used to evaluate and compare the optimizers, though this value can hide significant features. For example, an optimizer that finds good (but not optimal) solutions frequently will perform better than an optimizer that finds bad solutions frequently and optimal solutions infrequently. There are other cases that the reader may have in mind. This particular example is addressed in Figure 8. The number of times the optimizer comes within 2% of the ground-state (across all problems), as calculated by Equation (10), is counted. We observe an expected reduction in performance as noise is increased; this is discussed further in the subsection below.

B. Noise
As expected, there is a reduction in performance for all optimizers as 'noise' increases: Performance is worse in Sampling than in Wave Function and is worse in Noisy than in Sampling. What is notable is that the meta-learner is more resilient to this increase in noise than other methods. For example, in Fermi-Hubbard model problems, L-BFGS-B performance reduces by 0.35 whereas the meta-learner only reduces by 0.2, from around the same starting point (Fermi-Hubbard models column, Figure 7). This pattern is repeated across problem classes, to varying degrees. We believe this is a promising sign that meta-learning will be especially useful in noisy near-term quantum heuristics implemented on hardware. In the case of simulation, we believe this resistance can be explained by the optimizer knowing how to find generally good parameters, having learned from noisy systems already. This needs to be distinguished from another potential benefit of these algorithms, where the models learn how to optimize in the presence of hardware-specific traits. In the latter case, the meta-learner may learn a model that accounts for hardware specific noise. Further, in Figure 8 we see a reduction in performance, measured by the total number of near-optimal solutions, for all optimizers. However, this effect is least apparent in the global optimizers (Bayesian optimization and evolutionary strategies) and the meta-learner. Additionally, the meta-learner finds significantly more near-optimal solutions (80) for Noisy simulation than the next best optimizer (evolutionary strategies -17). These are promising results on the potential use cases of these optimizers in hybrid algorithms implemented on noisy quantum hardware.

C. Evolutionary Strategies
Evolutionary strategies exhibit an oscillatory behavior when gain to global optima versus function call is plotted as shown in Figure 7. The first generation corresponds to a random search, then the fittest individual (i.e. best solution) found in the previous generation is evaluated first in the next generation. Hence, we observe a spike in performance every 21 evaluations (the size of the population plus the fittest individual). As we plot calls to the QPU on the x-axes of Figure 7, the performance of evolutionary strategies are inaccurately represented. We plot a long-time optimization in the worst case for the meta-learner (Graph Bisection -Noisy). In this case, we see G tend to a significantly higher value. Other analysis including different comparison metrics will be needed to determine the respective use cases of meta-learners vs evolutionary strategies. Indeed, while Figure 9 suggests that evolutionary strategies perform well for particularly hard problems, preliminary results in Figure 8 indicate that the meta-learner tends to outperform evolutionary strategies when searching for a near-optimal solution.

D. Problems and algorithms
The Fermi-Hubbard models were the simplest to solve (they are small problems confined to parameter values [-2,2]). This is reflected in the performance of the gradientbased optimizers. The global optimizers underperform. This is most likely a result of the size of the parameter space: Though the problem size (in terms of the number of variables) is smaller, there are significantly more parameters in this implementations we have considered of VQE (24) than QAOA (6).
Of the two classical optimization problems we consider, the Graph Bisection problem is harder than MAX-2-SAT, in the sense of worse classical approximability. While MAX-2-SAT can be approximated up to a constant factor, the best classical efficient algorithms known for Graph Bisection perform worse with increasing problem size [44,48]. This contrast appears in the performance of all optimizers: In general, every optimizer performs worse in Graph Bisection than in MAX-2-SAT by the gain metric. Bayesian optimization is the exception, showing significantly more robustness.

VI. CONCLUSION
In this work we compared the performance of a range of optimizers (L-BFGS-B, Nelder-Mead, Bayesian optimization, evolutionary strategies and a meta-learner) across problem classes (MAX-2-SAT, Graph Bisection and Fermi-Hubbard Models) of quantum heuristics (QAOA and VQE) in three simulation environments (Wave Function, Sampling and Noisy). We highlight two observations. The first is that the meta-learner outperforms L-BFGS-B (the closest comparable competitor) in most cases, when measured by an average percent gain metric G. Secondly, the meta-learner performs better than all optimizers in the Noisy environment, measured by a total number of near-optimal solutions metric D. We conclude that these are promising results for the future applications of these tools to optimizing quantum heuristics, because these tools need to be robust to noise and we are often looking for near-optimal solutions.
During the production of this work a related preprint [39] was posted online. In that preprint, the authors consider only gradient-free implementations of metalearners. Their training set is orders of magnitude larger, as the meta-learner is learning to optimize from more limited information. They make similar conclusions regarding the potential of these methods and suggest using them as an initialization strategy. We broadly agree with the conclusions reached therein.
The meta-learning methods evaluated here are relatively new and are expected to continue to improve in design and performance [54]. There are several paths forward, we highlight some here. Though there is no investigation into the scaling of meta-learner performance to larger problem sizes, this in part is limited by the inability to simulate large quantum systems quickly, and exacerbated by the further burden of computing the gradients.
It is an open question as to how meta-learners will perform with quantum heuristics applied to larger problem sizes. In a closely related vein, these methods will be explored on hardware implementations, for two reasons. The first is that quantum computing will soon be beyond the realm of reasonable simulation times, and testing these algorithms on systems with higher number of variables will have to be done on hardware. The second is that these meta-learners may be able to learn hardware-specific features. For example, in this work the meta-learner is a single model applied to different parameters. This approach is called 'coordinatewise'. If instead applied in a 'qubitwise' fashion, where different models are trained for parameters corresponding to each qubit in a given hardware graph, there may be local variability in the physics of each qubit that the meta-learner accounts for in its model and optimization.
In terms of further investigations into the specifics of the problems and quantum heuristics considered, we emphasize that our QAOA implementation of Graph Bisection used a different type of mixer and initial state than MAX-2-SAT. An important question to answer is to what degree the differences in performance we observed between MAX-2-SAT and Graph Bisection are due to the change of mixer and initial state, as opposed to the change of problem structure. Additional possible mixer variants and initial states for Graph Bisection are suggested in [21], which we expect to further affect QAOA performance, and hence also affect the performance of our parameter optimization approaches. An important open area of research is to better characterize the relative power of different QAOA mixers and the inherent tradeoffs in terms of performance, resource requirements, and the difficulty of finding good algorithm parameters. In this direction, recent work [49] has demonstrated that superposition states may perform better than computational basis states as QAOA initial states.
Finally, heuristics play a prominent role in solving realworld problems: They provide practical solutions -not necessarily optimal -for complex problems (where an optimal solution is prohibitively expensive), with reasonable amount of resources (time, memory etc.). Therefore, we see significant potential for applications of quantum heuristics, implemented not only on near-term quantum devices -especially for variational quantum algorithms -but also for hybrid computing in fault-tolerant architectures. Thus it is imperative to characterize the classical components, such as the meta-learner, that learn properties of quantum devices towards the deployment of effective quantum heuristics for important practical applications.