Simulation environments
We compare optimization methods in “Wave Function,” “Sampling,” and “Noisy” simulation environments. The Wave Function case is an exact wave function simulation. For Sampling, the simulation emulates sampling from a hardware-implemented quantum circuit, where the variance of the expectation value evaluations depends on the number of samples taken from the device. In these experiments, we set the number of shots (samples from the device) to 1024.
Lastly, in the Noisy case, we model only parameter-setting noise in an exact wave function simulation, a coherent imperfection that still results in a pure state. We assume exact (up to numerical precision) computation of the expectation value, via some theoretical quantum computer that can compute the expectation value of a Hamiltonian given a state to arbitrary precision. Then, for each single-qubit rotation gate, we add normally distributed noise with standard deviation σ = 0.1 to the parameters at each optimization step. To determine σ, we evaluate the relationship between the noise σ and the fidelity of an arbitrary rotation around the Bloch sphere, composed of three single-qubit Pauli rotation gates RZ(α)RY(β)RZ(γ) (see Fig. 4). Assuming industry-standard single-qubit gate fidelities of 99% (Krantz et al. 2019), a value of σ = 0.1 is obtained, resulting in the noise in rotations illustrated in Fig. 5. All simulations were performed with Rigetti Forest (2019) simulators, with circuit simulations run on an Intel(R) Core(TM) i7-8750H CPU with 6 cores.
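As a concrete illustration, the parameter-setting noise can be emulated with a few lines of NumPy; this is a minimal sketch with names of our own choosing, not the simulation code used in the experiments.

```python
import numpy as np

SIGMA = 0.1  # standard deviation of the per-parameter noise, as in the text

def perturb_parameters(params, rng=np.random.default_rng()):
    """Return a noisy copy of the single-qubit rotation angles.

    Each angle receives independent Gaussian noise, so the resulting state
    is still pure (a coherent imperfection rather than decoherence).
    """
    params = np.asarray(params, dtype=float)
    return params + rng.normal(0.0, SIGMA, size=params.shape)

# At every optimization step the optimizer proposes `params`, while the
# simulated device executes the circuit with `perturb_parameters(params)`.
```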
Optimizers
Local optimizers
Nelder-Mead and L-BFGS-B are standard local optimizers, gradient-free and gradient-based, respectively (Guerreschi and Smelyanskiy 2017; Wecker et al. 2015, 2016; Nannicini 2019). Local optimizers maintain a notion of location in the solution space and search for candidate solutions from that location; they are usually fast but susceptible to finding local minima. L-BFGS-B has access to the gradients and, of all the optimizers chosen, is the closest to the meta-learner in terms of the information available to the optimizer and the computational burden (i.e., the cost of computing the gradients). Nelder-Mead was chosen because it appears throughout the literature (Peruzzo et al. 2014; Guerreschi and Smelyanskiy 2017; Verdon et al. 2017; Romero et al. 2018) and provides a widely recognized benchmark.
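For reference, both local optimizers are available through SciPy's minimize interface; the sketch below uses placeholder cost and gradient functions in place of the circuit-evaluation routines, and does not reproduce the exact settings used in our experiments.

```python
import numpy as np
from scipy.optimize import minimize

def expectation(params):
    # Placeholder for the expectation value <H> evaluated on the simulator.
    return float(np.sum(np.cos(params)))

def gradient(params):
    # Placeholder for the parameter-shift gradients of <H>.
    return -np.sin(params)

x0 = np.random.uniform(-np.pi / 2, np.pi / 2, size=6)  # initial angles

res_nm = minimize(expectation, x0, method="Nelder-Mead")             # gradient-free
res_lb = minimize(expectation, x0, jac=gradient, method="L-BFGS-B")  # gradient-based
```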
Evolutionary strategies
Evolutionary strategies are a class of global black-box optimization techniques: a population of candidate solutions (individuals) is maintained and evaluated based on some cost function. Genetic algorithms and evolutionary strategies have been used for decades. More recent work has shown these techniques to be competitive in problems of reinforcement learning (Vidnerová and Neruda 2017; Salimans et al. 2017).
All implementations of evolutionary strategies are population-based optimizers. In the initial iteration, the process amounts to a random search. In each iteration, solutions with lower costs are more likely to be selected as parents (though all solutions have a nonzero probability of selection). Different methods for selecting parents exist, but we used binary tournament selection, in which two pairs of individuals are selected, and the individual with the lowest cost from each pair is chosen to be a parent.
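A minimal sketch of binary tournament selection follows; the helper name and example costs are illustrative.

```python
import random

def tournament_select(costs):
    """Pick one parent index: sample two individuals, keep the lower-cost one."""
    i, j = random.sample(range(len(costs)), 2)
    return i if costs[i] <= costs[j] else j

costs = [0.8, 0.3, 1.2, 0.5]            # example population costs
parent_a = tournament_select(costs)     # two tournaments give the two parents
parent_b = tournament_select(costs)
```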
In more precise terms, parents are the candidate solutions selected to participate in crossover. Crossover takes two parent solutions and produces two child solutions by randomly exchanging part of the bitstring defining the first parent with the corresponding part of the second. Each child replaces its parent in the population of candidate solutions. The process is then repeated: costs for each child are evaluated, and these children serve as parents for the next iteration (Beasley et al. 1993). In our case, the bitstring is divided into n subsections, where n is the number of parameters passed to the quantum heuristic. Each subsection is converted to an integer using Gray encoding and then interpolated into a real value in the range [−π/2, π/2]. Gray codes are used as they avoid the Hamming walls found in more standard binary encodings (Charbonneau 2002).
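The decoding from a Gray-coded chromosome subsection to a circuit angle could look as follows; this is a sketch under our own naming, with the subsection length chosen arbitrarily for the example.

```python
import numpy as np

def gray_to_int(bits):
    """Convert a Gray-coded bit sequence (most significant bit first) to an integer."""
    out, prev = 0, 0
    for g in bits:
        prev ^= g                     # binary bit = previous binary bit XOR Gray bit
        out = (out << 1) | prev
    return out

def bits_to_angle(bits):
    """Interpolate the decoded integer into the range [-pi/2, pi/2]."""
    max_int = 2 ** len(bits) - 1
    return -np.pi / 2 + np.pi * gray_to_int(bits) / max_int

angle = bits_to_angle([1, 0, 1, 1, 0, 0, 1, 0])  # one 8-bit subsection -> one angle
```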
It is these bitstrings that the genetic algorithm operates on. When two individuals are selected to reproduce, a random crossover point bc is selected with probability Pc. Two children are generated: one takes the bits to the left of bc from the first parent and the bits to the right of bc from the second parent; the other child is given the opposite arrangement. Intuitively, if bc falls in the region of the bitstring allocated to parameter ϕk, the first child will have angles identical to the first parent before ϕk and angles identical to the second parent after ϕk; again, the second child has the opposite arrangement. The effect on parameter ϕk itself is more difficult to describe, since its bits are drawn from both parents. Finally, after crossover is complete, each bit in each child’s bitstring (chromosome) is flipped (mutated) with probability Pm. Mutation lets the algorithm explore candidate solutions that may not be accessible through crossover alone.
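Single-point crossover and bit-flip mutation can be sketched as below; the probabilities P_C and P_M are illustrative values, and we read the text as applying crossover with probability Pc.

```python
import random

P_C, P_M = 0.9, 0.01  # illustrative crossover and mutation probabilities

def crossover(parent_a, parent_b):
    """With probability P_C, swap the tails of the two parent bitstrings at a random point."""
    if random.random() < P_C:
        bc = random.randrange(1, len(parent_a))   # crossover point
        return (parent_a[:bc] + parent_b[bc:],
                parent_b[:bc] + parent_a[bc:])
    return parent_a[:], parent_b[:]

def mutate(child):
    """Flip each bit independently with probability P_M."""
    return [b ^ 1 if random.random() < P_M else b for b in child]
```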
Evolutionary strategies are highly parallelizable, robust, and relatively inexpensive (Salimans et al. 2017) making them a good candidate for the optimization of quantum heuristics.
Meta-learning on quantum circuits
The meta-learner used in this work is an LSTM, shown unrolled in time in Fig. 1. Unrolling is the process of iteratively updating the inputs, x, cell state, and hidden state, referred to together as s, of the LSTM. Inputs to the model were the gradients of the cost function w.r.t. the parameters, preprocessed by methods outlined in the original work (Andrychowicz et al. 2016). At each time-step, they are
$$ x^{t} = \begin{cases} \left( \frac{\log\left(\left|\nabla \langle H \rangle^{t}\right|\right)}{r},\text{sign}(\nabla \langle H \rangle^{t})\right) & \text{if}\ \left|\nabla \langle H \rangle^{t}\right|\geq e^{-r} \\ \left(-1, e^{r}\nabla \langle H \rangle^{t}\right) & \text{otherwise} \end{cases} $$
(1)
where r is a scaling parameter, here set to 10, following standard practice (Andrychowicz et al. 2016; Ravi and Larochelle 2016). The terms ∇〈H〉t are the gradients of the expectation value of the Hamiltonian at time-step t, with respect to the parameters ϕt. This preprocessing handles potentially exponentially large gradient values while maintaining sign information. Explicitly, the meta-learner used here is a local optimizer. At some point ϕt in the parameter space, where t is the time-step of the optimization, the gradients xt are computed and passed to the LSTM as input. The LSTM outputs an update Δϕt, and the new point in the parameter space is given by ϕt+1 = ϕt + Δϕt. It is possible to use these models for derivative-free optimization (Chen et al. 2017); however, given that the gradient evaluations can be performed efficiently on a quantum computer, scaling linearly with the number of gates, and that the optimizers usually perform better with access to gradients, we use architectures here that exploit this information. In McClean et al. (2018), the authors show that the gradients of the cost function of parameterized quantum circuits may be exponentially small as a function of the number of qubits, the result of a phenomenon called the concentration of quantum observables. In cases where this concentration is an issue, there may be strategies to mitigate this effect (Grant et al. 2019), though it is not an issue for the small problem sizes used here.
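Equation (1) translates directly into code; the sketch below is our own rendering, with r = 10 as stated above.

```python
import numpy as np

R = 10.0  # scaling parameter r

def preprocess_gradients(grad):
    """Map raw gradients to the (magnitude, sign) pairs of Eq. (1)."""
    grad = np.asarray(grad, dtype=float)
    mag = np.abs(grad)
    large = mag >= np.exp(-R)
    first = np.where(large, np.log(np.maximum(mag, 1e-300)) / R, -1.0)
    second = np.where(large, np.sign(grad), np.exp(R) * grad)
    return np.stack([first, second], axis=-1)   # shape (n_params, 2)
```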
Though only one model (a set of weights and biases) defines the meta-learner, it was applied in a “coordinatewise” way: For each parameter a different cell state and hidden state of the LSTM are maintained throughout the optimization. Notably, this means that the size of the meta-learning model is only indirectly dependent on the number of parameters in the problem. We used a gradient-based approach, exploiting the parameter-shift rule (Schuld et al. 2019) for computing the gradients of the loss function with respect to the parameters. These were used at both training and test time.
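The parameter-shift rule admits a simple generic implementation; this is a sketch assuming gates of the form exp(−iθP/2), so the shift is π/2, and `expectation` again stands in for the circuit-evaluation routine.

```python
import numpy as np

def parameter_shift_gradient(expectation, params, shift=np.pi / 2):
    """Estimate d<H>/d(phi_k) by evaluating the circuit at shifted parameters."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for k in range(params.size):
        plus, minus = params.copy(), params.copy()
        plus[k] += shift
        minus[k] -= shift
        grad[k] = 0.5 * (expectation(plus) - expectation(minus))
    return grad
```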
All model training requires some loss function. We chose the summed losses,
$$ \mathcal{L}(\omega) = \mathbb{E}_{f} \left[ \sum\limits_{t=0}^{T} \omega_{t} f(\phi_{t}) \right], $$
(2)
where \(\mathbb {E}_{f}\) is the expectation over all training instances f and T is a time-horizon (the number of steps the LSTM is unrolled before the losses from time-steps t < T are accumulated and backpropagated, and the model parameters updated). The hyperparameters ωt are included, though they are set to ωt = 1 for all t in these training runs. They can be adjusted to weight finding optimal solutions later in the optimization more favorably, a practice for balancing exploitation and exploration. In situations where exploration is more important, other loss functions can be used, such as the expected improvement or observed improvement (Chen et al. 2017). However, in this instance, we chose a loss function that encourages rapid convergence, meaning fewer calls to the QPU. This has the effect of converging to local minima in some cases, though we found that this loss function performed better than the other gradient-based optimizer (L-BFGS-B) for these problems.
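With ωt = 1, the loss of Eq. (2) reduces to averaging the summed per-step costs over the training instances; a plain NumPy sketch (omitting the automatic differentiation through the LSTM) is:

```python
import numpy as np

def meta_loss(cost_histories, omega=None):
    """cost_histories: one list of per-step costs f(phi_t), t = 0..T, per training instance."""
    losses = []
    for history in cost_histories:
        w = np.ones(len(history)) if omega is None else np.asarray(omega[:len(history)])
        losses.append(np.sum(w * np.asarray(history)))
    return float(np.mean(losses))   # empirical estimate of E_f[ sum_t w_t f(phi_t) ]
```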
Problems
Free Fermions model
Hubbard Hamiltonians have a simple form, as follows:
$$ \begin{array}{@{}rcl@{}} H &=& - t {\sum}_{\langle i,j \rangle}{\sum}_{\sigma=\{\uparrow,\downarrow\}}\left(a^{\dagger}_{i, \sigma} a_{j, \sigma} + a^{\dagger}_{j, \sigma} a_{i, \sigma}\right) \\ && + U {\sum}_{i} a^{\dagger}_{i, \uparrow} a_{i, \uparrow} a^{\dagger}_{i, \downarrow} a_{i, \downarrow} - \mu {\sum}_{i} {\sum}_{\sigma=\{\uparrow,\downarrow\}} a^{\dagger}_{i, \sigma} a_{i, \sigma}, \end{array} $$
(3)
where \(a_{i,\sigma }^{\dag }, a_{i,\sigma }\) are creation and annihilation operators, respectively, of a particle at site i with spin σ. The model contains a hopping term t, a many-body interaction term U, and an on-site chemical potential term μ. It gained importance as a candidate Hamiltonian for describing superconductivity in cuprate materials. However, recent numerical studies have shown significant differences between the model and what is seen in experiments, such as the periodicity of the charge stripes that the model supports (LeBlanc et al. 2015; Schulz 1993; Huang et al. 2017). Nevertheless, the model is of considerable interest in its own right, exhibiting many different phases. It is also difficult to solve, especially at large lattice sizes and large values of U/t, which has driven many studies and much method development on classical computers; it remains widely researched today.
For VQE, we look for the ground state of the simplified spinless three-site Free Fermions model with unequal coupling strengths tij ∈ [−2, 2] and U = μ = 0 (Fig. 6). The Hamiltonian of this model can be mapped through the Jordan-Wigner transformation (Jordan and Wigner 1928) to the qubit Hamiltonian
$$ \begin{array}{@{}rcl@{}} H_{FH} &=& \frac{1}{2}\left( t_{12}\hat{X}_{1}\hat{X}_{2} + t_{12}\hat{Y}_{1}\hat{Y}_{2} +t_{23} \hat{X}_{2}\hat{X}_{3}\right. \\&&\left.+ t_{23}\hat{Y}_{2} \hat{Y}_{3} + t_{13}\hat{X}_{1}\hat{Z}_{2}\hat{X}_{3} + t_{13}\hat{Y}_{1}\hat{Z}_{2} \hat{Y}_{3}\right) \end{array} $$
(4)
where \(\hat {X}\), \(\hat {Y}\), and \(\hat {Z}\) are the Pauli-X, Pauli-Y, and Pauli-Z matrices, respectively. Based on the results of Woitzik (2018, 2020), we use a circuit composed of 3 blocks. Each block consists of three single-qubit rotations RZ(α)RY(β)RZ(γ) applied to all qubits, followed by entangling CNOT gates acting on qubits (1,2) and (2,3), where the first entry is the control qubit and the second is the target.
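Since the problem acts on only three qubits, the qubit Hamiltonian of Eq. (4) can be written out as an explicit 8×8 matrix and diagonalized exactly, which is useful for benchmarking the optimizers; the coupling values below are illustrative, and the qubit-ordering convention is our own.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron_all(*ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

def h_fh(t12, t23, t13):
    """Qubit Hamiltonian of Eq. (4); qubit 1 is the leftmost tensor factor."""
    return 0.5 * (t12 * (kron_all(X, X, I2) + kron_all(Y, Y, I2))
                  + t23 * (kron_all(I2, X, X) + kron_all(I2, Y, Y))
                  + t13 * (kron_all(X, Z, X) + kron_all(Y, Z, Y)))

exact_ground_energy = np.linalg.eigvalsh(h_fh(1.3, -0.7, 0.4)).min()
```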
MAX-2-SAT
Given a Boolean formula on n variables in conjunctive normal form (i.e., the AND of a number of disjunctive two-variable OR clauses), MAX-SAT is the NP-hard problem of determining the maximum number of clauses that can be simultaneously satisfied. The best efficient classical algorithm known achieves only a constant-factor approximation in the worst case, as deciding whether a solution exists that does better than a particular constant factor is NP-complete (Papadimitriou 1994). For MAX-2-SAT, where each clause consists of two literals, the number of satisfied clauses can be expressed as
$$ C = \sum\limits_{(i,j)\in E} \tilde{x}_{i} \lor \tilde{x}_{j} $$
(5)
where \(\tilde{x}_{i}\) in each clause represents the binary variable xi or its negation, and E is the set of clauses. We use an n-qubit problem encoding where the j th qubit logical states |0〉j,|1〉j encode the possible values of each xj. Transforming to Ising spin variables (Hadfield 2018) and substituting with Pauli-Z matrices leads to the cost Hamiltonian
$$ \widehat{C} = {\sum}_{(i,j)\in E} \frac{1}{4} (1 \pm \hat{Z}^{(i)})(1\pm\hat{Z}^{(j)}) $$
(6)
which is minimized when the number of satisfied clauses is maximized. The sign factors + 1 or − 1 in \(\widehat {C}\) correspond to whether each clause contains xi or its negation, respectively. Note that C and \(\widehat {C}\) are not equivalent; C gives a maximization problem, while \(\widehat {C}\) gives a minimization problem, with the same set of solutions.
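Because \(\widehat{C}\) is diagonal in the computational basis, its values can be tabulated classically for small instances; the sketch below uses a clause encoding of our own choosing, with each literal given as a (variable index, negated?) pair.

```python
import itertools

def cost_diagonal(n, clauses):
    """Return {bitstring: number of unsatisfied clauses}, i.e., the diagonal of C-hat."""
    diag = {}
    for bits in itertools.product((0, 1), repeat=n):
        unsat = 0
        for (i, neg_i), (j, neg_j) in clauses:
            lit_i = bits[i] ^ int(neg_i)   # literal value after optional negation
            lit_j = bits[j] ^ int(neg_j)
            unsat += int(not (lit_i or lit_j))
        diag[bits] = unsat
    return diag

# Example: (x0 OR x1) AND (NOT x1 OR x2) on n = 3 variables.
diag = cost_diagonal(3, [((0, False), (1, False)), ((1, True), (2, False))])
```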
For our QAOA implementation of MAX-2-SAT we use the original (Farhi 2014) initial state \(|{s}\rangle =\tfrac 1{\sqrt {2^{n}}}{\sum }_{x} |x\rangle \), phase operator \(U_{P}(\widehat {C},\gamma )=\exp (-i\gamma \widehat {C})\), and mixing operator \(U_{M}(\beta )=\exp (-i\beta {\sum }_{j=1}^{n} \hat {X}^{(j)})\). The instances we consider below have n = 8 qubits, 8 clauses, and QAOA circuit depth p = 3. We further explore instances with n = 12 and p = 5 (Fig. 9).
Graph Bisection
Given a graph with an even number of nodes, the Graph Bisection problem is to partition the nodes into two sets of equal size such that the number of edges across the two sets is minimized. The best efficient classical algorithm known for this problem provably yields only a \(\log\)-factor worst-case approximation ratio (Krauthgamer and Feige 2006). Both this problem and its maximization variant are NP-hard (Papadimitriou 1994).
For an n-node graph with edge set E, we encode the possible node partitions with n binary variables, where xj encodes the placement of the j th vertex. In this encoding, the problem constraint implies that the set of feasible solutions is encoded by strings x of Hamming weight n/2. The cost function to minimize can be expressed as
$$ C = \sum\limits_{(i,j) \in E} \mathrm{XOR}(x_{i}, x_{j}) $$
(7)
under the condition \({\sum }^{n}_{j=1} x_{j} = n/2\). Transforming again to Ising variables gives the cost Hamiltonian
$$ \widehat{C} = \frac{1}{2} \sum\limits_{(i,j) \in E} (1 - \hat{Z}^{(i)}\hat{Z}^{(j)}). $$
(8)
A mapping of this problem to QAOA was given in Hadfield et al. (2019, App. A.3.2), from which we derive our construction. We again encode possible partitions x with the n-qubit computational basis states |x〉. For each problem instance we select, uniformly at random, a string y of Hamming weight n/2 and use the feasible initial state |y〉. The phase operator \(U_{P}(\widehat {C},\gamma )=\exp (-i\gamma \widehat {C})\) is constructed in the usual way from the cost Hamiltonian. For the mixing operator we employ a special case of the XY-mixer proposed in Hadfield et al. (2019). This class of mixers drives transitions only between states of the same Hamming weight, which importantly restricts the quantum state evolution to the feasible subspace. For each node \(j=1,\dots ,n\), we define the XY partial mixer
$$U_{j}(\beta)=\exp\left( -i\beta \left( \hat{X}^{(j)} \hat{X}^{(j+1)}+ \hat{Y}^{(j)} \hat{Y}^{(j+1)} \right) \right)$$
where operator indices are taken cyclically, i.e., \(\hat{X}^{(n+1)} := \hat{X}^{(1)}\) and \(\hat{Y}^{(n+1)} := \hat{Y}^{(1)}\). We define the overall mixer to be the ordered product \(U_{M}(\beta )= U_{n}(\beta ) \dots U_{2}(\beta )U_{1}(\beta )\). Observe that since each partial mixer preserves feasibility, so does UM(β), and hence QAOA will only output feasible solution samples. We consider problem instances with n = 8 qubits, 8 edges, and QAOA circuit depth p = 3.
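For small n, the partial mixers and the ordered product UM(β) can be constructed as explicit matrices; this sketch is exponential in n and intended only to illustrate the operator ordering, with 0-indexed qubits and cyclic indices.

```python
import numpy as np
from scipy.linalg import expm

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

def two_local(op_a, op_b, j, n):
    """Place op_a on qubit j and op_b on qubit (j + 1) mod n of an n-qubit register."""
    ops = [I2] * n
    ops[j % n], ops[(j + 1) % n] = op_a, op_b
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

def partial_mixer(beta, j, n):
    """U_j(beta) = exp(-i beta (X_j X_{j+1} + Y_j Y_{j+1}))."""
    h = two_local(X, X, j, n) + two_local(Y, Y, j, n)
    return expm(-1j * beta * h)

def full_mixer(beta, n):
    """Ordered product U_n(beta) ... U_2(beta) U_1(beta)."""
    u = np.eye(2 ** n, dtype=complex)
    for j in range(n):                 # U_1 applied first (rightmost factor)
        u = partial_mixer(beta, j, n) @ u
    return u
```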