Quantum Optimization for Training Quantum Neural Networks

Training quantum neural networks (QNNs) using gradient-based or gradient-free classical optimisation approaches is severely impacted by the presence of barren plateaus in the cost landscapes. In this paper, we devise a framework for leveraging quantum optimisation algorithms to find optimal parameters of QNNs for certain tasks. To achieve this, we coherently encode the cost function of QNNs onto relative phases of a superposition state in the Hilbert space of the network parameters. The parameters are tuned with an iterative quantum optimisation structure using adaptively selected Hamiltonians. The quantum mechanism of this framework exploits hidden structure in the QNN optimisation problem and hence is expected to provide beyond-Grover speed up, mitigating the barren plateau issue.

A video animation of the circuit construction is available at https://youtu.be/RVWkJZY6GNY. Note: (1) in all figures of this paper, we omit the minus signs in all time-evolution-like terms (i.e., exponentials of a Hamiltonian, e^{−iHt}) for the sake of brevity and space; (2) some quantum registers are not depicted in the figures due to space limitations.

A. Quantum Neural Networks
Quantum Neural Networks (QNNs) are considered a leading candidate for achieving a quantum advantage on noisy intermediate-scale quantum (NISQ) devices. A QNN consists of a set of parameterized quantum gates within a predefined circuit ansatz. The design of the ansatz, together with the values of the gate parameters, determines the outcome of the QNN. In order to successfully perform certain tasks, QNNs must be trained to find optimal parameters that generate the desired outcomes. In the majority of QNN research, the training is carried out by employing variational hybrid quantum-classical algorithms [1], in which the parameters are optimized by a classical optimizer using gradient-based or gradient-free approaches. In this paper, we achieve a scalable, maximally quantum pipeline for the applications of QNNs by replacing the classical optimizer with a quantum optimizer. In short, we employ quantum optimisation methods for training QNNs.
There are two main avenues for the application of QNNs. The first uses QNNs to generate quantum states that minimize the expectation value of a given Hamiltonian, as in Variational Quantum Eigensolvers (VQE) [2] for chemistry problems or Quantum Approximate Optimization Algorithms (QAOA) [3] for combinatorial optimization problems. The second uses QNNs as data-driven machine learning models to perform discriminative [4][5][6] and generative [7][8][9][10][11] tasks, for which QNNs could have more expressive power than their classical counterparts [12]. Though an ever-increasing amount of effort is being put into QNN research, there is evidence that they will be difficult to train due to flat optimisation landscapes called barren plateaus [13].
The barren plateau issue has spawned several studies on strategies to avoid them, including layerwise training [14], using local cost functions [15], correlating parameters [16], and pre-training [17], among others [18][19][20]. Such strategies give hope that variational quantum-classical algorithms may avoid the exponential scaling due to the barren plateau issue. However, it has been shown that these strategies do not avoid another type of barren plateau induced by hardware noise [21], and some strategies may lack theoretical grounding [22]. In addition to noise, there are also other sources of barren plateaus due to entanglement growth [23]. Moreover, it has been shown that gradient-free approaches are also adversely affected by barren plateaus [24].
Our work presents a new alternative to training QNNs with a maximally coherent (i.e., quantum) protocol.

B. Prior work
The results noted above indicate that training QNNs using classical optimisation methods faces unprecedented challenges as the system scales up. One therefore seeks to leverage alternative optimisation methods for training QNNs. Indeed, preliminary attempts have been made in this direction. Verdon et al. proposed a QAOA-like training protocol for QNNs [25], and Gilyén et al. developed a quantum algorithm for calculating gradients faster than classical methods [26]. In these two works, to cast the problem of training QNNs in the context of quantum optimisation, the network parameters of the QNN are quantized: they move from being classical values to being stored in quantum registers, in addition to those upon which the QNN performs its computation. The quantized parameters are used as control registers for the parameterized gates on the QNN registers. The parameters can now be in superposition, which one hopes allows a quantum-parallelism-type computation of the QNN with multiple parameter configurations.
In Ref. [25], the quantum training process can be described as state evolution in the joint Hilbert space of the parameter register and the QNN register. Their quantum training protocol consists of two alternating operations in a QAOA fashion: the first operation acts on both the parameter register and the QNN register to encode the cost function of the QNN onto a relative phase of the parameter state. The second operation acts only on the parameter register; it is a variant of the original QAOA mixer, tailored to the case where the parameters of the QNN are continuous variables. These two operations can be mathematically expressed as e^{−iγ_i C(θ)} and e^{−iβ_i H_M}, where θ are the parameters of the QNN, C(θ) is the cost function of the QNN, γ_i and β_i are tunable hyperparameters, and H_M is the mixer Hamiltonian. By heuristically tuning the hyperparameters, the quantum training is expected to home in on the optimal parameters of the QNN after several iterations of the QAOA alternating operations. We illustrate the alternating operations of their quantum training in Fig. 2.
Despite being the pioneering application of the QAOA method to training QNNs, the protocol in Ref. [25] has some limitations. In the phase encoding operation, the parameter register and the QNN register are in general entangled. This has the effect of causing phase decoherence in the parameter eigenbasis. To minimize the effect of this decoherence, the tunable hyperparameter γ_i must be sufficiently small; in other words, the phase encoding is coherent only to first order in γ_i. To overcome this limitation, that is, to enact the phase encoding operation with arbitrary hyperparameters, the phase encoding operation with a small hyperparameter ∆γ must be repeated an excessive number of times. This simulates the phase encoding operation with a large hyperparameter γ via e^{−iγC(θ)} = e^{−i∆γC(θ)} e^{−i∆γC(θ)} e^{−i∆γC(θ)} · · ·. These repetitions yield a large overhead in the complexity of the algorithm.
In Ref. [26], a phase oracle is designed that achieves the phase encoding coherently and efficiently. (Note that throughout this paper, the term "phase oracle" has a different meaning than in Ref. [26]; our phase oracle corresponds to the "fractional phase oracle" of Ref. [26].) Nevertheless, they did not utilise the phase encoding as a component of a QAOA routine to accomplish a fully quantum training algorithm for QNNs. Instead, they used the phase oracle as a component of a quantum evaluation of the gradient of a QNN, which serves gradient-based classical training of QNNs. This improvement, however, is not practically useful due to the barren plateau issue of QNNs.
In this paper, we devise an improved framework for training QNNs, taking advantage of the well-established components of Refs. [25] and [26] while eliminating their shortcomings. A schematic of our quantum training framework for QNNs is depicted in Fig. 3. More specifically, we replace the phase encoding operations in the QAOA-like protocol of Ref. [25] (as depicted in Fig. 2) with the phase oracle of Ref. [26]. For the mixers in the QAOA-like routine, we allow different mixers in each layer, similar to Ref. [27], including mixers that act across different parameters. This potentially leads to a dramatic shortening of the depth of the QAOA layers while significantly improving the quality of the solution (the optimal QNN parameters found by the QAOA routine). In Fig. 3, the color of each block represents the nature of the corresponding Hamiltonian: different colors correspond to different Hamiltonians (one can see that the cost Hamiltonian is the same throughout the training, whereas the mixer varies from layer to layer). The mixer pool contains the proper mixers tailored to our QNN training problem. These conventions also apply to the other circuit schematics in this paper.
By making the mixers flexible and adaptive to the specific optimisation problem, it becomes demanding to find an efficient way of determining the best sequence of mixers and the optimized hyperparameters. To address this, we adopt machine learning approaches (in particular, recurrent neural networks and reinforcement learning) as proposed in Refs. [17,[28][29][30]. The quantum mechanism of this framework is well suited to exploit hidden structure in the QNN optimisation problem, which could provide beyond-Grover speedup and mitigate the barren plateau issues in training QNNs.

C. Paper Outline
The remainder of this paper is organized as follows. In Section II we review some essential preliminaries, particularly the details of QAOA and its variants, from which we design a new variant of QAOA tailored to our QNN training problem. Section II C introduces a way of quantising the parameters of a QNN; that is, we show how to create a superposition of a QNN over multiple parameter configurations. In Section III we present quantum training by Grover adaptive search as a baseline prior to our quantum training framework using QAOA. In Section IV we present the details of our framework, including how to implement the phase oracle that achieves coherent phase encoding of the cost function of a QNN, which mixers to use in the QAOA routine, and the strategy for determining the mixer sequence and optimizing the hyperparameters. Section V presents the deployment potential of our quantum training in a variety of applications, including training VQE, learning a pure state, and training a quantum classifier. The final section summarises our work and provides an outlook for future work.

Zoo of Quantum Optimisation Algorithms
For completeness and context, we list some typical quantum optimisation algorithms in Table I, including the primitive ones (adiabatic, quantum walks, QAOA, Grover adaptive search), their hybridizations, and their variants. In this paper, for the training of QNNs, we focus on utilising QAOA and its variants as well as Grover adaptive search, which we review in the following subsections.

Primitives: Adiabatic, Quantum Walk, QAOA, Grover adaptive search.
Before that, however, some remarks on the fundamental differences between the adiabatic and QAOA protocols are in order. QAOA can be seen as a "trotterized" version of adiabatic evolution: the mixer Hamiltonian plays the role of the initial Hamiltonian in the analogous adiabatic algorithm, and the cost Hamiltonian that of the final Hamiltonian. However, short-depth QAOA is not really a digitized version of the adiabatic protocol, but rather an ad hoc ansatz. In Ref. [41] it is shown that QAOA can deterministically find the solution of specially constructed optimization problems in cases where quantum annealing fails. We emphasise that QAOA is an interference-based algorithm: non-target states interfere destructively while target states interfere constructively. In Fig. 4 we depict this interference process of QAOA.

QAOA and its variants
In this section, we review the original Quantum Approximate Optimization Algorithm (QAOA) proposed in Ref. [3] and its variants. Consider an unconstrained optimization problem on n-bit strings z = (z_1, z_2, ..., z_n), where z_i ∈ {−1, 1}. We seek the optimal bit string z that maximizes (or minimizes) a cost function C(z). Given the cost function C(z) of a problem instance, the algorithm is characterized by two Hamiltonians: the cost Hamiltonian H_C and the mixing Hamiltonian H_M. The cost Hamiltonian H_C encodes the cost function C(z) to be optimized and acts on n-qubit computational basis states as H_C |z⟩ = C(z) |z⟩. The mixing Hamiltonian H_M is chosen to be H_M = Σ_{j=1}^{n} X_j, where X_j is the Pauli-X operator acting on the jth qubit. The initial state is the even superposition of all possible solutions: |s⟩ = (1/√(2^n)) Σ_z |z⟩. The QAOA algorithm consists of alternating time evolution under the two Hamiltonians H_C and H_M for p rounds, where the durations in round j are specified by the parameters γ_j and β_j, respectively. After all p rounds, the state becomes |γ, β⟩ = e^{−iβ_p H_M} e^{−iγ_p H_C} · · · e^{−iβ_1 H_M} e^{−iγ_1 H_C} |s⟩. The alternating operations are illustrated in Fig. 5. Finally, a measurement in the computational basis is performed on the state. Repeating the above state preparation and measurement, the expected value of the cost function can be estimated from the samples produced by the measurements.
The above steps are then repeated altogether with updated sets of time parameters γ_1, ..., γ_p, β_1, ..., β_p. Typically, a classical optimization loop (such as gradient descent) is used to find the optimal parameters that maximize (or minimize) the expected value of the cost function ⟨C⟩. Measuring the resulting state at the optimal parameters then provides an approximate solution to the optimization problem.
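As a concrete illustration of the alternating evolutions above, the following statevector sketch runs a depth-p QAOA circuit for a small Ising-chain cost function. The three-qubit instance and the angles are hypothetical toy choices, not one of the paper's QNN problems; e^{−iγH_C} is a diagonal phase, and e^{−iβH_M} factorizes into single-qubit rotations.

```python
import numpy as np

# Hypothetical 3-qubit toy instance: minimize C(z) = z1*z2 + z2*z3 (Ising chain).
n = 3
dim = 2 ** n

def bits(i):
    # z_k in {-1, +1} read off from the binary digits of basis index i
    return [1 - 2 * ((i >> (n - 1 - k)) & 1) for k in range(n)]

C = np.array([b[0] * b[1] + b[1] * b[2] for b in map(bits, range(dim))], float)

X = np.array([[0, 1], [1, 0]], complex)

def mixer(beta):
    # e^{-i beta sum_j X_j} = tensor product of single-qubit e^{-i beta X}
    u1 = np.cos(beta) * np.eye(2) - 1j * np.sin(beta) * X
    U = np.array([[1]], complex)
    for _ in range(n):
        U = np.kron(U, u1)
    return U

def qaoa_state(gammas, betas):
    psi = np.full(dim, 1 / np.sqrt(dim), complex)    # |s>, even superposition
    for g, b in zip(gammas, betas):
        psi = np.exp(-1j * g * C) * psi              # e^{-i gamma H_C}, diagonal
        psi = mixer(b) @ psi                         # e^{-i beta H_M}
    return psi

def expected_cost(psi):
    return float(np.real(np.sum(np.abs(psi) ** 2 * C)))

# One QAOA layer with hand-picked angles; <C> would be fed to a classical loop.
psi = qaoa_state([0.4], [0.3])
print(expected_cost(psi))
```

In a full run, the printed expectation would be passed to the classical outer loop that updates (γ, β) before re-preparing the state.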
There has been a lot of recent progress on QAOA on both the experimental and theoretical fronts. There is evidence suggesting that QAOA may provide a significant quantum advantage over classical algorithms [42,43], and that it is computationally universal [44,45]. Despite these advances, QAOA has limitations. Its performance improves with circuit depth, but circuit depth is still limited in near-term quantum processors. Moreover, deeper circuits translate into more variational parameters, which introduces challenges for the classical optimizer minimizing the objective function. Ref. [46] shows that the locality and symmetry of QAOA can limit its performance. These issues can be attributed to the form of the QAOA ansatz. A short-depth ansatz that is further tailored to a given combinatorial problem could therefore address the issues with the standard QAOA ansatz. However, identifying such an alternative is a highly non-trivial problem given the vast space of possible ansatzes. Farhi et al. [47] allowed the mixer to rotate each qubit by a different angle about the x-axis and modified the cost Hamiltonian based on hardware connectivity. This modification was made primarily out of hardware-capability concerns, with the hope that superior-than-classical performance can be experimentally verified.
LH-QAOA. In Ref. [48], Hadfield et al. considered alternative mixers, including entangling ones on two qubits. The selection of mixers is based on the criterion of preserving the relevant subspace for the given combinatorial problem; they entitled the approach Local Hamiltonian-QAOA (LH-QAOA). We depict the quantum circuit schematic of LH-QAOA in Fig. 6.
FIG. 6. Quantum circuit schematic of the operations in LH-QAOA. The overall process of LH-QAOA is similar to that of the original QAOA in Fig. 5; the difference is that the mixers of LH-QAOA include entangling mixer Hamiltonians on two qubits. These are represented by the H_{M,i} blocks with various colors in the figure. Note that, in order to avoid an excessive number of hyperparameters, Hadfield et al. [48] choose the β_j for each H_{M,i} to be the same in every layer.
QDD. In Refs. [25,49], Verdon et al. adjusted the mixers for continuous optimization problems, in which the parameters to be optimized are continuous variables. In the original QAOA ansatz, the mixer consists of single-qubit X rotations applied to all qubits; these constitute an uncoupled sum of generators of shifts in the computational basis. Analogously, the appropriate mixers in the continuous case should shift the value of each digitized continuous variable stored in an independent register. They entitled the approach Quantum Dynamical Descent (QDD). We depict the quantum circuit schematic of QDD in Fig. 7.
ADAPT-QAOA. LH-QAOA and QDD showcase the potential of problem-tailored mixers, but do not provide a general strategy for choosing mixers for different optimization problems. In Ref. [27], Zhu et al. replaced the fixed mixer H_M by a set of different mixers A_k that change from layer to layer. They entitled this variation of QAOA ADAPT-QAOA. This adaptive approach can dramatically shorten the depth of the QAOA layers while significantly improving the quality of the solution. We depict the quantum circuit schematic of ADAPT-QAOA in Fig. 8. Compared to the original QAOA, allowing Y mixers and entangling mixers enables ADAPT-QAOA to dramatically improve algorithmic performance while achieving rapid convergence for problems with complex structure. The effect of the adaptive mechanism is illustrated in Fig. 9.

FIG. 7. Quantum circuit schematic of QDD. QDD solves optimization problems over continuous variables. In this figure, the θ_i are the continuous variables to be optimized in the training; each θ_i is digitized into binary form and stored in an independent register. The overall process of QDD is similar to that of the original QAOA; the difference is that the QDD mixer, with Hamiltonian S, acts on the registers of the θ_i (rather than on single qubits as in the original QAOA). The effect of the mixer in QDD is to shift the value of each θ_i.

In Fig. 9, whose panels depict trajectories in the parameter Hilbert space, ADAPT-QAOA takes far fewer iterations to reach a point closer to the target state. This illustrates that, compared to the original QAOA, allowing alternative mixers enables ADAPT-QAOA to dramatically improve algorithmic performance while achieving rapid convergence.
The advantage of this adaptive ansatz may come from the counter-diabatic (CD) driving mechanism. Numerical evidence shows that the adaptive mixer sequence chosen by the algorithm coincides with that of the "shortcut to adiabaticity" by CD driving [27]. Inspired by the CD driving procedure, another variant of QAOA, CD-QAOA [29], also uses an adaptive ansatz to achieve similar advantages. CD-QAOA is designed for preparing the ground state of quantum-chaotic many-body spin chains. By using terms occurring in the adiabatic gauge potential as additional control unitaries, CD-QAOA can achieve fast, high-fidelity many-body control.
Inspired by the above variants of QAOA, we design a new variant tailored to our QNN training problem. In our case, the parameters being optimized (the angles of rotation gates) are continuous (real) values, so the choice of mixer Hamiltonian has to be adapted accordingly (as in QDD). We also want to take advantage of alternative mixers and allow adaptive mixers in different layers (as in ADAPT-QAOA). Thus, the proper QAOA ansatz for our QNN training problem is an adaptive continuous version of QAOA, which we call AC-QAOA. We depict the quantum circuit schematic of AC-QAOA in Fig. 10.

Grover Adaptive Search
Grover's algorithm is generally used as a search method to find a set of desired solutions from a set of possible solutions. Dürr and Høyer presented an algorithm based on Grover's method that finds an element of minimum value inside an array of N elements using on the order of O(√N) queries to the oracle [50]. Baritompa et al. [51] applied Grover's algorithm to global optimization, which they call Grover Adaptive Search (GAS). GAS has been applied to training classical neural networks [52] and to polynomial binary optimization [53]. In the following we outline GAS. Consider a function f : X → R, where for ease of presentation we assume X = {0, 1}^n. We are interested in solving min_{x∈X} f(x). The main idea of GAS is to construct an "adaptive" oracle for a given threshold y such that it flags all states x ∈ X satisfying f(x) < y; namely, the oracle marks a solution x if and only if the Boolean function g_y satisfies g_y(x) = 1, where g_y(x) = 1 if f(x) < y and g_y(x) = 0 otherwise. The oracle O_Grover then acts as O_Grover |x⟩ = (−1)^{g_y(x)} |x⟩. We use Grover search to find a solution x with a function value better than y. Then we set y = f(x) and repeat until some formal termination criterion is met, for example based on the number of iterations, time, or progress in y.
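The adaptive threshold loop of GAS can be sketched with a statevector stand-in for the Grover oracle and diffusion steps. The cost table `f`, the round count, and the randomized iteration-count heuristic below are illustrative assumptions (in the style of Dürr-Høyer), not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
f = rng.integers(0, 100, size=2 ** n)           # hypothetical cost table over 4-bit strings

def grover_amplify(marked, n_iters):
    # Statevector simulation of Grover iterations amplifying the `marked` states.
    dim = len(marked)
    psi = np.full(dim, 1 / np.sqrt(dim))
    for _ in range(n_iters):
        psi = np.where(marked, -psi, psi)       # phase oracle: (-1)^{g_y(x)}
        psi = 2 * psi.mean() - psi              # inversion about the mean
    return psi

def gas_minimize(f, rounds=30, seed=1):
    rng = np.random.default_rng(seed)
    y = int(f[rng.integers(len(f))]) + 1        # threshold from a random first sample
    for _ in range(rounds):
        marked = f < y
        if not marked.any():
            break                               # nothing below y: y is the minimum found
        # Randomized iteration count up to ~sqrt(N/m), a Durr-Hoyer-style heuristic.
        k = int(rng.integers(0, int(np.ceil(np.sqrt(len(f) / marked.sum()))) + 1))
        probs = np.abs(grover_amplify(marked, k)) ** 2
        x = rng.choice(len(f), p=probs / probs.sum())   # simulated measurement
        if f[x] < y:
            y = int(f[x])                       # adaptively tighten the threshold
    return y

print(gas_minimize(f), f.min())
```

Each round either keeps the threshold or lowers it to an observed cost value, so the returned threshold only moves toward the true minimum.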

B. Swap test, Hadamard test, and the Grover operator
This section introduces the Swap test, Hadamard test, and their corresponding Grover operators, which will be used in the phase encoding of the cost function of QNNs.

Swap Test and its Grover operator
Let |p_j⟩, |t⟩ be the resulting quantum states of unitary operators P_j and T, respectively; that is, |p_j⟩ = P_j|0⟩^{⊗n} and |t⟩ = T|0⟩^{⊗n}. The swap test is a technique that can be used to estimate |⟨p_j|t⟩|² [54]. The circuit of the swap test is shown in Fig. 11. Here we present an alternative form of the swap test: instead of applying the swap operation on two quantum states, the circuit in this figure simulates the "swap" effect by applying the two unitaries P_j, T to the two registers in different orders, controlled by an ancilla qubit. The "anti-control" symbol is defined as follows: when the control qubit is in state |0⟩, the controlled unitary is executed; when the control qubit is in state |1⟩, it is not.

We denote the unitary of the swap test circuit (dotted green box in Fig. 11) as U_j (Eq. 3). The output state from U_j is denoted |φ_j⟩. Rearranging the terms, we have |φ_j⟩ = ½ |0⟩ (|p_j⟩|t⟩ + |t⟩|p_j⟩) + ½ |1⟩ (|p_j⟩|t⟩ − |t⟩|p_j⟩). Denote |u_j⟩ and |v_j⟩ as the normalized states of |p_j⟩|t⟩ + |t⟩|p_j⟩ and |p_j⟩|t⟩ − |t⟩|p_j⟩, respectively. Then there is a real number θ_j ∈ [π/4, π/2] such that |φ_j⟩ = sin θ_j |u_j⟩|0⟩ + cos θ_j |v_j⟩|1⟩ (Eq. 6), with sin²θ_j = (1 + |⟨p_j|t⟩|²)/2; therefore |⟨p_j|t⟩|² = −cos 2θ_j (Eq. 7). From Eqs. 7 and 6 we can see that the value of |⟨p_j|t⟩|² is encoded in the amplitudes of the output state |φ_j⟩ of the swap test. This will be used in the amplitude encoding of the QNN cost function, which is a crucial component of the quantum training.
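The amplitude relation above is easy to check numerically. The following sketch uses single-qubit registers and hypothetical random states, with a standard controlled-SWAP in place of the anti-controlled construction of Fig. 11, and verifies that the ancilla statistics encode |⟨p|t⟩|².

```python
import numpy as np

# Statevector check of the swap-test relation P(ancilla = 0) = (1 + |<p|t>|^2)/2.
rng = np.random.default_rng(3)

def random_state(dim):
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

p, t = random_state(2), random_state(2)

# Joint state ordering: ancilla (x) reg1 (x) reg2; ancilla starts in |+>.
psi = np.kron(np.array([1, 1]) / np.sqrt(2), np.kron(p, t))

# Controlled-SWAP on the two data qubits (acts only in the ancilla-|1> block).
swap = np.eye(4)[[0, 2, 1, 3]]
cswap = np.block([[np.eye(4), np.zeros((4, 4))], [np.zeros((4, 4)), swap]])
psi = cswap @ psi

# Final Hadamard on the ancilla.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
psi = np.kron(H, np.eye(4)) @ psi

p0 = np.sum(np.abs(psi[:4]) ** 2)               # probability of ancilla outcome 0
overlap = abs(np.vdot(p, t)) ** 2
print(p0, (1 + overlap) / 2)                    # the two values agree
```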
Applying the Schmidt decomposition to the state |φ_j⟩, we arrive at |φ_j⟩ = (−i/√2)( e^{iθ_j} |w_j^+⟩ − e^{−iθ_j} |w_j^−⟩ ), where |w_j^±⟩ = (|u_j⟩|0⟩ ± i |v_j⟩|1⟩)/√2. One can construct a Grover operator from U_j as G_j = U_j C_2 U_j^† C_1 (Eq. 10), where C_1 is the Z gate on the swap ancilla qubit and C_2 = I − 2(|0⟩⟨0|)^{⊗(2n+1)} is the "flip zero state" unitary, which can be implemented as the circuit shown in Fig. 12. The circuit representation of G_j is shown in Fig. 13. It is easy to check that the |w_j^±⟩ are eigenstates of G_j; that is, G_j |w_j^±⟩ = e^{±i2θ_j} |w_j^±⟩. Recall from Eq. 7 that the value of |⟨p_j|t⟩|² determines θ_j and is hence encoded in the phase of the eigenvalue of G_j. This will be used in the phase encoding of the QNN cost function, which is a crucial component of the quantum training.
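As a sanity check of this eigen-relation, the following sketch (single-qubit registers, random hypothetical P and T, and a standard controlled-SWAP in place of the paper's anti-controlled construction) builds G = U C_2 U^† C_1 numerically and confirms that its spectrum contains the eigenphase pair ±2θ.

```python
import numpy as np

rng = np.random.default_rng(7)

def random_unitary(dim):
    m = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, r = np.linalg.qr(m)
    return q * (np.diag(r) / np.abs(np.diag(r)))   # fix column phases

P, T = random_unitary(2), random_unitary(2)
p, t = P[:, 0], T[:, 0]                            # |p> = P|0>, |t> = T|0>

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2, I4 = np.eye(2), np.eye(4)
swap = np.eye(4)[[0, 2, 1, 3]]
cswap = np.block([[I4, np.zeros((4, 4))], [np.zeros((4, 4)), swap]])

# Swap-test unitary U: state preparation, Hadamard, controlled-SWAP, Hadamard.
U = np.kron(H, I4) @ cswap @ np.kron(H, I4) @ np.kron(I2, np.kron(P, T))

C1 = np.kron(np.diag([1.0, -1.0]), I4)             # Z on the swap ancilla
e0 = np.eye(8)[:, 0]
C2 = np.eye(8) - 2 * np.outer(e0, e0)              # "flip zero state" reflection
G = U @ C2 @ U.conj().T @ C1                       # Grover operator

theta = np.arcsin(np.sqrt((1 + abs(np.vdot(p, t)) ** 2) / 2))
phases = np.angle(np.linalg.eigvals(G))
# The spectrum contains the pair e^{+-i 2 theta}; remaining eigenvalues are +-1.
print(np.sort(np.round(phases, 6)), 2 * theta)
```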

Hadamard Test and its "Grover operator"
Similar to the swap test, the Hadamard test is a technique that can be used to estimate 0| P † j T P j |0 , for two unitary operators P j and T (assuming T is Hermitian).The circuit of Hadamard test is shown in Fig. 14.
We denote the unitary of the Hadamard test circuit (the dotted green box in Fig. 14) as U_j, and the output state from U_j as |φ_j⟩. Rearranging the terms, we have |φ_j⟩ = ½ |0⟩ (P_j|0⟩ + T P_j|0⟩) + ½ |1⟩ (P_j|0⟩ − T P_j|0⟩). Denote |u_j⟩ and |v_j⟩ as the normalized states of P_j|0⟩ + T P_j|0⟩ and P_j|0⟩ − T P_j|0⟩, respectively. Then there is a real number θ_j ∈ [0, π/2] such that |φ_j⟩ = sin θ_j |u_j⟩|0⟩ + cos θ_j |v_j⟩|1⟩. We can define the Grover operator G_j from U_j in the same way as in the last subsection for the swap test and obtain a similar eigen-relation. The value of ⟨0| P_j^† T P_j |0⟩ is encoded in the phase of the eigenvalue of G_j. This will be used in the phase encoding of the QNN cost function, which is a crucial component of the quantum training.
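An analogous numeric check works for the Hadamard test. This single-qubit sketch uses the hypothetical choice T = X (both Hermitian and unitary) and a random state, and verifies that the ancilla statistics encode the expectation value.

```python
import numpy as np

# Check of the Hadamard-test relation P(ancilla = 0) = (1 + Re<psi|T|psi>)/2
# for unitary, Hermitian T and |psi> = P|0>.
rng = np.random.default_rng(11)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], complex)
T = X                                           # hypothetical Hermitian unitary

v = rng.normal(size=2) + 1j * rng.normal(size=2)
psi = v / np.linalg.norm(v)                     # |psi> = P|0>

# Ordering: ancilla (x) system; ancilla in |+>, then controlled-T, then H.
state = np.kron(np.array([1, 1]) / np.sqrt(2), psi)
cT = np.block([[np.eye(2), np.zeros((2, 2))], [np.zeros((2, 2)), T]])
state = np.kron(H, np.eye(2)) @ (cT @ state)

p0 = np.sum(np.abs(state[:2]) ** 2)             # probability of ancilla outcome 0
expval = np.real(np.vdot(psi, T @ psi))
print(p0, (1 + expval) / 2)                     # the two values agree
```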

C. Creating Superpositions of QNNs
As an essential building block for our quantum training protocol, we present a way to create superpositions of QNNs entangled with their corresponding parameters. That is, we construct a controlled unitary P such that for every θ, P |θ⟩ |0⟩ = |θ⟩ U(θ) |0⟩ (Eq. 16), in which θ = (θ_1, ..., θ_M) is the set of trainable parameters of the QNN and U(θ) is the unitary of the QNN with the corresponding parameters. When P acts on a superposition of parameter states Σ_θ ω_θ |θ⟩, we have P Σ_θ ω_θ |θ⟩ |0⟩ = Σ_θ ω_θ |θ⟩ U(θ) |0⟩. The action of the controlled unitary P is depicted in Fig. 15.
This controlled unitary can be realized by dividing each rotation gate in the QNN into a sequence of binary segments and applying controlled operations on them. A simple example with a single rotation gate, U(θ) = R_z(θ), is illustrated in Fig. 16.
Each bit string of the parameter register can be seen as a binary representation of the rotation angle, and the associated basis state of the register is entangled with the rotation gate of the corresponding angle. For instance, in the example above, the bit string 111 corresponds to the angle 7θ_max/8, and |111⟩ is associated with R_z(7θ_max/8), where θ_max is the maximum value that the angle θ can take. This relation is fully illustrated in Fig. 17, in which we take θ_max = π.

FIG. 16. An example of the construction of P for one rotation gate R_z(θ). In this example, the parameter register consists of three qubits, and each qubit controls a "partial" rotation on the fourth qubit. The "partial" rotations are the binary segments R_z(θ_max/2), R_z(θ_max/4), R_z(θ_max/8), in which θ_max is the maximum value that the angle θ can take.

FIG. 17. An example of the effect of P defined in Fig. 16. Each bit string of the parameter register can be seen as a binary representation of the rotation angle, and the associated basis state of the register is entangled with the rotation gate of the corresponding angle. For instance, the bit string 111 corresponds to the angle 7θ_max/8, and |111⟩ is associated with R_z(7θ_max/8).
The unitary operator P can be written as P = Σ_j |j⟩⟨j| ⊗ P_j (Eq. 18), in which P_j is a specific configuration of the QNN defined by its control bit string j. This representation applies not only to a single rotation gate but also to the case of multiple parameterized rotation gates in the QNN. An example with two rotation gates is depicted in Fig. 18.
In order to achieve precision ε_0 for each rotation angle, the number of control qubits needed per angle is d = ⌈log_2(1/ε_0)⌉. Let r be the number of rotation gates in a QNN; then the total number of control qubits needed is dr.

FIG. 18. Example of the construction of P for a QNN consisting of two rotation gates. In this example, the QNN consists of two rotation gates R_z(θ_1), R_z(θ_2) on the lower two qubits. The upper six qubits are divided into two parameter registers for the two rotation angles θ_1, θ_2, respectively. Each qubit controls a "partial" rotation. For instance, the "partial" rotations of R_z(θ_1) are the binary segments R_z(θ_1,max/2), R_z(θ_1,max/4), R_z(θ_1,max/8), in which θ_1,max is the maximum value that the angle θ_1 can take.
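The binary segmentation above can be checked directly. A minimal sketch (single rotation gate, d = 3 control qubits; the function name is hypothetical) builds P = Σ_j |j⟩⟨j| ⊗ R_z(j·θ_max/2^d) block by block, qubit k of the register contributing the segment θ_max/2^{k+1}.

```python
import numpy as np

d = 3
theta_max = np.pi

def rz(a):
    return np.diag([np.exp(-1j * a / 2), np.exp(1j * a / 2)])

def controlled_P(d, theta_max):
    dim = 2 ** d
    # Block-diagonal over control bit strings j: |j><j| (x) Rz(angle_j),
    # where bit k (most significant first) contributes theta_max / 2^(k+1).
    P = np.zeros((2 * dim, 2 * dim), complex)
    for j in range(dim):
        angle = sum(((j >> (d - 1 - k)) & 1) * theta_max / 2 ** (k + 1)
                    for k in range(d))
        P[2 * j:2 * j + 2, 2 * j:2 * j + 2] = rz(angle)
    return P

P = controlled_P(d, theta_max)
# Bit string 111 (j = 7) enacts Rz(7 * theta_max / 8), matching Fig. 16/17.
print(np.allclose(P[14:16, 14:16], rz(7 * theta_max / 8)))
```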

III. QNN training by Grover Adaptive Search
In this section we discuss using Grover adaptive search to perform global optimisation of QNNs.As presented in Section II A 3, the core of the Grover adaptive search is the adaptive oracle defined in Eq. 2.
Next we detail how to construct such an oracle for QNN training.

A. Construction of the Grover Oracle
The adaptive Grover oracle O_Grover in the context of QNN training acts as O_Grover |θ⟩ = (−1)^{g(θ)} |θ⟩ (Eq. 19), in which C* is the adaptive threshold for the cost function and the function g is defined as g(θ) = 1 if C(θ) < C* and g(θ) = 0 otherwise. When O_Grover acts on a superposition of parameter states Σ_θ ω_θ |θ⟩, we have O_Grover Σ_θ ω_θ |θ⟩ = Σ_θ (−1)^{g(θ)} ω_θ |θ⟩. The QNN Grover oracle O_Grover can be constructed by the following steps.

Amplitude Encoding
The first step is to encode the cost function of the QNN into an amplitude. Depending on the form of the QNN cost function, the amplitude encoding can be achieved by the swap test or the Hadamard test. The correspondences are summarized in Table II. a. Amplitude encoding by swap test. For the task of learning a pure state |ψ⟩ = T|0⟩ (T is a given unitary), the cost function is the fidelity between the state generated by the QNN and the state |ψ⟩ = T|0⟩. In this case the amplitude encoding can be achieved by the swap test, as shown in the circuit in Fig. 19.

TABLE II. Amplitude-encoding methods for different cost functions.
Task | Cost function | Amplitude encoding method
Learning a pure state |ψ⟩ = T|0⟩ | Fidelity | Swap test
Generating the ground state of T | Expectation value | Hadamard test

FIG. 19. Here, P_j represents the QNN with a specific (the "jth") parameter configuration. To achieve the swap test in parallel, we add an extra register, the parameter register, as the control of P_j: each computational basis state |j⟩ of the parameter register corresponds to a specific parameter configuration in P_j. As illustrated in Fig. 15, once the parameter register is in a superposition state (prepared by the Hadamard gates H^{⊗dr}), the corresponding P_j are in superposition. We refer to the controlled operation on the QNN as "controlled-QNN". Compared with the normal swap test depicted in Fig. 11, the difference here is that the swap ancilla qubit anti-controls/controls the "controlled-QNN" together with the unitary T (gathered in the dotted blue/orange boxes). It can be proven that the entire circuit in the dotted green box (denoted U) can be expressed as U = Σ_j |j⟩⟨j| ⊗ U_j, where U_j is the swap test unitary for P_j defined in Fig. 11. This indicates that U effectively performs the swap test in parallel for multiple P_j. Recall that the normal swap test U_j encodes |⟨p_j|t⟩|² in the amplitudes of its output state (Eqs. 7 and 6); here the "parallel swap test" U encodes the QNN cost function |⟨p_j|t⟩|² in the amplitudes of a superposition of P_j (QNNs) with different parameters.

We denote the unitary of the swap test circuit (in the dotted green box) as U, and the input and output states of U as |Ψ0⟩ and |Ψ1⟩, respectively. The input to U can be written as (note that here and throughout the paper we omit normalization factors) |Ψ0⟩ = Σ_j |j⟩ |0⟩_anc |0⟩^{⊗n}_{QNN1} |0⟩^{⊗n}_{QNN2}. Here, P_j represents the QNN with the specific parameter configuration defined by its control bit string j, as in Eq. 18. It can be proven (see Appendix A) that U can be rewritten as U = Σ_j |j⟩⟨j| ⊗ U_j, where U_j is the individual swap test unitary on unitary P_j and target unitary T, defined as in Eq. 3. As in Eq. 6, the state resulting from U_j acting on the ancilla and QNN registers has the form |φ_j⟩ = sin θ_j |u_j⟩|0⟩ + cos θ_j |v_j⟩|1⟩. The final output state from U, |Ψ1⟩ = U|Ψ0⟩, is therefore |Ψ1⟩ = Σ_j |j⟩ (sin θ_j |u_j⟩|0⟩ + cos θ_j |v_j⟩|1⟩). From Eqs. 6 and 7, the cost function |⟨p_j|t⟩|² is thus encoded in the amplitudes, in superposition over parameter configurations. b. Amplitude encoding by Hadamard test. For the task of generating the ground state of a given Hamiltonian T, the cost function is the expectation value of T with respect to the state generated by the QNN. In this case the amplitude encoding can be achieved by the Hadamard test, as shown in the circuit in Fig. 20.
Since the analysis for the Hadamard test is very similar to that of the swap test, we omit the details here. For the same reason, in the next section on phase encoding we also present only the swap test case.

Amplitude estimation
The second step following the amplitude encoding is to use amplitude estimation [55] to extract and store the cost function into an additional register which we call the "amplitude register".In the following we present the details of amplitude estimation.
After the amplitude encoding by the swap test, we introduce an extra register |0⟩^{⊗t}_{amplitude}, and the output state |Ψ1⟩ (using the same notation) becomes |Ψ1⟩ = Σ_j |j⟩ |φ_j⟩ |0⟩^{⊗t}, where |φ_j⟩ can be decomposed as |φ_j⟩ = (−i/√2)( e^{iθ_j} |w_j^+⟩ − e^{−iθ_j} |w_j^−⟩ ). Hence, we have |Ψ1⟩ = Σ_j |j⟩ (−i/√2)( e^{iθ_j} |w_j^+⟩ − e^{−iθ_j} |w_j^−⟩ ) |0⟩^{⊗t}. The overall Grover operator G is defined as G = U C_2 U^† C_1, where C_1 is the Z gate on the swap ancilla qubit and C_2 is the "flip zero state" unitary, similar to that defined in Fig. 12. It can be shown (see Appendix A) that G can be expressed as G = Σ_j |j⟩⟨j| ⊗ G_j, where G_j is the individual Grover operator as defined in Eq. 10. The overall Grover operator G possesses the eigen-relation G (|j⟩ ⊗ |w_j^±⟩) = e^{±i2θ_j} (|j⟩ ⊗ |w_j^±⟩). Next we apply phase estimation of the overall Grover operator G to the input state |Ψ1⟩. The resulting state |Ψ2⟩ can be written as |Ψ2⟩ = Σ_j |j⟩ (−i/√2)( e^{iθ_j} |w_j^+⟩ |+2θ_j⟩ − e^{−iθ_j} |w_j^−⟩ |−2θ_j⟩ ) (Eq. 34). Note that in Eq. 34, |±2θ_j⟩ denotes the eigenphases ±2θ_j stored in the amplitude register with some finite precision.
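The role of the phase estimation step, writing the eigenphase ±2θ_j into the amplitude register, can be illustrated with a minimal one-eigenvector simulation. The function below is an idealized stand-in, not the paper's circuit: for an eigenvalue e^{iφ}, it prepares the readout register as it stands after the controlled powers of the unitary and applies the inverse QFT.

```python
import numpy as np

def phase_estimate(phi, t):
    # Readout register after the controlled-U^{2^j} layers, for a single
    # eigenvector of U with eigenvalue e^{i phi}: amplitudes e^{i k phi}/sqrt(2^t).
    k = np.arange(2 ** t)
    reg = np.exp(1j * k * phi) / np.sqrt(2 ** t)
    # Inverse QFT: numpy's fft applies sum_k a_k e^{-2 pi i k m / 2^t}.
    amps = np.fft.fft(reg) / np.sqrt(2 ** t)
    probs = np.abs(amps) ** 2
    return int(np.argmax(probs)), probs

theta_j = 0.7                                    # hypothetical theta_j
m, probs = phase_estimate(2 * theta_j, t=8)
# The most likely readout is the t-bit approximation of 2*theta_j / (2*pi).
print(m, round(2 * theta_j / (2 * np.pi) * 2 ** 8))
```

With t readout bits, the register thus stores ±2θ_j to precision 2π/2^t, which is what the notation |±2θ_j⟩ abbreviates.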

Threshold Oracle and Uncomputations
Next we apply a threshold oracle U_O to the amplitude register and an extra phase ancilla qubit. It acts as in Eq. 35, imprinting a relative phase, via the phase ancilla, on those branches whose estimated angle crosses a threshold θ*; θ* is implicitly defined by the chosen threshold on the cost function. Note that in Eq. 35 we omit the state of the phase ancilla qubit.
The state after the oracle, |Ψ3⟩, can be written as in Eq. 36. The procedure thus far can be illustrated as a circuit, shown in Fig. 21. Step 1 (dotted green box) is the amplitude encoding of the cost function, as illustrated in Fig. 19 (refer to the caption of Fig. 19 for the meaning of each symbol), resulting in the state |Ψ1⟩ = Σj |j⟩ (sin θj |uj⟩|0⟩ + cos θj |vj⟩|1⟩), in which θj contains the cost function. Step 2 (dotted pink box) is amplitude estimation, which extracts the cost function and stores it in the additional "amplitude register". Step 3 (dotted yellow box) is the threshold oracle, which encodes the cost function into a relative phase using the phase ancilla qubit.
After we perform the uncomputation of the phase estimation and, finally, the uncomputation of the swap test, the resulting state is given in Eq. 40. As can be seen from Eqs. 40 and 7, the above steps implement the Grover oracle O_Grover (defined in Eq. 19). After this procedure, a relative phase, which depends on the cost function of the QNN, |⟨pj|t⟩|², and on the threshold, has been coherently added to the parameter state. Importantly, uncomputation allows the parameter register to be decoupled from the QNN and other registers.
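Viewed on the parameter register alone, the net effect of these encode–estimate–threshold–uncompute steps is a diagonal ±1 oracle. A classical stand-in (the cost values, the threshold, and the convention of marking configurations below the threshold are illustrative assumptions):

```python
import numpy as np

# O_Grover acts on the parameter register as a diagonal operator that flips
# the sign of every |j> whose cost crosses the current threshold.  Costs and
# threshold below are made-up values; the "mark below threshold" direction
# corresponds to minimising a VQE-style cost.
costs = np.array([0.9, 0.2, 0.5, 0.05])      # C_j per parameter configuration j
C_star = 0.4                                  # current Grover threshold
O_grover = np.diag(np.where(costs < C_star, -1.0, 1.0))

state = np.full(4, 0.5)                       # uniform superposition over j
marked = O_grover @ state
assert np.allclose(marked, [0.5, -0.5, 0.5, -0.5])
```

Grover adaptive search then amplifies the marked branches and tightens the threshold after each measurement.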

B. Performance of the Quantum training by Grover Adaptive Search
Taking the training of VQE as an example, in Table III we present the number of "controlled-QNN" runs, the number of QNN runs, and the number of measurements needed for quantum training by Grover adaptive search. The derivations are included in Appendix B.

C. Advantages and disadvantages of training by Grover Adaptive Search
In the presence of a noise-free barren plateau, the Grover-adaptive-search mechanism can find global optima without an exponential number of measurements. However, it has the following disadvantages: • As can be seen from Table III, in quantum training by Grover adaptive search the number of "controlled-QNN" runs is exponential in the number of parameters of the QNN. Even when the number of parameters scales only linearly with the number of qubits in the QNN, quantum training by Grover adaptive search takes excessive runtime. Moreover, it invokes a very deep circuit.
• Training by Grover adaptive search does not circumvent the noise-induced barren plateau. When the entire cost landscape is flattened by a noise-induced barren plateau [21], exponential precision of the amplitude estimation is required; that is, ε₂ must be exponentially small. According to Table III, this adds another exponentially large factor to the number of "controlled-QNN" runs and QNN runs.
While these disadvantages most probably rule out Grover adaptive search for NISQ-era devices, it still represents a maximally quantum solution. For fault-tolerant devices, this method is the provably optimal approach for QNN cost functions with no structure: it enjoys a quadratic speed-up, a significant improvement compared with the exponential "slow-down" that classical training methods suffer from the barren plateau issue.

IV. QNN training by Adaptive QAOA
As depicted in Fig. 3, our framework for quantum training of QNNs consists of two major components.
• Phase oracle. This coherently encodes the cost function of QNNs onto a relative phase of a superposition state in the Hilbert space of the parameters [26].
• Adaptive mixers. These exploit hidden structure in QNN optimisation problems and hence can achieve short circuit depth [56].
Iterations of the phase oracle and the adaptive mixers constitute a QAOA routine which quantumly homes in on the optimal network parameters of the QNN. This section presents the details of our framework.

A. Phase Oracle
We aim to coherently achieve the phase encoding of the cost function of the QNN by a phase oracle O_Phase, which acts as

O_Phase |θ⟩ = e^−iγC(θ) |θ⟩, (41)

in which γ is a free parameter to be optimized. When O_Phase acts on a superposition state of parameters Σθ ωθ |θ⟩, we obtain Σθ ωθ e^−iγC(θ) |θ⟩. As detailed in Ref. [26], this phase oracle can be constructed from the amplitude encoding implemented in Section III A. Next we present the details of how to construct the phase oracle from the amplitude encoding, by amplitude estimation or by a Linear Combination of Unitaries (LCU) [57].
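A minimal numerical illustration of this action (the cost values and γ are illustrative): the phase oracle is diagonal, so by itself it changes no measurement probabilities; the subsequent mixers are what convert the phase information into amplitude information.

```python
import numpy as np

# O_Phase multiplies each basis amplitude by exp(-i*gamma*C(theta_j)).
# Costs below are made-up stand-ins for C(theta_j).
costs = np.array([0.8, 0.1, 0.4, 0.3])
gamma = 1.2
O_phase = np.diag(np.exp(-1j * gamma * costs))

state = np.full(4, 0.5 + 0j)                  # uniform superposition
after = O_phase @ state

# The phase is relative only: measurement statistics are untouched.
assert np.allclose(np.abs(after) ** 2, np.abs(state) ** 2)
```

This is why the phase oracle must always be paired with mixers in the QAOA iteration: interference between branches, not the phase itself, concentrates probability on good parameters.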

a. Phase oracle by amplitude estimation
The procedure to achieve O_Phase by amplitude estimation is very similar to that for O_Grover; the only difference is that the threshold oracle U_O (defined in Eq. 35) is replaced by an oracle acting on the amplitude register that imprints the phase directly (Eq. 43). Recalling Eq. 7, Eq. 15 and the form of the cost function in Table II, the cost function C(θ) is encoded in θj as C(θ) = −cos 2θj; the phase oracle therefore acts as in Eq. 44. Once a specific value of γ has been chosen, this oracle can be constructed according to Eq. 44.
b. Phase oracle by LCU. For this approach, we start by constructing an operator G* defined similarly to Eq. 31 [58]. It has been shown in Ref. [26] that e^−iC(θ) can be written as a series in powers of G*. This series can be implemented using the LCU technique (together with the subsequent "oblivious amplitude amplification") [57], in which the number of calls to G* is only logarithmic in the inverse of the desired precision [26]. Using the techniques of Ref. [59], we can convert the phase oracle with e^−iC(θ) into a phase oracle with e^−iγC(θ) for arbitrary γ in [−1, 1], with only a logarithmic (in the inverse of the desired precision) number of queries to the phase oracle with e^−iC(θ).
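To make the series concrete: if G* has eigenvalues e^±2iθ and C(θ) = −cos 2θ as in the previous subsection, then the Jacobi-Anger expansion e^{i cos x} = Σn iⁿ Jn(1) e^{inx} expresses e^−iC as a linear combination of powers of G*. The sketch below (γ = 1 and the truncation order are our illustrative choices; Bessel values are computed by numerical quadrature) verifies the expansion at a sample eigenphase:

```python
import numpy as np

def bessel_j(n, z, steps=20000):
    # J_n(z) = (1/pi) * integral_0^pi cos(n*t - z*sin(t)) dt (trapezoid rule);
    # valid for all integer n.
    t = np.linspace(0.0, np.pi, steps + 1)
    f = np.cos(n * t - z * np.sin(t))
    h = t[1] - t[0]
    return h * (f.sum() - 0.5 * (f[0] + f[-1])) / np.pi

theta = 0.4
x = 2 * theta                                 # eigenphase of G*
target = np.exp(1j * np.cos(x))               # exp(-i * C(theta)), C = -cos(2*theta)
# Truncated LCU series: each exp(i*n*x) term is the action of (G*)^n.
series = sum((1j ** n) * bessel_j(n, 1.0) * np.exp(1j * n * x)
             for n in range(-8, 9))
assert np.isclose(series, target)
```

The coefficients Jn(1) decay super-exponentially in n, which is the origin of the logarithmic query count quoted above.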
In Fig. 22 we summarise the two approaches for the Phase encoding of the cost function.

B. Adaptive Mixers
As in Section II A 2, we designed a new variant of QAOA, "Adaptive-Continuous QAOA" (AC-QAOA), to serve as the ansatz of our quantum training for QNNs. We summarise the reasons for this choice as follows: 1. [Why "Continuous"] In our optimisation problem of QNN training, the parameters being optimized (the angles of rotation gates) are continuous, real-valued variables, so the mixer Hamiltonian has to be designed for continuous variables. For example, the mixer Hamiltonian of the original QAOA (X rotations) generates shifts in the computational basis; in the continuous case, the corresponding mixer should shift the value of each digitized continuous variable stored in an independent register.

[Why "
Adaptive"] The Cost function of QNNs is complicated and task-specific (given by the learning objectives).Hence it is hard to analytically determine good mixers for our optimisation problem of QNN training.Therefore we would want to take advantage of including alternative mixers and allowing adaptive mixers for different layer (as in ADAPT-QAOA). controlled-QNN Step 0 Step 1 Step 2 Step 3 Phase Oracle e −iγC′ (θ) FIG.22. Pipeline of the construction of the phase oracle.Here we summarise the two approaches by amplitude estimation and by LCU for the Phase encoding of the cost function.
Step 1: Amplitude encoding of the cost function, by the unitary operation U = j |j j| ⊗ Uj.
Step 2: Constructing the "Grover Operator" upon the amplitude encoding unitary.In the approach using amplitude estimation, the Grover Operator G is constructed as G = U C2U −1 C1.In the approach using LCU, the Grover Operator G * is constructed as G * = C2U −1 C1U .Step 3: Phase encoding of the cost function, by amplitude estimation(upper path) or by LCU(lower path).In the upper path, the Phase Oracle is achieved by phase estimation on G, threshold oracle U O , and uncomputation.In the lower path, LCU on G * (together with the subsequent "Oblivious Amplitude Amplification") [57] realizes e −iC (θ) which is then converted to the Phase Oracle with arbitrary γ -e iγC (θ) using the method in Ref. [59].
Adopting "AC-QAOA" could exploit hidden structure in QNN optimisation problem and dramatically shorten the depth of QAOA layers while significantly improving the quality of the solution [56].
Generally, the mixer pool of AC-QAOA should include two types of mixer Hamiltonians for continuous variables: 1. quadratic functions of the position operator and the momentum operator of a single continuous variable, e.g. the generator of the squeezing operator [60]; 2. entangling Hamiltonians coupling two continuous variables, discussed below.
These operators could be carried out in continuous variable quantum systems.However, we will focus on the circuit implementation of these mixers when using a collection of qubits to approximate the behavior of continuous variables.
When using a qudit of dimension d to digitally simulate a continuous variable, the position operator can be written as J_d = Σ_{j=0}^{d−1} j |j⟩⟨j|, in which j is the digitized value of the continuous variable.
We can use N qubits to simulate the qudit and construct J_d for d = 2^N as J_d = Σ_{n=1}^{N} 2^{N−n} (I^(n) − Z^(n))/2 [25], where I^(n) and Z^(n) are the identity and the Pauli-Z operator (respectively) on the nth qubit.
The momentum operator, which acts as the generator of shifts in the value of a continuous variable (denoted S), can be written as the discrete Fourier transform of J_d [25], S = F_d J_d F_d†, in which the discrete Fourier transform F_d is defined by (F_d)_{jk} = ω_d^{jk}/√d, where ω_d := e^{2πi/d}.
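These definitions can be checked directly for a small register. The sketch below (N = 2, d = 4; the Fourier sign convention, and hence the shift direction, is an assumption) builds J from the Pauli-Z expression, forms S = F J F†, and confirms that exponentiating S cyclically shifts the stored value, which is exactly the mixer action described above:

```python
import numpy as np

# Position operator from the Pauli-Z construction (Eq. 49), N = 2 qubits:
Z = np.diag([1.0, -1.0])
n1 = (np.eye(2) - Z) / 2                      # (I - Z)/2 = diag(0, 1)
J = 2 * np.kron(n1, np.eye(2)) + np.kron(np.eye(2), n1)   # diag(0, 1, 2, 3)

# Momentum operator as the discrete Fourier transform of J (Eq. 50):
d = 4
F = np.array([[np.exp(2j * np.pi * j * k / d) for k in range(d)]
              for j in range(d)]) / np.sqrt(d)
S = F @ J @ F.conj().T                        # Hermitian

# exp(2*pi*i*S/d) should be a cyclic shift |j> -> |j -+ 1 mod d>
# (direction depends on the chosen Fourier sign convention).
w, V = np.linalg.eigh(S)
shift = V @ np.diag(np.exp(2j * np.pi * w / d)) @ V.conj().T
assert np.allclose(shift, np.roll(np.eye(d), 1, axis=1))
```

Time evolution under S for a fraction of 2π/d thus interpolates between grid points, giving the continuous-variable analogue of the X-mixer shifts.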
As mentioned above, a general mixer Hamiltonian is a quadratic function of the position operator J_d and the momentum operator S; therefore, using Eqs. 49 and 50 (setting d = 2^N), we can rewrite a mixer Hamiltonian as a summation of simple unitaries. Hence, utilising the Hamiltonian-simulation techniques of [57], the mixer operator can be efficiently implemented. For instance, the digitized version of the generator of the squeezing operator (denoted T) is defined in Eq. 52. Plugging Eq. 50 into Eq. 52, together with Eq. 49 (setting d = 2^N), we see that T can be expressed as a summation of simple unitaries; therefore the corresponding mixer with Hamiltonian T can be efficiently implemented using the Hamiltonian-simulation techniques of [57]. Similarly, the entangling mixers on two continuous variables, with Hamiltonians S_i S_j, S_i T_j and T_i T_j (the subscripts i, j indicate the variables they act on), can be implemented in the same manner. In Fig. 23 we depict the schematic diagram of applying AC-QAOA to QNN training.
Because non-Gaussian operators are costly to implement, we consider only up-to-second-order polynomial functions of the position operator J_d and the momentum operator S for the mixer Hamiltonians. The mixer pool can generally include mixers with Hamiltonians J_d, S, J_d S, S J_d, J_d², S², J_d² + S² for one continuous variable, plus the entangling mixers for two continuous variables. Compared with the mixer pool of ADAPT-QAOA for discrete variables, we have the following analogy: 1. The momentum operator S is the (digitized) continuous version of the X mixers: it shifts the value of each digitized continuous variable stored in an independent register.
2. J_d S is the (digitized) continuous version of the Y mixers, which 'unlock' geodesics in parameter space, allowing the QAOA iterations to reach the target state faster [29]. We note that quadratic Hamiltonians are efficiently classically simulatable, but only when the initial state is from the special class of Gaussian states (e.g. the vacuum state) [61]. Here, the initial state in the qubit encoding is far from Gaussian, and a continuous-variable analogue of our technique would use an equivalent encoding.
Making the mixers flexible and adaptive to the specific optimisation problem makes it demanding to find an efficient way of determining the mixer sequence and optimizing the hyper-parameters. Several works use machine-learning approaches (recurrent neural networks (RNNs) and reinforcement learning (RL)) to determine the mixer sequence and optimize the hyper-parameters, achieving significantly fewer measurements than conventional approaches (e.g. gradient-based methods). We list these papers in the following table:

C. Advantages of training by QAOA
As discussed in Section III C, due to the global-search nature of Grover's algorithm, quantum training using Grover adaptive search can circumvent the noise-free barren plateau; however, it has certain limitations and disadvantages: 1. it cannot handle the noise-induced barren plateau; 2. it requires an exponential number of calls to the "controlled-QNN", with excessive circuit length and runtime.
In contrast, our quantum training using adaptive continuous QAOA can eliminate these limitations, and its advantages are two-fold: 1. The phase-oracle-by-LCU approach does not explicitly evaluate or store the value of the cost function at any stage of the algorithm, and the number of calls to the "controlled-QNN" scales only logarithmically with the inverse of the desired precision [26]. Therefore the phase encoding is not affected by the noise-induced barren plateau, for which the required precision is exponentially small. This is better than the case using Grover adaptive search.
2. The adaptive mixers can dramatically reduce the number of QAOA iterations while significantly increasing the quality of the output solution. This enables our quantum training to achieve high performance with relatively shallow circuits and short runtime. Because the phase encoding faithfully conserves all the information and structure in the cost function, our adaptive QAOA protocol can exploit hidden structure in the QNN training problem (whereas the Grover oracle 'cuts off' the cost function at the threshold, effectively losing some information and structure). Therefore, adaptive QAOA can offer beyond-Grover speed-up. Moreover, numerical experiments in [29] show that with the adaptive approach the depth of the QAOA circuit can be independent of the problem size (number of qubits), which yields even more advantage as the system size scales up.

V. Applications
In this section we discuss several applications of QNNs to which our quantum training algorithm can be applied. For each application, we first briefly describe the use of the QNN and the corresponding cost function for the task, and then present the amplitude encoding tailored to the application. Given the amplitude encoding, the construction of the full quantum training algorithm is similar for every application.

A. Training VQE
Variational quantum eigensolvers (VQEs) utilize a QNN to estimate the eigenvalue corresponding to some eigenstate of a Hamiltonian. The most common instance is ground-state estimation, in which the QNN (a parameterized circuit ansatz) is applied to an initial state (e.g. the zero state) over multiple qubits to generate the ground state. The parameters of the QNN are optimized so that the state generated by the QNN possesses the lowest expectation value of the given Hamiltonian. A schematic of VQE for ground-state estimation is presented in Fig. 24.
Our goal is then to estimate the expectation value of the given Hamiltonian with respect to the QNN output state. Here we use the "Linear Combination of Unitaries" (LCU) technique [64] to implement the Hamiltonian: we define new unitary oracles W and H_LCU accordingly. The amplitude encoding of the cost function of the QNN can then be implemented using the circuit in Fig. 25.
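As a minimal illustration of the LCU viewpoint (the two-term Hamiltonian and one-parameter ansatz below are illustrative stand-ins, not the paper's), the VQE cost is the expectation of a weighted sum of unitaries, so it equals the same weighted sum of per-term expectations:

```python
import numpy as np

# Hamiltonian as a linear combination of unitaries (Pauli terms):
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
terms = [(0.5, Z), (0.3, X)]                  # H = 0.5*Z + 0.3*X
H = sum(c * P for c, P in terms)

# One-qubit Ry-style ansatz state as a stand-in for the QNN output:
theta = 0.7
psi = np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)

cost = np.real(np.vdot(psi, H @ psi))                       # <psi|H|psi>
cost_lcu = sum(c * np.real(np.vdot(psi, P @ psi)) for c, P in terms)
assert np.isclose(cost, cost_lcu)
```

In the coherent version, W prepares the weights c_i in an ancilla register and H_LCU applies the unitaries P_i controlled on it, so this same sum is formed in superposition over all parameter configurations.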

B. Learning to generate a pure state
Another application of our quantum training is when the QNN serves as a generative model to learn a pure state. In our scenario, the target state is generated by a given unitary (e.g. a given sequence of gates), and the QNN serves as another generator circuit for the target state. The parameters of the QNN are optimized so that the state generated by the QNN matches the target state. This approach can be used to transform a given sequence of gates into a different or simpler sequence (e.g. translating circuits from superconducting gate sets to ion-trap gate sets). A schematic of this application is presented in Fig. 26.
The amplitude encoding for this application was given in Section III A 1 a.

C. Training a Quantum Classifier
Finally, we discuss the application of a QNN as a quantum classifier performing supervised learning, a standard problem in machine learning.
To formalise the learning task, let X be a set of inputs and Y a set of outputs. Given a dataset D = {(x1, y1), ..., (xM, yM)} of pairs of so-called training inputs xm ∈ X and target outputs ym ∈ Y for m = 1, ..., M, the task of the model is to predict the output y ∈ Y of a new input x ∈ X. For simplicity we assume in the following that X = R^N and Y = {0, 1}, i.e. a binary classification task on an N-dimensional real input space. In summary, the quantum classifier aims to learn an effective labelling function f : X → {0, 1}.

FIG. 26. Schematic of using a QNN to generate a pure state. In our scenario, the target state is generated by a given unitary T, i.e. |Ψtarget⟩ = T|0⟩; the QNN (denoted U(θ)) serves as another generator circuit for the target state. The parameters of the QNN are optimized such that the generated state |ΨQNN⟩ matches the target state. The cost function is the fidelity between the target state and the state generated by the QNN.
Given an input xi and a set of parameters θ, the quantum classifier first embeds xi into the state of an n-qubit quantum system via a state-preparation circuit S_xi such that S_xi |0⟩ = |ϕ(xi)⟩, and subsequently uses a learnable quantum circuit U(θ) (the QNN) as a predictive model for inference. The predicted class label y(i) = f(xi, θ) is retrieved by measuring a designated qubit of the state U(θ)|ϕ(xi)⟩. A schematic of the quantum classifier is presented in Fig. 27. Although the variational quantum classifier can operate as a multi-class classifier, here we limit ourselves to the binary classification task discussed above and cast multi-label tasks as sets of binary discrimination subtasks.
The total cost function over the whole training set is C(θ) = Σi L(xi, yi, θ), from which it follows immediately that e^−iγC(θ) = Πi e^−iγL(xi,yi,θ) (61). Therefore the phase encoding of the total cost function can be implemented by accumulating the individual phase encodings for each training sample; this process is illustrated in Fig. 29. Armed with the phase encoding of the total cost function, we can construct the full quantum training protocol as in Fig. 30.
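Eq. 61 is simply the exponential of a sum factorizing into a product of exponentials; a one-line numerical check (the per-sample loss values are illustrative):

```python
import numpy as np

# Per-sample losses L(x_i, y_i, theta) for a fixed theta (made-up values):
gamma = 0.8
losses = np.array([0.2, 0.7, 0.1])

total = np.exp(-1j * gamma * losses.sum())        # exp(-i*gamma*C(theta))
product = np.prod(np.exp(-1j * gamma * losses))   # product of per-sample phases
assert np.isclose(total, product)
```

This is what licenses applying the per-sample phase-encoding blocks one after another in Fig. 29: their phases simply add.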

VI. Discussion
In this paper we proposed a framework leveraging quantum optimisation routines to train QNNs. We designed a variant of QAOA (AC-QAOA) tailored to QNN training problems. Our framework using AC-QAOA to train QNNs consists of two major components: 1. a phase oracle that achieves coherent phase encoding of the cost function of the QNN; 2. adaptive mixers that can dramatically shorten the depth of the QAOA layers while significantly improving the quality of the solution. We adopt RNNs and RL to determine the mixer sequence and optimize the hyper-parameters. We also presented various applications to which our quantum algorithm can be applied.
QAOA and all of its variants are, by construction, heuristics, and therefore their advantages are ultimately determined by testing performance on concrete problems. Heuristically, AC-QAOA is expected to possess the advantages of its ancestors, i.e. ADAPT-QAOA and QDD. We leave demonstrating these advantages through numerical experiments as future work. The estimation of the number of qubits needed is given in Appendix C. For a small toy example with 5 qubits and 10 rotation gates in the QNN, our protocol requires roughly 60 qubits to implement. Thus, we expect our protocol can be demonstrated on near-term devices.
In this paper we have only discussed optimizing the rotation parameters of QNNs (a continuous optimisation problem). However, our framework can also be used to learn the circuit structure, i.e., to find a better circuit ansatz (a discrete optimisation problem), or even to learn the structure and parameters simultaneously. We leave these extensions to future work. Furthermore, we would like to explore the possibility of applying other quantum optimisation algorithms (such as adiabatic quantum evolution and quantum walks) to QNN training. We hope this work will provide a useful framework for the quantum training of quantum neural networks.

Number of QNN runs
After each measurement of the parameter register, we obtain a specific parameter configuration of the QNN, and we then need to estimate the cost function for that configuration. For VQE, the cost function is the expectation value of some Hamiltonian, and the number of estimations scales as O(1/ε^α) for some small power α [66] (α is a small integer, about 1 or 2), where ε is the desired accuracy of the expectation value. For our QNN training we choose the accuracy to be ε₂ as defined above. Taking α = 1, the number of QNN runs after each measurement scales as O(1/ε₂), and the total number of QNN runs over all measurements during the quantum training follows by inserting Eq. B14 into Eq. B15.

Taking the training of VQE as an example, the number of qubits in each register is: • QNN register: n qubits. • Parameter register: dr qubits (r is the number of parameters in the QNN, d is the number of control qubits for each rotation angle). • Hadamard test: 1 ancilla qubit. • LCU register and other registers: O(log log(1/ε)) qubits [26] (ε is the precision of implementing the phase oracle by LCU). In total, the number of qubits needed for the quantum training is n_total = n + dr + 1 + O(log log(1/ε)). For instance, when n = 5, d = 5, r = 10, ε = 10⁻⁸, we obtain n_total ≈ 60.
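This back-of-envelope count can be reproduced directly (the constant inside the O(log log(1/ε)) term is our assumption, chosen as 1):

```python
import math

# Qubit budget: n (QNN) + d*r (parameter register) + 1 (Hadamard-test
# ancilla) + ~log2(log2(1/eps)) (LCU and other registers).  The unit
# constant on the last term is an illustrative assumption.
def total_qubits(n, d, r, eps, c=1):
    return n + d * r + 1 + math.ceil(c * math.log2(math.log2(1 / eps)))

# n = 5, d = 5, r = 10, eps = 1e-8 gives roughly 60 qubits, as in the text.
assert abs(total_qubits(5, 5, 10, 1e-8) - 60) <= 5
```

The dominant term is the d·r parameter register, which is why the required qubit count grows linearly with the number of QNN parameters.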

FIG. 1. Schematic of our quantum training algorithm for VQE. Here we use the training of VQE as an example to present the schematic circuit construction of our quantum training algorithm for QNNs. A video animation of the circuit construction is available at https://youtu.be/RVWkJZY6GNY (this is a vector image and best viewed with the zoom feature of standard PDF viewers). Note: 1. In all figures of this paper, we omit the minus signs in all time-evolution-like terms (i.e. exponentials of a Hamiltonian, e^−iHt) for the sake of brevity and space. 2. Some quantum registers are not depicted in this figure owing to space limitations.

FIG. 2. QAOA-like training protocol for QNN, proposed in Ref. [25]. The quantum training protocol consists of two alternating operations in a QAOA fashion. The first operation acts on both the parameter register and the QNN register to encode the cost function of the QNN onto a relative phase of the parameter state; it is represented by the blue blocks in the figure. The second operation acts only on the parameter register and is a variant of the original QAOA mixers, tailored to the case where the parameters of the QNN are continuous variables; it is represented by the pink blocks. These two operations can be expressed mathematically as e^−iγiC(θ) and e^−iβiHM, where θ are the parameters of the QNN, C(θ) is the cost function of the QNN, γi and βi are tunable hyperparameters, and HM is the mixer Hamiltonian. The width of each block represents the hyperparameter γi or βi: the wider the block, the larger its value. The phase encoding operation e^−iγiHC acts as e^−iγiC(θ).

FIG. 3. Schematic of our framework for quantum training of QNNs. Our quantum training takes advantage of the well-established parts of Refs. [25] and [26] while eliminating their shortcomings. We replace the phase encoding operations of the QAOA-like protocol of Ref. [25] (depicted in Fig. 2) with the phase oracle of Ref. [26]. For the mixers in the QAOA-like routine, we allow different mixers in each layer, similar to Ref. [27]. In this figure, the color of each block represents the corresponding Hamiltonian: different colors correspond to different Hamiltonians (the cost Hamiltonian is the same throughout the training, whereas the mixer varies from layer to layer). The mixer pool contains the proper mixers tailored to our QNN training problem. These rules also apply to the other circuit schematics in this paper.

FIG. 4. Interference process of QAOA. QAOA is an interference-based algorithm in which non-target states interfere destructively while target states interfere constructively. Here we illustrate this interference process by presenting the evolution of the quantum state of the parameters (black bar graphs on the yellow plane) alongside the QAOA operations (blue and pink boxes on the circuit lines, representing the phase encoding and the mixers, respectively). The starting state Σθ |θ⟩ (omitting the normalization factor) is the even superposition of all possible parameter configurations. After the first phase encoding operation, the state becomes Σθ e^−iγ1C(θ) |θ⟩, for which we use the opacity of the bars to indicate the value of the phase; the magnitudes of the amplitudes remain unchanged. After the first mixer, the magnitudes of the amplitudes change. A similar process happens in the subsequent operations, until the amplitudes of the optimal parameter configurations are amplified significantly (the furthest bar graph). The grey bar graph in the right corner is the cost function being optimized by QAOA.

FIG. 5. Quantum circuit schematic of the operations of the original QAOA. The state is initialized by applying Hadamard gates to each qubit, represented as H^⊗n; this yields the equal superposition of all possible solutions. QAOA consists of alternating time evolution under the two Hamiltonians HC and HM for p rounds, where the duration of round j is specified by the parameters γj and βj, respectively. In the original QAOA, the mixing Hamiltonian is chosen to be HM = Σ_{j=1}^{n} Xj. After all p rounds, the state becomes |β, γ⟩ = e^−iβpHM e^−iγpHC · · · e^−iβ2HM e^−iγ2HC e^−iβ1HM e^−iγ1HC |s⟩.

FIG. 8. Quantum circuit schematic of ADAPT-QAOA. The overall process is similar to that of the original QAOA in Fig. 5; the difference is that the mixer layers contain variable mixers taken from a mixer pool. Define Q to be the set of qubits. The mixer pool of ADAPT-QAOA is P_ADAPT-QAOA = {Xi, Yi : i ∈ Q} ∪ {Σ_{i∈Q} Xi, Σ_{i∈Q} Yi} ∪ {BiCj : (i, j) ∈ Q × Q, Bi, Cj ∈ {X, Y, Z}}.

FIG. 9. Comparison of the original QAOA and ADAPT-QAOA. In the left and right panels we depict the state change in the Hilbert space of the parameters to be optimized, for the original QAOA and ADAPT-QAOA respectively. The starting state Σθ |θ⟩ (omitting the normalization factor), represented by the rounded dot at the bottom of each space, is the even superposition of all possible solutions. The arrows represent the state evolution generated by the cost and mixer Hamiltonians; the color and direction of each arrow indicate the nature of the evolution. Blue arrows represent evolution by the cost Hamiltonian; arrows of other colors represent evolution by different mixer Hamiltonians. In the original QAOA only one mixer (pink) is available, whereas in ADAPT-QAOA there are more alternative mixers to choose from the mixer pool. The two algorithms try to reach the target state |θ*⟩ (blue star) by stacking these arrows, which represent the alternating operations of the two QAOAs. For reference we sketch the relevant paths, the adiabatic path for the original QAOA and the counter-diabatic path for ADAPT-QAOA, along the state evolution of the two algorithms. As can be seen, ADAPT-QAOA takes far fewer iterations to reach a point closer to the target state, illustrating that allowing alternative mixers enables ADAPT-QAOA to dramatically improve algorithmic performance while achieving rapid convergence.

FIG. 10. Quantum circuit schematic of AC-QAOA. AC-QAOA is a variant of QAOA we designed for optimisation over continuous variables while retaining the short-depth advantage of QAOA layers. In this figure, θi are the continuous variables to be optimized in the training; each θi is digitized into binary form and stored in an independent register. The overall process of AC-QAOA is similar to that of the original QAOA, with the following differences. 1. The mixers of AC-QAOA, with Hamiltonians Si and Ti, act on the registers of the θi (rather than on single qubits as in the original QAOA). 2. The mixers of AC-QAOA are alternative mixers taken from a mixer pool and can vary from layer to layer.

FIG. 14. Circuit diagram of the Hadamard test. The circuit is used to estimate ⟨0|Pj†TPj|0⟩ for two unitaries Pj and T. The Hadamard test will be used in the phase encoding of the QNN cost function, which is a crucial component of the quantum training.

From Eqs. 27 and 7 we can see that the cost function (the fidelity |⟨pj|t⟩|²) for different parameters has been encoded into the amplitudes of the state |Ψ1⟩.

FIG. 23. Schematic diagram of applying AC-QAOA to QNN training. AC-QAOA is a variant of QAOA we designed for optimisation over continuous variables with the short-depth advantage of QAOA layers; see Fig. 10. This figure illustrates applying AC-QAOA to QNN training, following the scheme in Fig. 3. The quantum training protocol consists of alternating operations in a QAOA fashion: the first operation acts on both the parameter register and the QNN register to encode the cost function of the QNN onto a relative phase of the parameter state (blue blocks); the other operations are the mixers (green and pink boxes), which act only on the parameter register. In the parameter register, θi are the continuous variables to be optimized; each θi is digitized into binary form and stored in an independent register. The overall process is similar to that of the original QAOA, with the following differences. 1. The mixers, with Hamiltonians Si and Ti, act on the registers of the θi (rather than on single qubits). 2. The mixers are alternative mixers taken from a mixer pool and can vary from layer to layer.

FIG. 27. Schematic of a classifier. For a training data point (xi, yi), the quantum classifier first embeds xi into the state of an n-qubit quantum system via a data-embedding circuit S_xi (purple box) such that S_xi |0⟩ = |ϕ(xi)⟩, and subsequently uses a learnable quantum circuit U(θ) (the QNN) as a predictive model for inference (for simplicity we use the same symbol θ for the angles of all the different gates). The predicted class label y(i) = f(xi, θ) is retrieved by measuring a designated qubit of the state U(θ)|ϕ(xi)⟩.

FIG. 29. Phase encoding of the total cost function of the quantum classifier. The total cost function over the whole training set is defined as C(θ) = Σi L(xi, yi, θ), from which it follows immediately that e^−iγC(θ) = Πi e^−iγL(xi,yi,θ). Therefore the phase encoding of the total cost function (the overall yellow box) can be implemented by accumulating the individual phase encodings for each training sample (blue boxes). In this figure, we omit θ in L(xi, yi, θ) for simplicity. The inner boxes within the blue boxes represent the different data-embedding unitaries for the training data points.

FIG. 30. Schematic of our quantum training protocol for the quantum classifier. The full protocol alternates the phase oracle, which achieves coherent phase encoding of the cost function, with adaptive mixers chosen from a mixer pool. The phase encoding of the total cost function for the quantum classifier is detailed in Fig. 29: since C(θ) = Σi L(xi, yi, θ), it follows that e^−iγC(θ) = Πi e^−iγL(xi,yi,θ), so the phase oracle for the total cost function (yellow boxes in the upper part of this figure) can be implemented by accumulating the individual phase encodings for each training sample (blue boxes). In this figure, we omit θ in L(xi, yi, θ) for simplicity. The colorful boxes with white borders represent the different data-embedding unitaries for the training data points; the colorful boxes with black borders (excluding the blue phase-encoding ones) represent the different mixers chosen from the mixer pool.
Number of qubits needed for quantum training by AC-QAOA.
FIG. 15. Action of the controlled unitary P. In this figure, the upper register is the parameter register and the lower register is the QNN register. θ = (θ_1, . . . , θ_M) is the set of trainable parameters in the QNN, and U(θ) is the unitary of the QNN with the corresponding parameters. The qubits in the parameter register act as control qubits on the rotation gates in the QNN. The controlled operations (in the dotted blue box) are denoted as P. When P acts on a superposition of parameter states Σ_θ ω_θ |θ⟩, the output state is Σ_θ ω_θ |θ⟩ ⊗ U(θ)|0⟩, in which the parameter register and the QNN register are entangled.
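The block structure of P can be illustrated with a minimal numpy sketch: a one-qubit parameter register selecting between two candidate angles (illustrative values) of a single R_y rotation standing in for the QNN:

```python
import numpy as np

def ry(theta):
    # Single-qubit rotation R_y(θ); stands in for the QNN unitary U(θ).
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# Two candidate angles, indexed by the computational basis of a
# one-qubit parameter register (illustrative values).
angles = [0.0, np.pi / 2]

# P = Σ_θ |θ><θ| ⊗ U(θ): block-diagonal controlled unitary.
P = np.zeros((4, 4))
for j, theta in enumerate(angles):
    P[2 * j:2 * j + 2, 2 * j:2 * j + 2] = ry(theta)

# Uniform superposition over parameters, QNN register in |0>.
state = np.kron(np.array([1, 1]) / np.sqrt(2), np.array([1, 0]))
out = P @ state

# Output is Σ_θ ω_θ |θ> ⊗ U(θ)|0>: the registers are entangled.
for j, theta in enumerate(angles):
    expected = ry(theta) @ np.array([1, 0]) / np.sqrt(2)
    assert np.allclose(out[2 * j:2 * j + 2], expected)
```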

TABLE II. QNN cost functions for two types of tasks. Here we present the cost functions for two tasks: for generating the ground state of a given Hamiltonian T (we use T instead of H here, and assume T is Hermitian), the cost function is chosen to be the expectation value of T; for generating a pure state |ψ⟩ = T|0⟩ (with T a given unitary), the cost function is chosen to be the fidelity between the state generated by the QNN and the state |ψ⟩ = T|0⟩.
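The two cost functions can be made concrete on a single qubit. This is a toy sketch with an assumed one-parameter R_y "QNN" and assumed choices T = Z (Hermitian case) and T = X (unitary case), none of which come from the paper:

```python
import numpy as np

# Toy one-qubit instances of the two cost functions.
T_herm = np.diag([1.0, -1.0])           # given Hermitian T (here Pauli-Z)
T_unit = np.array([[0, 1], [1, 0]])     # given unitary T (here Pauli-X)

def qnn_state(theta):
    # Output state U(θ)|0> of a minimal one-parameter QNN (an R_y rotation).
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

psi = qnn_state(0.7)

# Task 1: ground-state generation — cost is the expectation value <ψ|T|ψ>.
cost_expectation = np.real(psi.conj() @ T_herm @ psi)

# Task 2: pure-state generation — fidelity |<ψ_target|ψ>|^2 with
# |ψ_target> = T|0>; minimizing 1 − fidelity drives U(θ)|0> toward T|0>.
target = T_unit @ np.array([1, 0])
cost_infidelity = 1 - abs(np.vdot(target, psi)) ** 2
```
For this R_y ansatz the expectation value is cos θ, so gradient descent on it would steer θ toward π, where U(θ)|0⟩ is the ground state of Z.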
FIG. 19. Amplitude encoding by swap test. This circuit performs the swap test depicted in Fig. 11 in parallel for multiple P_j.
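A single instance of the underlying swap test can be simulated directly. The sketch below, with illustrative single-qubit states, checks the standard identity P(ancilla = 0) = (1 + |⟨ψ|φ⟩|²)/2:

```python
import numpy as np

def swap_test_p0(psi, phi):
    # Swap test: ancilla Hadamard, controlled-SWAP(ψ, φ), Hadamard;
    # returns P(ancilla = 0) = (1 + |<ψ|φ>|^2) / 2.
    d = len(psi)
    SWAP = np.zeros((d * d, d * d))
    for a in range(d):
        for b in range(d):
            SWAP[a * d + b, b * d + a] = 1
    cswap = np.block([[np.eye(d * d), np.zeros((d * d, d * d))],
                      [np.zeros((d * d, d * d)), SWAP]])
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    state = np.kron(H @ np.array([1, 0]), np.kron(psi, phi))
    state = np.kron(H, np.eye(d * d)) @ (cswap @ state)
    return np.linalg.norm(state[:d * d]) ** 2

# Toy single-qubit states (illustrative angles); overlap is cos(0.7).
psi = np.array([np.cos(0.4), np.sin(0.4)])
phi = np.array([np.cos(1.1), np.sin(1.1)])
p0 = swap_test_p0(psi, phi)
assert np.isclose(p0, (1 + abs(np.vdot(psi, phi)) ** 2) / 2)
```
The parallel version in the figure runs one such test per parameter configuration, coherently indexed by the parameter register.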
Amplitude encoding by Hadamard test. This circuit performs the Hadamard test depicted in Fig. 14 in parallel for multiple P_j. Here, P_j denotes the QNN with a specific (the j-th) parameter configuration. To achieve the Hadamard test in parallel, we add an extra register, the parameter register, as the control of P_j: each computational basis state |j⟩ of the parameter register corresponds to a specific parameter configuration of P_j. As illustrated in Fig. 15, once the parameter register is in a superposition state (prepared by the Hadamard gates H^{⊗d_r}), the corresponding P_j are in superposition. It can be proven that the entire circuit in the dotted green box (denoted as U) can be expressed as U = Σ_j |j⟩⟨j| ⊗ U_j, where U_j is the Hadamard-test unitary for P_j defined in Fig. 14. This indicates that U effectively performs the Hadamard test in parallel for multiple P_j. Recall that the ordinary Hadamard test U_j encodes ⟨0|P_j† T P_j|0⟩ in the amplitude of its output state (Eq. 14 and Eq. 15); here the "parallel Hadamard test" U encodes the QNN cost function ⟨0|P_j† T P_j|0⟩ in the amplitudes of a superposition of P_j (QNNs) with different parameters.
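One block U_j of this construction is an ordinary Hadamard test, which can be simulated in a few lines. The toy instance below assumes T = Z and P_j|0⟩ = R_y(θ_j)|0⟩ (illustrative choices), so the encoded amplitude ⟨0|P_j† T P_j|0⟩ equals cos θ_j:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def hadamard_test_real(V, phi):
    # Hadamard test: ancilla in |0>, H, controlled-V, H.
    # Then P(ancilla=0) − P(ancilla=1) = Re <phi|V|phi>.
    d = len(phi)
    ctrl_V = np.block([[np.eye(d), np.zeros((d, d))],
                       [np.zeros((d, d)), V]])
    state = np.kron(H @ np.array([1, 0]), phi)   # ancilla ⊗ system
    state = np.kron(H, np.eye(d)) @ (ctrl_V @ state)
    p0 = np.linalg.norm(state[:d]) ** 2
    p1 = np.linalg.norm(state[d:]) ** 2
    return p0 - p1

# Toy instance: observable T = Pauli-Z, QNN state P_j|0> = R_y(θ_j)|0>.
T = np.diag([1.0, -1.0])
theta_j = 0.9
phi = np.array([np.cos(theta_j / 2), np.sin(theta_j / 2)])

estimate = hadamard_test_real(T, phi)
assert np.isclose(estimate, np.cos(theta_j))   # <0|P_j† T P_j|0> = cos θ_j
```
The parallel circuit U = Σ_j |j⟩⟨j| ⊗ U_j applies exactly this unitary blockwise, one block per computational basis state of the parameter register.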
Major steps in the construction of the Grover Oracle. Step 0: We initialize the system by applying Hadamard gates on the parameter register, leading to the state

TABLE III. Performance of quantum training by Grover Adaptive Search. Here we present the number of "controlled-QNN" runs, the number of QNN runs, and the number of measurements needed in quantum training by Grover Adaptive Search. In this table, r is the number of parameters (rotation angles) in the QNN, 1 − ϵ1 is the probability of success of the phase estimation, ϵ2 is the precision we set for the evaluation of the cost function using amplitude estimation, ϵ0 is the precision of each angle value, and s is the number of global optima of the QNN cost function. [26]

FIG. 25. Circuit for the amplitude encoding of the cost function for VQE. Here we use the Hadamard-test circuit for the amplitude encoding of the cost function, as detailed in III A 1 b. We use the technique of "Linear Combinations of Unitaries" (LCU) [64] to implement the given Hamiltonian H = Σ_i a_i U_i. The unitary oracles W, H_LCU are defined as
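The LCU construction can be checked on a toy two-term Hamiltonian. The sketch below assumes H = a_0·I + a_1·X with made-up coefficients, builds a prepare unitary W and a SELECT oracle, and verifies that postselecting the ancilla on |0⟩ applies H/λ (with λ = Σ_i a_i) to the system register:

```python
import numpy as np

# Toy LCU instance: H = a0*I + a1*X (illustrative coefficients).
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
coeffs = np.array([0.7, 0.5])
unitaries = [I2, X]
lam = coeffs.sum()                      # λ = Σ_i a_i

# W (prepare): maps ancilla |0> to Σ_i sqrt(a_i/λ)|i>; any unitary with
# that first column works — here a real 2x2 rotation.
amps = np.sqrt(coeffs / lam)
W = np.array([[amps[0], -amps[1]],
              [amps[1],  amps[0]]])

# SELECT = Σ_i |i><i| ⊗ U_i.
select = np.zeros((4, 4))
for i, U in enumerate(unitaries):
    select[2 * i:2 * i + 2, 2 * i:2 * i + 2] = U

# Full LCU circuit: (W† ⊗ I) · SELECT · (W ⊗ I).
circuit = np.kron(W.conj().T, I2) @ select @ np.kron(W, I2)

# The ancilla-|0> block of the output equals (H/λ)|ψ>.
psi = np.array([np.cos(0.3), np.sin(0.3)])
out = circuit @ np.kron(np.array([1, 0]), psi)
block = out[:2]
H_mat = coeffs[0] * I2 + coeffs[1] * X
assert np.allclose(block, (H_mat / lam) @ psi)
```
The postselection succeeds with probability ‖(H/λ)|ψ⟩‖², which is why small λ (tight coefficient sums) makes LCU more efficient.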