Setup
To examine the effectiveness of LL, we use it to train a circuit with fully connected layers as described in Section 3. While fully connected layers are not realistic on NISQ hardware, we choose this configuration for our numerical investigations because it leads circuits to converge to a 2-design with the smallest number of qubits and layers (McClean et al. 2018), which allows us to reduce the computational cost of our simulations while examining some of the most challenging situations. To compare the performance of LL and CDL, we perform binary classification on the MNIST data set of handwritten digits, where the circuit learns to distinguish between the numbers six and nine. We use the binary cross-entropy as the training objective function, given by:
$$ \mathcal{L}(\boldsymbol{\theta}) = - \left( y \log{(E(\boldsymbol{\theta}))} + (1-y) \log{(1-E(\boldsymbol{\theta}))}\right)~, $$
(5)
where \(\log\) is the natural logarithm, E(𝜃) is the expectation value of a measurement in the Z-direction, \(M = Z_{o}\), on qubit o, rescaled to lie between 0 and 1 instead of −1 and 1, y is the correct label value for a given sample, and 𝜃 are the parameters of the PQC. The loss is computed as the average binary cross entropy over a batch of samples. In this case, the partial derivative of the loss function is given by:
$$ \frac{\partial \mathcal{L}(\boldsymbol{\theta}) }{\partial \theta_{i}} = - \left( y \frac{1}{E(\boldsymbol{\theta})} \frac{\partial E(\boldsymbol{\theta})}{\partial \theta_{i}} - (1-y) \frac{1}{1 - E(\boldsymbol{\theta})} \frac{\partial E(\boldsymbol{\theta})}{\partial \theta_{i}}\right)~. $$
(6)
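For concreteness, Eqs. 5 and 6 can be evaluated in a few lines of NumPy. This is a minimal sketch with our own helper names (not taken from any particular library); it already includes the clipping to \([10^{-15}, 1-10^{-15}]\) that we describe later in this section:

    import numpy as np

    EPS = 1e-15  # clipping bound, discussed later in this section

    def bce_loss(E, y):
        # Eq. 5, averaged over a batch: E holds rescaled expectation
        # values in [0, 1], y the binary labels (0 = six, 1 = nine).
        E = np.clip(E, EPS, 1 - EPS)  # avoid log(0) singularities
        return float(np.mean(-(y * np.log(E) + (1 - y) * np.log(1 - E))))

    def bce_grad_wrt_E(E, y):
        # dL/dE from Eq. 6; multiplied by dE/dtheta_i (Eq. 9) this
        # gives the full partial derivative of the loss.
        E = np.clip(E, EPS, 1 - EPS)
        return -(y / E - (1 - y) / (1 - E))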
To calculate the objective function value, we take the expectation value of the observable M with respect to the state prepared by the circuit:
$$ E(\boldsymbol{\theta}) = \langle{\psi}|U^{\dagger}(\boldsymbol{\theta}) M U(\boldsymbol{\theta})|{\psi}\rangle, $$
(7)
where |ψ〉 is the initial state of the circuit, prepared from the training data set. The objective function now takes the form \(\mathcal{L}(E(\boldsymbol{\theta}))\), and the partial derivative with respect to parameter \(\theta_{i}\) follows from the chain rule as:
$$ \frac{\partial \mathcal{L}}{\partial \theta_{i}} = \frac{\partial \mathcal{L}}{\partial E(\boldsymbol{\theta})} \cdot \frac{\partial E(\boldsymbol{\theta})}{\partial \theta_{i}}~. $$
(8)
To compute gradients of E(𝜃), we use the parameter shift rule (Schuld et al. 2019), where the partial derivative of a circuit's expectation value is calculated as the difference of two shifted expectation values:
$$ \begin{array}{@{}rcl@{}} \frac{\partial E(\boldsymbol{\theta})}{\partial \theta_{i}} = r\left( \langle{\psi}| U^{\dagger}(\boldsymbol{\theta} + s \hat \theta_{i}) M U(\boldsymbol{\theta} + s\hat \theta_{i}) |{\psi}\rangle\right.\\ \left. - \langle{\psi}| U^{\dagger}(\boldsymbol{\theta} -s \hat \theta_{i}) M U(\boldsymbol{\theta} - s \hat \theta_{i}) |{\psi}\rangle\right)~, \end{array} $$
(9)
where \(\hat{\theta}_{i}\) is a unit vector in the direction of the i-th component of 𝜃, s = π/(4r), and r = 0.5 (so that s = π/2).
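As an illustration, the shift rule of Eq. 9 can be wrapped generically around any routine that estimates E(𝜃). The sketch below assumes a hypothetical callable `expectation` and is not tied to a specific framework; on hardware, each call would itself be an average over m shots:

    import numpy as np

    def parameter_shift_gradient(expectation, theta, r=0.5):
        # Eq. 9 for every parameter: `expectation` maps a parameter
        # vector to <psi| U^dag(theta) M U(theta) |psi>.
        s = np.pi / (4 * r)  # s = pi/2 for r = 0.5
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            shift = np.zeros_like(theta)
            shift[i] = s
            grad[i] = r * (expectation(theta + shift) - expectation(theta - shift))
        return grad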
We note that in the numerical implementation, care must be taken to avoid singularities in the training process when E(𝜃) approaches 0 or 1; we treat the loss and its derivative alike by clipping values to lie in \([10^{-15}, 1-10^{-15}]\). We choose the last qubit in the circuit as the readout o, as shown in Fig. 2. An expectation value of 0 (1) denotes a classification result of class six (nine). As we perform binary classification, we encode the classification result into one measurement qubit for ease of implementation. This can be generalized to multi-label classification by encoding results into multiple qubits, assigning the measurement of one observable to one data label. We use the Adam optimizer (Kingma and Ba 2015) with varying learning rates to calculate parameter updates and leave the remaining Adam hyperparameters at their published default values.
To feed training data into the PQC, we use qubit encoding in combination with principal component analysis (PCA), following Grant et al. (2018). Due to the small circuits used in this work, we have to heavily downsample the MNIST images. For this, a PCA is run on the data set, and the principal components with the highest variance, as many as there are qubits, are used to encode the data into the PQC. This is done by scaling the component values to lie within [0,2π) and using the scaled values to parametrize a data layer consisting of local X-gates. In the case of 10 qubits, this means that each image is represented by a vector \(\vec{d}\) with 10 components, and the data layer can be written as \({\prod}_{i=1}^{10} \exp(-i d_{i} X_{i})\).
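A minimal NumPy sketch of this preprocessing (our own helper, computing the PCA via an SVD; note that the maximum component value lands on the closed endpoint 2π, which as a rotation angle is equivalent to 0):

    import numpy as np

    def encode_images(images, n_qubits=10):
        # Project flattened images onto their leading principal
        # components and rescale each component to [0, 2*pi], giving
        # the angles d_i of the data layer prod_i exp(-i d_i X_i).
        X = images.reshape(len(images), -1).astype(float)
        X = X - X.mean(axis=0)                       # center the data
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        comps = X @ Vt[:n_qubits].T                  # leading components
        lo, hi = comps.min(axis=0), comps.max(axis=0)
        return (comps - lo) / (hi - lo) * 2 * np.pi  # angles per image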
As a direct consequence of the results in McClean et al. (2018), different circuits of the same size behave increasingly similarly during training as they grow more random. This means that we can pick a random circuit instance that, given its number of qubits and layers, lies in the 2-design regime shown in Fig. 1, and gather representative training statistics on this instance. As noted in Section 3, an LL scheme is most advantageous in a setting where training the full circuit is infeasible; we therefore pick a circuit with 8 qubits and 21 layers for our experiments, at which size the circuit is in this regime. When using only a subset of qubits in a circuit as readout, a randomly generated layer might not be able to significantly change its output. For example, if in our simple circuit in Fig. 2, \(U_{5}(\theta_{1,5})\) is a rotation around the Z axis followed only by CZ gates, no change in \(\theta_{1,5}\) will affect the measurement outcome on the bottom qubit. When choosing generators randomly from {X,Y,Z} in this setting, there is a 1/3 chance of picking an unsuitable generator. To avoid this effect, we enforce at least one X gate in each set of layers that is trained, as sketched below. For our experiments, we take one random circuit instance and perform LL and CDL with varying hyperparameters.
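One way to enforce this constraint is to resample a single position whenever a freshly drawn set of layers contains no X generator; this is a hedged sketch of that idea, and the original implementation may differ in detail:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    def draw_generators(n_qubits, n_layers):
        # Draw a random rotation generator per qubit and layer for a
        # newly trained set of layers, resampling one position if no X
        # was drawn, so the set can always affect the Z-basis readout.
        gens = rng.choice(["X", "Y", "Z"], size=(n_layers, n_qubits))
        if not (gens == "X").any():
            gens[rng.integers(n_layers), rng.integers(n_qubits)] = "X"
        return gens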
Sampling requirements
To give insight into the sampling requirements of our algorithm, we first determine the quantities that need to be sampled. Our training algorithm makes use of gradients of the objective function that are sampled from the circuit on the quantum computer via the parameter shift rule, as described in Section 4.1. The precision of our gradients thus depends on the precision of the two expectation values on the r.h.s. of Eq. 9. Estimating an expectation value to within error 𝜖 requires a number of measurements N that scales as \(\mathcal{O}(\frac{1}{\epsilon^{\alpha}})\) with α > 1 (Knill et al. 2007). For most near-term implementations using operator averaging, α = 2, resembling classical central limit theorem statistics of sampling. This means that the magnitude of the partial derivatives \(\frac{\partial E}{\partial \theta_{i}}\) of the objective function directly influences the number of samples needed, since 𝜖 must be smaller than those magnitudes, which determines the signal-to-noise ratio achievable at a fixed sampling cost. If all of the magnitudes of \(\frac{\partial E}{\partial \theta_{i}}\) are much smaller than 𝜖, a gradient-based algorithm will exhibit dynamics resembling a random walk rather than optimization.
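A back-of-the-envelope version of this scaling, with α = 2 and a hypothetical safety margin of our own choosing:

    def shots_for_gradient(largest_grad_component, safety=10.0):
        # With alpha = 2 (central-limit scaling N ~ 1/eps^2), pick the
        # shot count per expectation value so that the statistical error
        # eps is `safety` times smaller than the largest gradient entry.
        eps = largest_grad_component / safety
        return int(1.0 / eps ** 2)

For example, resolving a gradient component of magnitude 0.05 with a tenfold margin requires shots_for_gradient(0.05) = 40000 shots per expectation value; halving the gradient magnitude quadruples this cost.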
Comparison to CDL strategies
We compare LL to a simple approach for avoiding initialization on a barren plateau: setting all parameters in the circuit to zero and then following a CDL training strategy. We argue that, considering the sampling requirements of training PQCs described in Section 4.2, an LL strategy will be more frugal in the number of samples it needs from the QPU. Shallow circuits produce gradients of larger magnitude, as can be seen in Fig. 1, and the number of samples \(1/\epsilon^{2}\) we need to achieve precision 𝜖 is dictated by the largest component of the gradient. This difference shows up naturally when the number of samples is treated as a hyperparameter for improving time to solution in training. In this low-sample regime, training progress depends largely on the learning rate. A small batch size and a low number of measurements increase the variance of objective function values. This can be balanced by choosing a lower learning rate, at the cost of taking more optimization steps to reach the same objective function value. We argue that the CDL approach will need much smaller learning rates to compensate for smaller gradient values and the simultaneous update of all parameters in each training step, and will therefore need more samples from the QPU to reach objective function values similar to LL. We compare both approaches w.r.t. their probability of reaching a given accuracy on the test set and, based on that, infer the number of repeated restarts one would expect in a real-world experiment.
In order to easily translate the results presented here into experimental impact, we also compute an average runtime by assuming a sampling rate of 10 kHz. We consider this value realistic in the near-term future, based on the current superconducting qubit experiments in Arute et al. (2019), which were performed at a sampling rate of 5 kHz, not including cloud latency effects. The cumulative number of individual measurements taken from a quantum device during training is defined as:
$$ r_{i} = r_{i-1} + 2n_{p} m b~, $$
(10)
where \(n_{p}\) is the number of parameters (multiplied by two to account for the parameter shift rule shown in Section 4.1), m is the number of measurements taken from the quantum device for each expectation value estimation, and b is the batch size. This gives us a realistic estimate of the resources used by both approaches in an experimental setting on a quantum device.
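Accumulating Eq. 10 over a whole training run and converting to wall time is straightforward; this sketch uses our own helper name and the hyperparameter values appearing later in this section:

    def cumulative_measurements(params_per_step, m=10, b=20, rate_hz=10_000):
        # Accumulate Eq. 10 over all training steps: each step costs
        # 2 * n_p * m * b shots (factor 2 from the parameter shift rule).
        # Returns the total shot count and the wall time at `rate_hz`.
        r = 0
        for n_p in params_per_step:  # trainable parameters at each step
            r += 2 * n_p * m * b
        return r, r / rate_hz

For LL, params_per_step contains only the currently active parameters, while for CDL it stays at the full parameter count in every step, which is where the runtime gap discussed below comes from.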
Numerical results
For the following experiments, we use a circuit with 8 qubits, 1 initial layer, and 20 added layers, i.e., 21 layers in total. As can be seen in Fig. 1, this is a depth at which a fully random circuit is expected to converge to a 2-design for the all-to-all connectivity that we chose. After a hyperparameter search over p, q, and \(e_{l}\), we set the LL hyperparameters to p = q = 2 and \(e_{l} = 10\), with one initial layer that is always active during training. This means that three layers are trained at once in phase one of the algorithm, and 10 and 11 layers are trained as one contiguous partition each in phase two (a schematic sketch of this schedule is given below). For CDL, the same circuit is trained with all-zero initialization.
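This is a schematic reading of those hyperparameters as a layer-activation schedule; the exact alternation order in the original implementation may differ:

    def phase_one_steps(n_added_layers=20, p=2):
        # Indices of layers trained together in each phase-one step:
        # layer 0 is the always-active initial layer, and p = 2 new
        # layers are appended per step while earlier ones are frozen,
        # so three layers are trained at once.
        for step in range(n_added_layers // p):
            yield [0] + list(range(1 + step * p, 1 + (step + 1) * p))

    def phase_two_partitions(n_layers=21):
        # Phase two alternates between two contiguous partitions of
        # the finished circuit, here of 10 and 11 layers.
        half = n_layers // 2
        return [list(range(half)), list(range(half, n_layers))]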
We argue that LL not only avoids initialization on a plateau but is also less susceptible to randomization during training. In NISQ devices, this type of randomization is expected to come from two sources: (i) hardware noise and (ii) shot noise, i.e., measurement uncertainty. The smaller the values we want to estimate and the less exact the measurements we can take from a QPU, the more often we have to repeat these measurements to get an accurate result. Here, we investigate the robustness of both methods to shot noise. The hyperparameters we can tune are the number of measurements m, the batch size b, and the learning rate η. The randomization of circuits during training can be reduced by choosing smaller learning rates, which lessen the effect of each individual parameter update at the cost of more epochs to convergence. Therefore, we focus our hyperparameter search on the learning rate η, after fixing the batch size to b = 20 and the number of measurements to m = 10. This combination of m and b was chosen, for a fixed small m, after a search over b ∈ {20,50,100} for which both LL and CDL could produce successful runs that do not diverge during training. Lowering the batch size increases the variance in objective function values in much the same way as reducing the number of measurements, so these two values have to be tuned to match each other. In the remainder of this section, we show results for these hyperparameters and different learning rates for both methods. All results are based on 100 runs of the same hyperparameter configuration. We use 50 samples of each class to calculate the cross entropy during training, and another 50 samples per class to calculate the test error. To compute the test error, we let the model predict binary class labels for each presented sample, where a prediction ≤ 0.5 is interpreted as class 0 (sixes) and > 0.5 as class 1 (nines). The test error is then the average error over all classified samples.
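The test-error computation just described amounts to a thresholded comparison; a minimal sketch with our own function name:

    import numpy as np

    def test_error(predictions, labels):
        # Average classification error with the 0.5 decision threshold:
        # predictions <= 0.5 count as class 0 (sixes), > 0.5 as class 1
        # (nines); `predictions` are the rescaled expectation values.
        predicted_labels = (np.asarray(predictions) > 0.5).astype(int)
        return float(np.mean(predicted_labels != np.asarray(labels)))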
Figure 3 shows average runtimes of LL and CDL runs whose final average error on the test set is less than 0.5, the value corresponding to random guessing. We compute the runtime from the number of samples taken, as described in Section 4.3, assuming a sampling rate of 10 kHz. Here, LL reaches a lower error on the test set on average and also requires a lower runtime to get there. Compared to the CDL configuration with the highest success probability shown in Fig. 4b (red line), the best LL configuration (blue line) takes approximately half as much time to converge. This illustrates that LL not only increases the probability of successful runs but can also drastically reduce the runtime needed to train PQCs, by training only a subset of all parameters in a given training step. Note also that the test error of CDL with η = 0.05 and η = 0.01 slowly increases at later training steps, which might look like overfitting at first. Here it is important to emphasize that these are averaged results; what slowly increases is rather the percentage of circuits that have randomized or diverged at later training steps. The actual randomization in an individual run usually happens with a sudden jump in test error, after which the circuit cannot return to a regular training trajectory anymore.
Figure 4a shows the number of training repetitions one expects to perform before obtaining a run that reaches a given accuracy on the test set, where we define accuracy as \(1 - \mathrm{error}_{\mathrm{test}}\). One training run constitutes training the circuit for a fixed number of epochs, where the average training time for one run is shown in Fig. 3. An accuracy of 0.5 corresponds to random guessing, while an accuracy of around 0.73 is the highest that any of the performed runs reached and corresponds to the model classifying 73% of samples correctly. We note that in the noiseless setting shown in the Appendix, both LL and CDL manage to reach accuracies around 0.9, so the strong reduction in the number of measurements leads to a decrease in the final accuracy reached by all models. We find that LL performs well for learning rates of different magnitudes, such as η = 0.01 and η = 0.005, and that these configurations have an expected number of repetitions that stays almost constant as we increase the desired accuracy. On average, one needs fewer than two restarts to get a successful training run when using LL. For CDL, the number of repetitions increases as we require the test error to reach lower values. The best configurations were those with η = 0.001 and η = 0.005, which reach similarly low test errors as LL but need between 3 and 7 restarts to do so. This is due to randomization during training, caused by the high variance in objective function values and the simultaneous update of all parameters in each training step. In Fig. 4b, we show the probability of each configuration shown in (a) to reach a given accuracy on the test set. All CDL configurations have a probability lower than 0.3 of reaching an accuracy above 0.65, while LL reaches this accuracy with a probability of over 0.7 in both cases. This translates into the almost constant number of repetitions for LL runs in Fig. 4a. Due to the small number of measurements and the low batch size, some of the runs performed for both methods fail to learn at all, which is why none of the configurations has a success probability of 1 for all runs being better than random guessing.
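The repetition counts in Fig. 4a can be read as the mean of a geometric distribution over restarts, assuming independent runs: if a single run reaches accuracy a with probability P(a), taken from Fig. 4b, the expected number of runs until the first success is

$$ \mathbb{E}[R] = \frac{1}{P(a)}~. $$

This is consistent with the quoted numbers: P(a) > 0.7 for LL gives \(\mathbb{E}[R] < 1.5\), i.e., fewer than two restarts, while P(a) < 0.3 for CDL gives \(\mathbb{E}[R] > 3.3\).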