Subtleties in the trainability of quantum machine learning models

A new paradigm for data science has emerged, with quantum data, quantum models, and quantum computational devices. This field, called quantum machine learning (QML), aims to achieve a speedup over traditional machine learning for data analysis. However, its success usually hinges on efficiently training the parameters in quantum neural networks, and the field of QML is still lacking theoretical scaling results for their trainability. Some trainability results have been proven for a closely related field called variational quantum algorithms (VQAs). While both fields involve training a parametrized quantum circuit, there are crucial differences that make the results for one setting not readily applicable to the other. In this work, we bridge the two frameworks and show that gradient scaling results for VQAs can also be applied to study the gradient scaling of QML models. Our results indicate that features deemed detrimental for VQA trainability can also lead to issues such as barren plateaus in QML. Consequently, our work has implications for several QML proposals in the literature. In addition, we provide theoretical and numerical evidence that QML models exhibit further trainability issues not present in VQAs, arising from the use of a training dataset. We refer to these as dataset-induced barren plateaus. These results are most relevant when dealing with classical data, as here the choice of embedding scheme (i.e., the map between classical data and quantum states) can greatly affect the gradient scaling.


I. Introduction
The future of data analysis is incredibly exciting. The quantum revolution promises new kinds of data, new kinds of models, and new information processing devices. This is all made possible because small-scale quantum computers are currently available, while larger-scale ones are anticipated in the future [1]. The mere fact that users will run jobs on these devices, preparing interesting quantum states, implies that new datasets will be generated. These are called quantum datasets [2,3], as they exist on the quantum computer and hence must be analyzed on the quantum computer. Naturally, this has led to the proposal of new models, so-called Quantum Neural Networks (QNNs) [4], for analyzing (classifying, clustering, etc.) such data. Different architectures have been proposed for these models: dissipative QNNs [5], convolutional QNNs [6], recurrent QNNs [7], and others [8,9].
Using quantum computers for data analysis is often called Quantum Machine Learning (QML) [10,11]. This paradigm is general enough to also allow for the analysis of classical data. One simply needs an embedding map that first maps the classical data to quantum states, which can then be analyzed by the QNN [12-14]. Here, the hope is that by accessing the exponentially large dimension of the Hilbert space and quantum effects like superposition and entanglement, QNNs can outperform their classical counterparts (i.e., neural networks) and achieve a coveted quantum advantage [15-19].
Despite the tremendous promise of QML, the field is still in its infancy and rigorous results are needed to guarantee its success. Similar to classical machine learning, here one also wishes to achieve small training error [20] and small generalization error [21,22], with the second usually hinging on the first. Thus, it is crucial to study how certain properties of a QML model can hinder or benefit the trainability of its parameters.
Given the large body of literature studying barren plateaus in Variational Quantum Algorithms (VQAs), the natural question that arises is: Are gradient scaling and barren plateau results also applicable to QML? Making this connection is crucial, as it is not uncommon for QML proposals to employ features that have been shown to be detrimental to the trainability of VQAs (such as deep unstructured circuits [35,36] or global measurements [36]). Unfortunately, it is not straightforward to directly employ VQA gradient scaling results in QML models. Moreover, one can expect that, in addition to the trainability issues arising in variational algorithms, other problems can appear in QML schemes. This is due to the fact that QML models are generally more complex. For instance, in QML one needs to deal with datasets [3], which further require the use of an embedding scheme when dealing with classical data [12-14,48]. Additionally, QML loss functions can be more complicated than VQA cost functions, as the latter are usually linear functions of expectation values of some set of operators.
In this work we study the trainability and the existence of barren plateaus in QML models. Our work represents a general treatment that goes beyond previous analyses of gradient scaling and trainability in specific QML models [49-56]. Our main results are two-fold. First, we rigorously connect the scaling of gradients in VQA-type cost functions and QML loss functions, so that barren plateau results in the variational algorithms literature can be used to study the trainability of QML models. This implies that known sources of barren plateaus extend to the realm of training QNNs, and thus that many proposals in the literature need to be revised. Second, we present theoretical and numerical evidence that certain features of the datasets and of the embedding can additionally greatly affect the model's trainability. These results show that additional care must be taken when studying the trainability of QML models. Moreover, they constitute a novel source of untrainability: dataset-induced barren plateaus.

A. Quantum Machine Learning
In this work, we consider supervised Quantum Machine Learning (QML) tasks. First, as depicted in Fig. 1(a), let us consider the case where one has classical data. Here, the dataset is of the form {(x_i, y_i)}, where x_i ∈ X is some classical data (e.g., a real-valued vector), and y_i ∈ Y are values or labels associated with each x_i according to some (unknown) model h : X → Y. One generates a training set S = {(x_i, y_i)}_{i=1}^N of size N by sampling from the dataset according to some probability distribution. Using S, one trains a QML model, i.e., a parametrized map h_θ : X → Y, such that its predictions agree with those of h with high probability on S (low training error) and on previously unseen cases (low generalization error).
For the QML model to access the exponentially large dimension of the Hilbert space, one needs to encode the classical data into a quantum state. As shown in Fig. 1(a), one initializes m qubits in a fiduciary state such as |0⟩ = |0⟩^⊗m and sends it through a Completely Positive Trace-Preserving (CPTP) map E^E_{x_i}, so that the outputs are of the form

ρ_i = E^E_{x_i}(|0⟩⟨0|) .    (1)

For instance, E^E_{x_i} can be a unitary channel given by the action of a parametrized quantum circuit whose gate rotation angles depend on the values in x_i. While in general the embedding can be in itself trainable [14,57,58], here the embedding is fixed.

FIG. 1. An embedding channel E^E_{x_i} maps the classical data onto quantum states ρ_i. Alternatively, quantum data points are of the form {ρ_i, y_i}, where the ρ_i are quantum states in a Hilbert space H, each with an associated label y_i ∈ Y (e.g., ferromagnetic/paramagnetic phases). The quantum states (coming from classical or quantum datasets) are then sent through a parametrized quantum neural network (QNN), E^QNN_θ. By performing measurements on the output states one estimates expectation values ℓ_i ≡ ℓ_i(θ; y_i), which are then used to estimate the loss function L(θ). Finally, a classical optimizer decides how to adjust the QNN parameters, and the loop repeats multiple times to minimize the loss function.
Next, consider the case when the dataset is composed of quantum data [3,6]. As one can see from Fig. 1(b), here the dataset is of the form {(ρ_i, y_i)}, where ρ_i ∈ R, and where R ⊆ H is a set of quantum states in the Hilbert space H. Then, y_i ∈ Y are values or labels associated with each quantum state according to some (unknown) model h : R → Y. In what follows we simply denote as ρ_i the states obtained from the dataset, and clarify when appropriate whether they were produced using an embedding scheme or not.
The quantum states ρ_i are then sent through a QNN, i.e., a parametrized CPTP map E^QNN_θ. Here, the outputs of the QNN are n-qubit quantum states (with n ⩽ m) of the form

ρ_i(θ) = E^QNN_θ(ρ_i) .    (2)

We note that Eq. (2) encompasses most widely used QNN architectures. For instance, in many cases E^QNN_θ is simply a trainable parametrized quantum circuit [5-8]. Here, θ is a vector of continuous variables (e.g., gate rotation angles). More generally, θ could also contain discrete variables (gate placements) [59-61]. However, for simplicity, here we consider the case when one only optimizes over continuous QNN parameters.
The model predictions are obtained by measuring qubits of the QNN output states ρ_i(θ) in Eq. (2). That is, for a given data point (ρ_i, y_i), one estimates the quantity

ℓ_i(θ; y_i) = Tr[O_{y_i} ρ_i(θ)] ,    (3)

where O_{y_i} is a Hermitian operator. Here we recall that we use the term global measurement when O_{y_i} acts non-trivially on all n qubits. On the other hand, we say that the measurement is local when O_{y_i} acts non-trivially on a small constant number of qubits (e.g., one or two qubits). Finally, throughout training, the success of the QML model is quantified via a loss function L(θ) that takes the form

L(θ) = f(ℓ_1(θ; y_1), . . . , ℓ_N(θ; y_N)) ,    (4)

where f(·) is first-order differentiable. By employing a classical optimizer, one trains the parameters in the QNN to solve the optimization task

θ* = arg min_θ L(θ) .    (5)

Finally, one tests the generalization performance by assigning labels with the optimized model h_θ*.
For the purpose of clarity, let us exemplify the previous concepts. Consider a binary classification task where the labels are Y = {−1, 1}. Here, one possible option to assign label 1 (−1) is to measure globally all qubits in the computational basis and estimate the number of output bitstrings with even (odd) parity. That is, the probability of assigning label y_i is given by the expectation value

p_i(y_i|θ) = Tr[O_{y_i} ρ_i(θ)] ,    (6)

where O_1 = Σ_{z:even} |z⟩⟨z| and O_{−1} = Σ_{z:odd} |z⟩⟨z|. Alternatively, rather than computing the probability of assigning a given label, the measurement outcome can be a label prediction by itself. For instance, the latter can be achieved with the global measurement

ỹ_i(θ) = Tr[Z^⊗n ρ_i(θ)] ,    (7)

as this is a number in [−1, 1]. Here, let us make several important remarks. First, we note that Eqs. (6) and (7) are precisely of the form in Eq. (3), where one computes the expectation value of an operator over the output state of the QNN. Then, we remark that both approaches are equivalent due to the fact that O_1 − O_{−1} = Z^⊗n, and thus

ỹ_i(θ) = p_i(1|θ) − p_i(−1|θ) ,    (8)

and conversely

p_i(±1|θ) = (1 ± ỹ_i(θ))/2 .    (9)

Finally, for the purpose of accuracy testing one needs to map from the continuous set of outcomes of ỹ_i(θ) or p_i(y_i|θ) to the discrete set Y. Thus, the QML model h_θ assigns the label with the largest probability, i.e., h_θ(ρ_i) = arg max_{y∈Y} p_i(y|θ). Here it is also worth recalling two widely used loss functions in the literature. First, the mean squared error loss function is given by

L_mse(θ) = (1/N) Σ_{i=1}^N (ỹ_i(θ) − y_i)² ,    (10)

where ỹ_i(θ) is the model-predicted label for the data point ρ_i obtained through some expectation value as in Eq. (3) (see for instance the label of Eq. (7)). Then, the negative log-likelihood loss function is defined as

L_log(θ) = −(1/N) Σ_{i=1}^N ln p_i(y_i|θ) ,    (11)

where p_i(y_i|θ) is the probability of assigning label y_i to the data point ρ_i obtained through some expectation value as in Eq. (3) (see for instance the probability of Eq. (6)). Moreover, we recall that the expectation of the negative log-likelihood Hessian is given by the Fisher Information (FI) matrix. In practice, one can estimate the FI matrix via the empirical FI matrix

F̃(θ) = (1/N) Σ_{i=1}^N ∇_θ ln p_i(y_i|θ) [∇_θ ln p_i(y_i|θ)]^T .    (12)

The FI matrix plays a crucial role in natural gradient optimization methods [62]. Here, the FI measures the sensitivity of the output probability distributions to parameter updates. Hence, such optimization methods rely on the estimation of the FI matrix to optimize the QNN parameters. Below we discuss how such optimization methods are affected when the QML model exhibits a vanishing gradient.
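To make the parity construction above concrete, the following numpy sketch builds O_1, O_{−1}, and Z^⊗n explicitly, and checks the identity O_1 − O_{−1} = Z^⊗n together with the relation p_i(±1|θ) = (1 ± ỹ_i(θ))/2. The choices n = 3 and a random pure state (standing in for the QNN output ρ_i(θ)) are arbitrary and made only for the demonstration:

```python
import numpy as np

n = 3  # small, arbitrary system size for the demonstration
d = 2 ** n

# Projectors onto even- and odd-parity bitstrings
parities = np.array([(-1.0) ** bin(z).count("1") for z in range(d)])
O_plus = np.diag((parities == 1).astype(float))    # O_1: even parity
O_minus = np.diag((parities == -1).astype(float))  # O_{-1}: odd parity

# Z^{⊗n} built as a Kronecker product
Z = np.diag([1.0, -1.0])
Zn = Z
for _ in range(n - 1):
    Zn = np.kron(Zn, Z)

# Identity O_1 - O_{-1} = Z^{⊗n}
assert np.allclose(O_plus - O_minus, Zn)

# A random pure state standing in for the QNN output rho_i(theta)
rng = np.random.default_rng(0)
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())

y_tilde = np.real(np.trace(Zn @ rho))     # predicted label in [-1, 1]
p_plus = np.real(np.trace(O_plus @ rho))  # probability of label +1
p_minus = np.real(np.trace(O_minus @ rho))  # probability of label -1

# p_i(+-1 | theta) = (1 +- y_tilde) / 2
assert np.isclose(p_plus, (1 + y_tilde) / 2)
assert np.isclose(p_minus, (1 - y_tilde) / 2)
```

Both label-assignment conventions are thus interchangeable at the level of a single expectation value, which is why the text treats them as equivalent.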

B. Quantum Landscape Theory and Barren Plateaus
Recently, there has been a tremendous effort in developing the so-called Quantum Landscape Theory [20,63] to analyze properties of quantum loss landscapes, how they emerge, and how they affect the parameter optimization process. Here, one of the main topics of interest is that of Barren Plateaus (BPs). When a loss function exhibits a BP, the gradients vanish exponentially with the number of qubits, and thus an exponentially large number of measurement shots is needed to navigate the flat landscape.
Since BPs have been mainly studied in the context of Variational Quantum Algorithms (VQAs) [23,35-44], we here briefly recall that in a VQA implementation the goal is to minimize a linear cost function that is usually of the form

C(θ) = Tr[H U(θ) ρ U†(θ)] .    (13)

Here ρ is the initial state, U(θ) a trainable parametrized quantum circuit, and H a Hermitian operator. BPs for cost functions such as that in Eq. (13) have been shown to arise for global operators H [36], as well as due to certain properties of U(θ) such as its structure and depth [35,37,44], its expressibility [40,41], and its entangling power [38,39,50]. There are several ways in which the BP phenomenon can be characterized. The most common is through the scaling of the variance of cost function partial derivatives, as in a BP one has

Var[∂_ν C(θ)] ∈ O(1/α^n) ,    (14)

where ∂_ν C(θ) = ∂C(θ)/∂θ_ν, θ_ν ∈ θ, and α > 1.
Here, the variance is taken with respect to the set of parameters θ. We recall that here one also has E[∂_ν C(θ)] = 0, implying that gradients exponentially concentrate at zero. In addition, a BP can also be characterized through the concentration of cost function values, so that Var[C(θ_A) − C(θ_B)] ∈ O(1/β^n), where θ_A and θ_B are two randomly sampled sets of parameters, and β > 1 [63]. Finally, the presence of a BP can be diagnosed by inspecting the eigenvalues of the Hessian and the FI matrix, as these become exponentially vanishing with the system size [42,52] (see also Section III C below). Despite tremendous progress in understanding BPs for VQAs, there are only a few results which analyze the existence of BPs in QML settings [49-55]. In fact, while in the VQA community there is a consensus that certain properties of an algorithm should be avoided (such as global measurements), these same properties are still widely used in the QML literature (e.g., global parity measurements such as those in Eqs. (6) and (7)). Thus, bridging the VQA and QML communities would allow the use of trainability and gradient scaling results across both fields.
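The exponential decay of Eq. (14) can be seen already in a minimal toy model. For C(θ) = ⟨0|U†(θ) Z^⊗n U(θ)|0⟩ with U(θ) a single layer of single-qubit R_y rotations, one has C(θ) = Π_j cos θ_j, and for angles sampled uniformly from [−π, π] one finds Var[∂_ν C] = 2^{−n} exactly (since E[sin²] = E[cos²] = 1/2). The sketch below verifies this by Monte Carlo; the system sizes and sample count are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 20000

def grad_variance(n):
    """MC estimate of Var[dC/dtheta_1] for C(theta) = prod_j cos(theta_j),
    i.e. a single layer of R_y rotations measured with the global Z^{⊗n}."""
    thetas = rng.uniform(-np.pi, np.pi, size=(n_samples, n))
    # Analytic partial derivative with respect to theta_1
    grads = -np.sin(thetas[:, 0]) * np.prod(np.cos(thetas[:, 1:]), axis=1)
    return grads.var()

for n in (2, 4, 6, 8):
    # The exact value is E[sin^2] * prod_j E[cos^2] = 2^{-n}
    print(n, grad_variance(n), 2.0 ** -n)
```

Even this shallow, non-entangling circuit exhibits the 2^{−n} decay, illustrating that the global measurement alone is enough to flatten the landscape.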

C. VQAs versus QML
Let us first recall that, as previously mentioned, training a VQA or the QNN in a QML model usually implies optimizing the parameters in a quantum circuit. However, despite this similarity, there are some key differences between the VQA and QML frameworks, which we summarize in Table I. First, QML is based on training a QNN using data from a dataset. Moreover, the input quantum states to the QML model contain information that the model is trying to learn. On the other hand, in VQAs there is usually no training data, but rather a single input state sent through a parametrized quantum circuit. Here, such an initial state is generally an easy-to-prepare state such as |0⟩. Then, while VQAs generally deal with quantum states, QML models are envisioned to work both on classical and quantum data. Thus, when using a QML model with classical data, there is an additional encoding step that depends on the choice of embedding channel E^E_{x_i}. When the QML model deals with quantum data, the input states are usually non-trivial states.
Finally, we note that for most VQAs, the cost function of Eq. (13) is a linear function in the matrix elements of U(θ)ρU†(θ). For QML, however, the loss function of Eq. (4) can be a more complex non-linear function of the matrix elements of ρ_i(θ) (see for instance the log-likelihood in Eq. (11)). Here, it is still worth noting that the expectation values ℓ_i(θ; y_i) of Eq. (3) are exactly of the form of VQA cost functions, as both are linear functions of the parametrized output states.

III. Analytical Results
The differences between the VQA and QML frameworks discussed in the previous section mean that known gradient scaling results for VQAs do not automatically or trivially extend to the QML setting. Nevertheless, the goal of this section is to establish and formalize the connection between these two frameworks, as summarized in Fig. 2. Namely, in Section III A we link the variance of the partial derivative of QML loss functions of the form of Eq. (4) to that of partial derivatives of linear expectation values, which are the quantities of study for VQAs (see Eq. (13)). This allows us to show that, under standard assumptions, the landscapes of QML loss functions such as the mean squared error loss in Eq. (10) and the negative log-likelihood loss in Eq. (11) have BPs in all the settings that standard VQAs do. In addition, as we discuss in Section III B, additional care should be taken to guarantee the trainability of a QML model, as here one has to deal with datasets and embeddings. This leads us to introduce a new mechanism for untrainability arising from the dataset itself. Finally, in Section III C we demonstrate that under conditions which induce a BP for the negative log-likelihood loss function, the matrix elements of the empirical Fisher Information matrix also vanish exponentially.
A. Connecting the gradient scaling of QML loss functions and linear cost functions

In the following theorem we study the variance of the partial derivative of the generic loss function defined in Eq. (4). We upper bound this quantity by making a connection to the variance of partial derivatives of linear expectation values of the form of Eq. (3). As shown in Section B of the Appendix, the following theorem holds.
Theorem 1 (Variance of partial derivative of generic loss function). Consider the partial derivative of the loss function L(θ) in Eq. (4) taken with respect to a variational parameter θ_ν ∈ θ. We denote this quantity as ∂_ν L(θ) ≡ ∂L(θ)/∂θ_ν. The following inequality holds:

Var[∂_ν L(θ)] ⩽ ( Σ_{i=1}^N g_i √( Var[∂_ν ℓ_i(θ)] + (E[∂_ν ℓ_i(θ)])² ) )² ,    (15)

where we used ℓ_i(θ) as a shorthand notation for ℓ_i(θ; y_i) in Eq. (3), and where the variances and expectation values are taken over the parameters θ. Here we defined

g_i = max_ℓ |∂f/∂ℓ_i| ,    (16)

where ∂f/∂ℓ_i is the i-th entry of the Jacobian J of f with respect to ℓ = (ℓ_1, . . . , ℓ_N), and where max_ℓ |∂f/∂ℓ_i| denotes the maximum absolute value of the partial derivative of f(ℓ_i, y_i).
Theorem 1 establishes a formal relationship between gradients of generic loss functions and linear expectation values, which include linear cost functions in VQA frameworks. This result provides a bridge to bound the variance of partial derivatives of QML loss functions based on the known behavior of linear cost functions. As shown next, this allows us to use gradient scaling results from the VQA literature in QML settings.
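As a sanity check, an inequality of this Cauchy-Schwarz type, Var[∂_ν L] ⩽ (Σ_i g_i √(E[(∂_ν ℓ_i)²]))² with g_i bounding |∂f/∂ℓ_i|, can be probed numerically on a classical toy model. Below, ℓ_i(θ) = cos(θ + a_i) stands in for the linear expectation values of Eq. (3) and f is the mean squared error combination; all offsets a_i, labels y_i, and sample counts are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
a = rng.uniform(-np.pi, np.pi, size=N)  # toy per-data-point offsets
y = rng.choice([-1.0, 1.0], size=N)     # toy labels

thetas = rng.uniform(-np.pi, np.pi, size=100000)
ell = np.cos(thetas[:, None] + a)    # toy "expectation values" l_i(theta)
dell = -np.sin(thetas[:, None] + a)  # their partial derivatives

# Mean squared error combination f and its gradient via the chain rule:
# dL/dtheta = sum_i (2/N)(l_i - y_i) * dl_i/dtheta
dL = np.sum((2.0 / N) * (ell - y) * dell, axis=1)

lhs = dL.var()
# g_i = max |df/dl_i| <= (2/N)(max|l_i| + |y_i|) = (2/N)(1 + |y_i|)
g = (2.0 / N) * (1 + np.abs(y))
rhs = np.sum(g * np.sqrt(np.mean(dell ** 2, axis=0))) ** 2

assert lhs <= rhs  # the variance of the loss gradient respects the bound
print(lhs, rhs)
```

The bound is loose here, but the point of Theorem 1 is scaling: if every Var[∂_ν ℓ_i] decays exponentially while the g_i stay polynomial, the right-hand side (and hence the loss-gradient variance) decays exponentially as well.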
The direct implication of Theorem 1 can be understood as follows. Consider the case when g_i is at most polynomially increasing in n for all ρ_i in the training set. Then, if a feature of the model causes the linear cost functions to exhibit BPs, L(θ) will also suffer from BPs. That is, if Var[∂_ν ℓ_i(θ)] is exponentially vanishing, then so is Var[∂_ν L(θ)]. Let us now explicitly demonstrate the implications of Theorem 1 for the mean squared error and the negative log-likelihood cost functions.
Corollary 1 (Barren plateaus in mean squared error and negative log-likelihood loss functions). Consider the mean squared error loss function L_mse defined in Eq. (10) and the negative log-likelihood loss L_log defined in Eq. (11), with respective model-predicted labels ỹ_i(θ) and model-predicted probabilities p_i(y_i|θ), both of the form of Eq. (3). Suppose that the QNN has a BP for the linear expectation values, i.e., Eq. (14) is satisfied for all ỹ_i(θ) and p_i(y_i|θ). Then, assuming that ỹ_i(θ) ∈ O(poly(n)) ∀i, we have

Var[∂_ν L_mse(θ)] ∈ O(poly(n)/α^n) ,    (17)

for α > 1. Similarly, assuming that p_i(y_i|θ) ∈ Ω(1/poly(n)) ∀i, we have

Var[∂_ν L_log(θ)] ∈ O(poly(n)/α^n) ,    (18)

for α > 1.
We note that the assumption on ỹ_i(θ) in Corollary 1 is generically satisfied, as label values are usually bounded by a constant, and if not they can always be normalized by classical post-processing. The assumption on the possible values of p_i(y_i|θ) is equivalent to clipping the model-predicted probabilities such that they are strictly greater than zero, which is a common practice when using the negative log-likelihood cost function [64,65].
Corollary 1 explicitly implies that, under mild assumptions, previously established BP results for VQAs are applicable to QML models that utilize the mean squared error and negative log-likelihood loss functions. Consequently, existing QML proposals with these BP-inducing features, such as those summarized in Fig. 2, need to be revised in order to avoid an exponentially flat loss landscape.
We note that, as shown in Appendix D, the above results in Corollary 1 can also be extended to the generalized mean squared error loss function used in multivariate regression tasks and to loss functions based on the Kullback-Leibler divergence used in generative modelling.

B. Dataset-induced barren plateaus
In this section we argue that the dataset can negatively impact QML trainability if the input states to the QNN have high levels of entanglement and if the QNN employs local gates (which is the standard assumption in most widely used QNNs [5-8]). This is due to the fact that reduced states of highly entangled states can be very close to maximally mixed, and it is hard to train local gates acting on such states. Note that this phenomenon is not expected to arise in a standard VQA setting, where the input state is considered to be a trivial tensor-product state.
To illustrate this issue, we will present an example where a VQA does not exhibit a BP, but a QML model can still have a dataset-induced BP. Consider a model where the QNN is given by a single layer of the so-called alternating layered ansatz [36]. Here, E^QNN_θ is a circuit composed of a tensor product of ξ s-qubit unitaries such that n = sξ. That is, the QNN is of the form U(θ) = ⊗_{k=1}^{ξ} U_k(θ_k), where each U_k(θ_k) is a general unitary acting on s qubits. Furthermore, consider a linear loss function constructed from local expectation values of the form

ℓ_i(θ) = (1/n) Σ_{j=1}^n Tr[ (|0⟩⟨0|_j ⊗ 1_{j̄}) ρ_i(θ) ] ,    (19)

where |0⟩⟨0|_j denotes the projector on the j-th qubit, 1_{j̄} the identity on all remaining qubits, and where ρ_i(θ) is defined in Eq. (2). We now quote the following result from Ref. [36] on the variance of the partial derivative of quantities of the form in Eq. (19).
Proposition 1 (From Supplementary Note 5 of Ref. [36]). Suppose that E^QNN_θ is given by the application of a single layer of the alternating layered ansatz consisting of a tensor product of s-qubit unitaries. Consider the partial derivative of the local cost in Eq. (19) taken with respect to a parameter θ_ν appearing in the h-th unitary. If each unitary forms a local 2-design on s qubits, we have

Var[∂_ν ℓ_i(θ)] = r_{n,s} D_HS( ρ_i^{(h)}, 1/2^s ) ,    (20)

where r_{n,s} ∈ Ω(1/poly(n)), and where we denote as

D_HS(ρ, σ) = Tr[(ρ − σ)²]    (21)

the Hilbert-Schmidt distance. Here, 1/2^s is the maximally mixed state on s qubits and ρ_i^{(h)} is the reduced input state on the s qubits acted upon by the h-th unitary.
First, let us remark that Proposition 1 shows that a standard VQA that uses the cost in Eq. (19) (tensor product ansatz and local measurement) does not exhibit a BP. This is due to the fact that when the input state to the VQA is a trivial product state such as |0⟩⟨0|, the reduced states ρ_i^{(h)} are far from maximally mixed, so that the variance in Proposition 1 vanishes at most polynomially in n. However, for a QML setting, Proposition 1, combined with Theorem 1, implies that the QML model is susceptible to dataset-induced BPs even with simple QNNs and local measurements. Specifically, if the reduced input state is exponentially close to the maximally mixed state, then one has a BP. Examples of such datasets are Haar-random quantum states [66] or classical data encoded with a scrambling unitary [41], as the reduced states can be shown to concentrate around the maximally mixed state through Levy's lemma [67].
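This concentration can be observed directly in a small simulation: for Haar-random pure states on n qubits, the reduced state ρ on s = 2 qubits satisfies D_HS(ρ, 1/2^s) = Tr[ρ²] − 2^{−s}, which decays roughly as 2^{−(n−s)} on average. A numpy sketch (the sample count is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
s = 2  # subsystem size

def mean_dhs(n, samples=20):
    """Average D_HS between a Haar-random n-qubit state's s-qubit reduced
    state and the maximally mixed state 1/2^s."""
    dA, dB = 2 ** s, 2 ** (n - s)
    vals = []
    for _ in range(samples):
        # Normalized complex-Gaussian matrix <=> Haar-random pure state,
        # reshaped so rows index the kept subsystem A
        psi = rng.normal(size=(dA, dB)) + 1j * rng.normal(size=(dA, dB))
        psi /= np.linalg.norm(psi)
        rho = psi @ psi.conj().T  # reduced state on s qubits
        # D_HS(rho, 1/2^s) = Tr[rho^2] - 2^{-s}
        vals.append(np.real(np.trace(rho @ rho)) - 1.0 / dA)
    return float(np.mean(vals))

for n in (4, 6, 8, 10):
    print(n, mean_dhs(n))  # decays roughly as 2^{-(n-s)}
```

Through Proposition 1, this exponential approach to the maximally mixed state translates directly into an exponentially vanishing gradient, even though both the ansatz and the measurement are local.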
We note that this phenomenon is similar to entanglement-induced BPs [38,50]. However, in a typical entanglement-induced BP setting, it is the QNN that generates high levels of entanglement in its output states and thus leads to trainability issues. In contrast, here it is the quantum states obtained from the dataset that already contain large amounts of entanglement even before reaching the QNN.
It is important to make a distinction between dataset-induced BPs for classical and quantum data. Specifically, special care must be taken when using classical datasets, as here one can actually choose the embedding scheme, and this choice affects the amount of entanglement that the states ρ_i will have. Such a choice is typically not present for quantum datasets.
For classical datasets, let us make the important remark that (in a practical scenario) the embedding is not able to prevent a BP which would otherwise exist in the model. As discussed in Section II A, the embedding process simply assigns a quantum state ρ_i = E^E_{x_i}(|0⟩⟨0|) to each classical data point x_i. Thus, previously established BP results that hold independently of the input state, such as those for global cost functions [36], deep circuits [35], expressibility [40], and noise [44], will hold regardless of the embedding strategy.
Our previous results further illuminate that the choice of embedding strategy is a crucial aspect of QML models with classical data, as it can affect the model's trainability. As argued in [12], a necessary condition for obtaining quantum advantage is that inner products of data-embedded states should be classically hard to estimate. We note that this is of course not sufficient to guarantee that the embedding is useful. For instance, for a classification task it does not guarantee that states are embedded in distinguishable regions of the Hilbert space [14]. Thus, currently, one has the following criteria on which to design encoders for QML:

1. Classically hard to simulate.

2. Practical usefulness.

From the results presented here, a third criterion should be carefully considered when designing embedding schemes:

3. Not inducing trainability issues.

This opens up a new research direction of trainability-aware quantum embedding schemes.
Here we briefly note that while Proposition 1 was presented for a tensor product ansatz, a similar result can be obtained for more general QNNs such as the hardware efficient ansatz [36] or the quantum convolutional neural network [49]. For these architectures, it is found that Var[∂_ν ℓ_i(θ)] is upper bounded by quantities such as D_HS(ρ_i^{(h)}, 1/2^s). While the form of the upper bound is more complex and cumbersome to report, the dataset-induced BP will still arise for these architectures.

C. Empirical FI matrix in the presence of a barren plateau
Recently, the eigenspectrum of the empirical FI matrix was shown to be related to gradient magnitudes of the QNN loss function [52]. Here we investigate this connection in more detail. Namely, we discuss how a BP in the QML model affects the empirical FI matrix and natural gradient-based optimizers (which employ the FI matrix).
In what follows, we show that under the conditions for which the negative log-likelihood loss gradients vanish exponentially, the matrix elements of the empirical FI matrix F̃(θ), as defined in Eq. (12), also vanish exponentially. This result is complementary to and extends the results in [52]. First, consider the following proposition.

Proposition 2. Under the assumptions of Corollary 1 for which the negative log-likelihood loss function has a BP according to Eq. (14), and assuming that the number of trainable parameters in the QNN is in O(poly(n)), we have

E[ F̃_μν(θ) ] ∈ O(poly(n)/α^n) ,    (22)

for α > 1, where F̃_μν(θ) are the matrix entries of F̃(θ) as defined in Eq. (12).
While Proposition 2 shows that the expectation values of the FI matrix elements vanish exponentially, this result is not enough to guarantee that (with high probability) the matrix elements will concentrate around their expected value. Next we present a stronger concentration result, which is valid for QNNs where the parameter shift rule holds [68,69].
Corollary 2. Under the assumptions of Corollary 1 for which the negative log-likelihood loss function has a BP according to Eq. (14), and assuming that the QNN structure allows for the use of the parameter shift rule, we have, for any δ > 0,

Pr( |F̃_μν(θ)| ⩾ δ ) ∈ O( poly(n)/(δ α^n) ) ,    (23)

for α > 1.

We note that these two results imply that when the linear expectation values exhibit a BP, one requires an exponential number of shots to estimate the entries of the FI matrix, and concomitantly its eigenvalues. This also implies that optimization methods that use the FI, such as natural gradient methods, cannot navigate the flat landscape of the BP (without spending exponential resources).
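The empirical FI matrix of Eq. (12) can be assembled directly from parameter-shift gradients. The sketch below uses a hypothetical two-parameter model p_i(y_i|θ) = (1 + y_i cos(θ_1 + a_i) cos θ_2)/2, chosen only so that the shift rule ∂_ν p = [p(θ_ν + π/2) − p(θ_ν − π/2)]/2 holds (up to the probability clipping), and checks that the resulting matrix is symmetric and positive semi-definite, as any empirical FI matrix must be:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 8
a = rng.uniform(-np.pi, np.pi, size=N)  # toy per-data-point angles
y = rng.choice([-1.0, 1.0], size=N)     # toy labels

def prob(theta, i):
    """Toy model probability p_i(y_i | theta), clipped away from zero
    as is common practice with the negative log-likelihood loss."""
    val = 0.5 * (1.0 + y[i] * np.cos(theta[0] + a[i]) * np.cos(theta[1]))
    return float(np.clip(val, 1e-6, 1.0))

def grad_log_prob(theta, i):
    """Gradient of ln p_i via the parameter-shift rule."""
    g = np.zeros(2)
    for nu in range(2):
        plus, minus = theta.copy(), theta.copy()
        plus[nu] += np.pi / 2
        minus[nu] -= np.pi / 2
        g[nu] = 0.5 * (prob(plus, i) - prob(minus, i)) / prob(theta, i)
    return g

theta = rng.uniform(-np.pi, np.pi, size=2)
# Empirical FI matrix: (1/N) sum_i grad(ln p_i) grad(ln p_i)^T
F = np.zeros((2, 2))
for i in range(N):
    g = grad_log_prob(theta, i)
    F += np.outer(g, g) / N

assert np.allclose(F, F.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(F) >= -1e-12)  # positive semi-definite
```

In a BP, every grad_log_prob entry is exponentially small with high probability, so each outer product (and hence each entry and eigenvalue of F) inherits the exponential suppression described by Proposition 2 and Corollary 2.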

IV. Numerical Results
In this section, we present numerical results studying the trainability of QNNs in supervised binary classification tasks with classical data. Specifically, we analyze the effect of the dataset, the embedding, and the locality of the measurement operator on the trainability of QML models. In what follows, we first describe the datasets, embeddings, QNNs, measurement operators, and loss functions used in our numerics.
We consider two different types of datasets, one composed of structured data and the other of unstructured random data. This allows us to further analyze how the data structure can affect the trainability of the QML model. First, the structured dataset is formed from handwritten digits from the MNIST dataset [70]. Here, greyscale images of "0" and "1" digits are converted to length-n real-valued vectors x_i using a principal component analysis method [71] (see also Appendix F for a detailed description). Then, for the unstructured dataset we randomly generate vectors x_i of length n by uniformly sampling each of their components from [−π, π]. In addition, each random data point is randomly assigned a label.
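The principal-component reduction used for the structured dataset can be sketched with a plain SVD; the data below are random placeholders for the flattened MNIST digits, not the actual dataset, and the target length n = 8 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8                                # target vector length = number of qubits
images = rng.random((100, 28 * 28))  # placeholder for flattened MNIST digits

# PCA via SVD: project the centered data onto the top-n principal components
centered = images - images.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
x = centered @ Vt[:n].T              # length-n feature vectors

# Rescale each component to [-pi, pi] so it can serve as a rotation angle
x = np.pi * x / np.abs(x).max(axis=0)

assert x.shape == (100, n)
```

Each resulting row then plays the role of one classical data point x_i whose components are fed into the embedding circuits described next.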
In all numerical settings that we study, the embedding E^E_{x_i} is a unitary channel acting on n qubits, which we denote as V(x_i). Thus, the output state of the embedding is a pure state of the form |ψ(x_i)⟩ = V(x_i)|0⟩. As shown in Fig. 3, we use three different embedding schemes for V(x_i). The first is the Tensor Product Embedding (TPE), which is composed of single-qubit rotations around the x-axis, so that |ψ(x_i)⟩ is obtained by applying a rotation on the j-th qubit whose angle is the j-th component of the vector x_i. The second embedding scheme is presented in Fig. 3(b) and is called the Hardware Efficient Embedding (HEE). In a single layer of the HEE, one applies rotations around the x-axis followed by entangling gates acting on adjacent pairs of qubits. Finally, we refer to the third scheme as the Classically Hard Embedding (CHE). The CHE was proposed in [12] and is based on the fact that the inner products between output states of V(x_i) are believed to be hard to classically simulate as the depth and width of the embedding circuit increase. The unitaries W(x_i) in each layer of the CHE are composed of single- and two-qubit gates that are diagonal in the computational basis. We refer the reader to Appendix F 2 for a description of the unitaries W(x_i).

FIG. 3. Circuits for embedding unitaries used in our numerics. (a) Tensor Product Embedding (TPE), composed of single-qubit rotations around the x-axis. Here, the encoded state ρ_i is obtained by applying a rotation R_x on the j-th qubit whose angle is the j-th component of the vector x_i. (b) Hardware Efficient Embedding (HEE). A layer is composed of single-qubit rotations around the x-axis, whose rotation angles are assigned in the same way as in the TPE. After each layer of rotations, one applies entangling gates acting on adjacent pairs of qubits. (c) Classically Hard Embedding (CHE). Each unitary W(x_i) is composed of single- and two-qubit gates that are diagonal in the computational basis.
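A minimal statevector construction of the TPE and of one HEE layer is sketched below; the choice of CZ gates as the entanglers is an assumption made only for the illustration, as is the sample input vector:

```python
import numpy as np

def rx(angle):
    """Single-qubit rotation about the x-axis, Rx = exp(-i angle X / 2)."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def tpe_state(x):
    """Tensor Product Embedding: |psi(x)> = (tensor_j Rx(x_j)) |0...0>."""
    psi = np.array([1.0 + 0j])
    for angle in x:
        psi = np.kron(psi, rx(angle) @ np.array([1.0, 0.0]))
    return psi

def hee_layer(x):
    """One Hardware Efficient Embedding layer: TPE rotations followed by
    entanglers on adjacent qubit pairs (CZ is an assumed gate choice)."""
    n = len(x)
    psi = tpe_state(x)
    for j in range(n - 1):
        # Apply CZ on qubits (j, j+1) via its computational-basis diagonal;
        # qubit j corresponds to bit (n-1-j) of the basis-state index z
        diag = np.ones(2 ** n)
        for z in range(2 ** n):
            if (z >> (n - 1 - j)) & 1 and (z >> (n - 2 - j)) & 1:
                diag[z] = -1.0
        psi = diag * psi
    return psi

x = np.array([0.3, 1.1, -0.7, 2.0])  # arbitrary sample data vector
assert np.isclose(np.linalg.norm(tpe_state(x)), 1.0)
assert np.isclose(np.linalg.norm(hee_layer(x)), 1.0)
```

The TPE output is a product state by construction, whereas each additional HEE layer spreads entanglement across the register, which is precisely the property at stake in the dataset-induced BP discussion above.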
For the QNN in the model we consider two different ansatzes. The first is composed of a single layer of parametrized single-qubit rotations R_y about the y-axis.
Here, the output of the QNN is an n-qubit state. The second QNN we consider is the Quantum Convolutional Neural Network (QCNN) introduced in [6]. The QCNN is composed of a series of convolutional and pooling layers that reduce the dimension of the input state while preserving the relevant information. In this case, the output of the QNN is a 2-qubit state. We refer the reader to Appendix F 3 for a more detailed description of the QCNN we used. When using the QNN composed of R_y rotations, we apply a global measurement on all qubits to compute the expectation value of the global operator Z^⊗n. Thus, the predicted label and label probabilities are given by Eqs. (7)-(9). Since global measurements are expected to lead to trainability issues, we also consider a local measurement where one computes the expectation value of Z^⊗2 over the reduced state of the middle two qubits. Note that this is equivalent to computing the parity of the length-2 output bitstrings. On the other hand, when using the QCNN, one naturally has to perform a local measurement, as here the output is a 2-qubit state. Thus, here we also compute the expectation value of Z^⊗2.
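The global and local measurement schemes can be compared directly on a statevector. In the sketch below, a random state (with n = 4 an arbitrary choice) stands in for the QNN output; the global value is ⟨Z^⊗n⟩, and the local value is ⟨Z⊗Z⟩ on the two middle qubits, computed as the parity of the corresponding length-2 substring:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4  # arbitrary even size; the middle qubits are then qubits 1 and 2
d = 2 ** n

# Random placeholder for the QNN output state
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
probs = np.abs(psi) ** 2

# Global measurement: <Z^{⊗n}> = sum_z p(z) (-1)^{parity of z}
full_parity = np.array([(-1.0) ** bin(z).count("1") for z in range(d)])
global_val = float(np.sum(probs * full_parity))

# Local measurement: <Z x Z> on the middle two qubits, i.e. the parity of
# the length-2 substring (qubit j is bit n-1-j of the basis-state index z)
mid = [1, 2]
mid_parity = np.array(
    [(-1.0) ** sum((z >> (n - 1 - j)) & 1 for j in mid) for z in range(d)]
)
local_val = float(np.sum(probs * mid_parity))

print(global_val, local_val)
```

Both quantities are single linear expectation values of the form of Eq. (3); the difference lies only in how many qubits the measured operator acts on non-trivially, which is exactly the locality feature whose effect the numerics probe.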
Finally, we note that in our numerics we study the scaling of the gradient of both the mean squared error and log-likelihood loss functions of Eqs. (10) and (11), respectively. Moreover, we also consider the scaling of the gradients of the linear expectation value in Eq. (3).

A. Global measurement
Here we first study the effect of performing global measurements on the output states of the QNN. As shown in [36], we expect linear functions of global measurements to exhibit BPs. For this purpose we consider both structured (MNIST) and unstructured (random) datasets encoded through the TPE and CHE schemes (see Fig. 3). The QNN is taken to be a tensor product of single-qubit rotations, and we measure the expectation value of Z⊗n. Figure 4 presents results where we numerically compute the variance of the partial derivatives of the linear expectation value in Eq. (3), the mean squared error loss function in Eq. (10), and the negative log-likelihood loss function in Eq. (11) versus the number of qubits n. The variance is evaluated over 200 random sets of QNN parameters, and the dataset is composed of N = 10n data points.
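The exponential decay probed here can be illustrated with a toy model: for a tensor product of Ry(θj) rotations acting on |0…0⟩ with a global Z⊗n measurement, the cost factorizes as Πj cos θj, and the variance of the parameter-shift gradient over random parameters is exactly 2^(−n). A minimal sketch (our own toy calculation, not the paper's simulation code):

```python
import numpy as np

def global_cost(thetas):
    """⟨Z⊗n⟩ for a product of Ry(θ_j) rotations applied to |0...0⟩:
    the expectation value factorizes as Π_j cos(θ_j)."""
    return np.prod(np.cos(thetas))

def grad_variance(n, n_samples=5000, seed=1):
    """Estimate Var[∂⟨Z⊗n⟩/∂θ_1] over random parameter sets, using the
    parameter-shift rule ∂f/∂θ = [f(θ+π/2) - f(θ-π/2)] / 2."""
    rng = np.random.default_rng(seed)
    grads = []
    for _ in range(n_samples):
        th = rng.uniform(0, 2 * np.pi, size=n)
        shift = np.zeros(n)
        shift[0] = np.pi / 2
        grads.append((global_cost(th + shift) - global_cost(th - shift)) / 2)
    return np.var(grads)

# The variance halves with each added qubit (analytically Var = 2^{-n})
vs = [grad_variance(n) for n in (2, 4, 6, 8)]
assert all(vs[i + 1] < vs[i] for i in range(len(vs) - 1))
```

For this separable cost the parameter-shift rule is exact, so the decay of `vs` directly exhibits the barren-plateau scaling that Fig. 4 reports for the full QML loss functions.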
Figure 4 shows that, as expected from [36], the variance of the partial derivative of the linear loss function vanishes exponentially with the system size, indicating the presence of a BP according to Eq. (14). Moreover, the plot shows that for all datasets and embedding schemes considered, the variances of the partial derivatives of the log-likelihood and mean squared error loss functions also vanish exponentially with the system size. This scaling is in accordance with the results of Corollary 1.
Here we note that the presence of BPs can be further characterized through the spectrum of the empirical FI matrix [52]. As mentioned in Section III C, in a BP the magnitudes of the eigenvalues of the empirical FI matrix will decrease exponentially as the number of qubits increases. In Fig. 5(a) and (c) we plot the trace of the empirical FI matrix versus the number of qubits for the same structured and unstructured datasets of Fig. 4. As expected, the trace decreases exponentially with the problem size when using a global measurement, due to the loss function exhibiting a BP. While the trace of the empirical FI provides a coarse-grained study of the eigenvalues, we also show in Fig. 5(b) and (d) representative results for the eigenvalue spectrum distributions of the empirical FI matrix. One can see here that all eigenvalues become exponentially vanishing with increasing system size.
Our results here show that, even for a trivial QNN, and independently of the dataset and embedding scheme, global measurements in the loss function lead to exponentially small gradients, and thus to BPs in QML models. Moreover, we have also verified that the eigenvalues of the empirical FI matrix are, as expected from Corollary 2, exponentially small in a BP, showing that an exponential number of shots is needed to accurately estimate the matrix elements, eigenvalues, and trace of the empirical FI matrix. This precludes the possibility of efficiently estimating quantities such as the normalized empirical FI matrix F(θ)/Tr[F(θ)] [52].

B. Dataset and embedding-induced barren plateaus
Here we numerically study how the embedding scheme and the dataset can potentially lead to trainability issues.
Specifically, we recall from Section III B that highly entangling embedding schemes can lead to reduced states that are concentrated around the maximally mixed state, and thus make local gates harder to train. To check how close the reduced states at the output of the CHE and HEE schemes are to the maximally mixed state, we average the Hilbert-Schmidt distance D_HS(ρ_i^(2), 𝟙/4) between the maximally mixed state and the reduced state ρ_i^(2) of the central two qubits. In addition, we further average over 2000 data points from a structured (MNIST) and an unstructured (random) dataset.
Results are shown in Fig. 6(a) and (d), where we plot D_HS(ρ_i^(2), 𝟙/4) versus the number of qubits for the CHE scheme with different numbers of layers. As expected, here we see that increasing the number of layers in the embedding leads to higher entanglement in the encoded states ρ_i, and thus to reduced states that are closer to the maximally mixed state. Moreover, here we note that the structure of the dataset also plays a role in the mixedness of the reduced state, as the Hilbert-Schmidt distances D_HS(ρ_i^(2), 𝟙/4) for the unstructured random dataset can be up to one order of magnitude smaller than those for the structured dataset.
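The quantity plotted here can be sketched in a few lines of numpy. In the following toy example (our own, with the squared Hilbert-Schmidt norm Tr[(A−B)²] as the distance; the paper's normalization may differ) we compare the reduced state of the central two qubits for a product state and for a Haar-random state, mimicking weakly and strongly entangling embeddings:

```python
import numpy as np

def reduced_middle_two(psi, n):
    """Reduced density matrix of the central two qubits of an n-qubit pure state."""
    a = n // 2 - 1                                  # qubits before the central pair
    t = psi.reshape(2**a, 4, 2**(n - a - 2))
    return np.einsum('aib,ajb->ij', t, t.conj())    # trace out outer qubits

def hs_distance_to_mixed(rho):
    """Squared Hilbert-Schmidt distance Tr[(ρ - 𝟙/4)²] to the maximally mixed state."""
    delta = rho - np.eye(4) / 4
    return np.real(np.trace(delta @ delta))

n = 6
# Product state: its two-qubit reduced state is pure, far from 𝟙/4
plus = np.ones(2) / np.sqrt(2)
psi_prod = plus
for _ in range(n - 1):
    psi_prod = np.kron(psi_prod, plus)
d_prod = hs_distance_to_mixed(reduced_middle_two(psi_prod, n))

# Haar-random state: its reduced state concentrates near 𝟙/4
rng = np.random.default_rng(2)
psi_rand = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi_rand /= np.linalg.norm(psi_rand)
d_rand = hs_distance_to_mixed(reduced_middle_two(psi_rand, n))

assert d_rand < d_prod  # entangling embeddings push reduced states toward 𝟙/4
```

For the product state the distance is exactly 3/4, while for the random state it is already close to the Page-curve value (d_A + d_B)/(d_A d_B) − 1/d_A ≈ 0.06 at n = 6, illustrating the concentration effect discussed above.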
In Fig. 6 we also show the variance of the partial derivative of the log-likelihood loss function as a function of the number of qubits and the number of CHE layers. Here we use both the tensor product QNN (panels (b) and (e)) and the QCNN (panels (c) and (f)), and we compute local expectation values of Z⊗2. Moreover, here the dataset is composed of N = 10n points, and the variance is taken by averaging over 200 sets of random QNN parameters. Since neither the tensor product QNN with a local cost nor the QCNN is expected to exhibit BPs with no training data and separable input states (see [36] and [49], respectively), any unfavorable scaling arising here will be due to the structure of the data or the embedding scheme. To ensure this is the case, we plot two additional quantities as references. The first is obtained for the case when the embedding scheme is simply replaced with the TPE, representing the scenario of a non-entangling encoder. In the second, we also use the TPE encoder, but rather than computing the loss function over the whole dataset, we only compute it over a single data point, i.e., L_log(θ) = −log p_i. Then, for this single-data-point loss function, we study the scaling of the variance of the partial derivative, and we finally average over the dataset, i.e., we compute Σ_i Var[∂ν log p_i]/N. This allows us to characterize the effect of the size of the dataset and the randomness associated with it.
For the unstructured random dataset, we can see that Var[∂ν L(θ)] appears to vanish exponentially for both QNNs and for all considered numbers of layers in the CHE. This shows that the randomness in the dataset ultimately translates into randomness in the loss function, and into the presence of a BP. For the structured dataset, we can see that Var[∂ν L(θ)] does not exhibit an exponentially vanishing behaviour for small numbers of CHE layers. However, as the depth of the embedding increases, the variances become smaller. In particular, when using a QCNN, increasing the number of CHE layers appears to change the behaviour of Var[∂ν L(θ)] towards an exponentially vanishing scaling. Finally, we observe that the variance of the loss function constructed from a single data point is always larger than that of the loss constructed from N data points. This indicates that the larger the dataset, the smaller the variance.
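The dataset-size suppression observed here can be illustrated with an idealized model in which the per-data-point gradient contributions ∂ν log p_i are independent and zero-mean. This is only a caricature (real per-point gradients share the QNN parameters and are correlated), but it shows the 1/N mechanism at work:

```python
import numpy as np

def loss_grad_variance(n_points, n_param_sets=20000, seed=3):
    """Variance over random parameter sets of ∂ν L = (1/N) Σ_i ∂ν log p_i,
    modeling the per-point gradients as independent standard normals
    (an idealization of uncorrelated data-point contributions)."""
    rng = np.random.default_rng(seed)
    per_point = rng.normal(size=(n_param_sets, n_points))
    return np.var(per_point.mean(axis=1))

v1, v10, v100 = (loss_grad_variance(N) for N in (1, 10, 100))
# Averaging over N uncorrelated points suppresses the variance as 1/N
assert v100 < v10 < v1
```

Under this independence assumption Var[∂ν L] = Var[∂ν log p_i]/N, consistent with the single-data-point reference curves always lying above the full-dataset curves in Fig. 6.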
To further study this phenomenon, in Fig. 7 we repeat the calculations of Fig. 6, but using the HEE scheme instead. That is, we show the scaling of D_HS(ρ_k(x_i), 𝟙/4) and Var[∂ν L(θ)] versus the number of qubits for the HEE scheme with different numbers of layers, and for structured (MNIST) and unstructured (random) datasets.
Here, the effect of the entangling power of the embedding on the Hilbert-Schmidt distance D_HS(ρ_k(x_i), 𝟙/4) can be seen in panels (a) and (d) of Fig. 7. Therein one can see that as the number of layers of the HEE increases (and thus also its entangling power [40]), the distance to the maximally mixed state vanishes exponentially with the system size. One can see here that, independently of the structure of the dataset, the large entangling power of the embedding scheme leads to states that are essentially maximally mixed on any reduced pair of qubits. As seen in Fig. 7(b), (c), (e) and (f), this then translates into an exponentially vanishing Var[∂ν L(θ)], and thus a BP.
These results indicate that the choice of dataset and embedding method can have a significant effect on the trainability of the QML model. Specifically, QNNs that have no BPs when trained on trivial input states can have exponentially vanishing gradients arising from either the structure of the dataset or the large entangling power of the embedding scheme. Moreover, these results show that the Hilbert-Schmidt distance can be used as an indicator of how much the embedding can potentially worsen the trainability of the model.

C. Practical usefulness of the embedding scheme and local measurements
As discussed in Section III B, a good embedding satisfies (at least) the following three criteria: it is classically hard to simulate, it is practically useful, and it does not induce trainability issues. In the previous sections, we have studied how the trainability of the model can be affected by the choice of embedding and dataset. Here we point out another subtlety, namely, that "classically hard to simulate" and "practically useful" do not always coincide. In particular, we show here that the CHE scheme can lead to poor performance on a standard benchmarking test.
For this purpose we choose the QCNN architecture, which uses a local measurement and is known not to exhibit a BP [49], to solve the task of classifying handwritten digits '0' and '1' from the MNIST dataset (see also [72] for a similar study using QCNNs for MNIST classification). Then, we compare two choices for the embedding scheme. The first is a classically simulable scheme given by a single layer of the HEE, whilst the other is the (conjectured) classically hard to simulate two-layered CHE.
To make this comparison fair, both encoders are subjected to an identical setting: the QCNN is implemented on n = 8 qubits, we compute the expectation value of Z⊗2, use the log-likelihood loss function, and employ training and testing datasets of respective sizes 400 and 40. In all cases the classical optimizer used is ADAM [73] with a 0.02 learning rate. At each training iteration, the expectation value measured from the QML model is fed into an additional classical node with a tanh activation function.
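The exact form of the classical post-processing node is not spelled out here. A plausible minimal sketch, under our own assumption that the measured expectation value is mapped to a class probability as p = (1 + tanh(w·⟨Z⊗2⟩ + b))/2 with trainable weight w and bias b, is:

```python
import numpy as np

def label_probability(expval, w=1.0, b=0.0):
    """Map an expectation value in [-1, 1] to a class probability through a
    classical node with tanh activation (our assumed form; the paper's
    post-processing may differ in its details)."""
    return 0.5 * (1.0 + np.tanh(w * np.asarray(expval) + b))

def nll_loss(expvals, labels, w=1.0, b=0.0):
    """Negative log-likelihood over the dataset, with labels in {0, 1}."""
    p = label_probability(expvals, w, b)
    p_correct = np.where(np.asarray(labels) == 1, p, 1.0 - p)
    return -np.mean(np.log(p_correct))

# Confident, correct predictions give a smaller loss than incorrect ones
loss_good = nll_loss([0.9, -0.9], [1, 0])
loss_bad = nll_loss([-0.9, 0.9], [1, 0])
assert loss_good < loss_bad
```

The tanh node keeps the probability differentiable in both the quantum expectation value and the classical parameters, so the whole hybrid model can be trained end to end with ADAM.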
In Fig. 8, we show the training loss function and test accuracy versus the number of iterations for 10 different optimization runs using each encoder. We observe that, despite being classically simulable, the model with the HEE has significantly better performance (above 90% test accuracy) than the model with the CHE (around 65% accuracy) on both training and testing for this specific task. Hence, we see a particular embedding example where hard-to-simulate does not translate into practical usefulness.
We emphasize that this result should not be interpreted as the CHE scheme being generally unfavorable for practical purposes. Rather, it shows that additional care should be taken when choosing encoders to suit specific tasks, and it highlights the challenge of designing encoders that satisfy all the criteria necessary for achieving a quantum advantage.

V. Implications for the literature
Here we briefly summarize the implications of our results for the QML literature, specifically the literature on training quantum neural networks.
First, we have shown that features deemed detrimental to training linear cost functions in the VQA framework will also lead to trainability issues in QML models. This is particularly relevant to the use of global observables, such as measuring the parity of the output bitstrings on all qubits, which have been employed in the QML literature. We remark that there is no a priori reason to consider the global parity. One could instead measure a subset of qubits and assign labels via local parities, or even average the local parities across subsets of qubits. As shown in our numerics section, local parity measurements are practically useful, and one can use them to optimize the model and achieve small training and generalization errors.
Second, our results indicate that QML models can exhibit scaling issues due to the dataset. Specifically, when the input states to the QNN have large amounts of entanglement, the QNN's parameters can become harder to train, as local states will be concentrated around the maximally mixed state. This is particularly relevant when dealing with classical data, as here one has the freedom to choose the embedding. This points to the fact that the choice of embedding needs to be carefully considered, and that trainability-aware encoders should be prioritized and developed.
Unfortunately for the field of QML, data embeddings cannot solve BPs by themselves. In other words, as proven here, the choice of embedding cannot in practice mitigate the effect of BPs or prevent a BP that would otherwise exist for the QNN. For instance, the embedding cannot prevent a BP arising from the use of a global measurement, or from the use of a QNN that forms a 2-design. Hence, while embeddings can introduce a novel source of BPs, they cannot cure a BP that a particular QNN suffers from.
Finally, we show that optimizers relying on the FI matrix, such as natural gradient descent, require an exponential number of measurement shots to be useful in a BP. This is due to the fact that the matrix elements of the empirical FI matrix are exponentially small in a BP. Hence, quantities such as the normalized empirical FI matrix, which have been employed in the literature, are also inaccessible without incurring an exponential cost.

VI. Discussion
Quantum Machine Learning (QML) has received significant attention due to its potential for accelerating data analysis using quantum computers. A near-term approach to QML is to train the parameters of a Quantum Neural Network (QNN), which consists of a parametrized quantum circuit, in order to minimize a loss function associated with some dataset. The data can be either quantum or classical (as shown in Fig. 1), with classical data requiring a quantum embedding map. While this novel, general paradigm for data analysis is exciting, there are still very few theoretical results studying the scalability of QML. This lack of theoretical results motivates our work, in which we focus on the trainability and gradient scaling of QNNs.
In the context of trainability, most of the previous results have been derived for the field of Variational Quantum Algorithms (VQAs). While VQAs and QML models share some similarities, in that both train parametrized quantum circuits, there are some key differences that make it difficult to directly apply VQA trainability results to the QML setting. In this work, we bridged the gap between the VQA and QML frameworks by rigorously showing that gradient scaling results from VQAs also hold in a QML setting. This involved connecting the gradients of linear cost functions to those of the mean squared error and log-likelihood cost functions.
In light of our results, many QML proposals in the literature would need to be revised if they aim to be scalable. For instance, we rigorously proved that features deemed detrimental for VQAs, such as global measurements or deep unstructured (and thus highly entangling) ansatzes, should also be avoided in QML settings. These results hold regardless of the data embedding, and hence one cannot expect the data embedding to solve a barren plateau issue associated with a QNN.
Moreover, due to the use of datasets, we discovered a novel source of barren plateaus in QML loss functions. We refer to this as a Dataset-Induced Barren Plateau (DIBP). DIBPs are particularly relevant when dealing with classical data, as here additional care must be taken when choosing an embedding scheme: a poor embedding choice could lead to a DIBP. Until now, a "good" embedding was one that is classically hard to simulate and practically useful. However, our results show that a third criterion must be added for the encoder: not inducing gradient scaling issues. This paves the way towards the development of trainability-aware embedding schemes.
Our numerical simulations verify the DIBP phenomenon, as therein we show how the gradient scaling can be greatly affected both by the structure of the dataset and by the choice of the embedding scheme. Furthermore, our results illustrate another subtlety that arises when using classical data, as the classically-hard-to-simulate embedding of [12] leads to large generalization error on a standard MNIST classification task. Thus, "classically hard to simulate" and "practically useful" do not always coincide for an encoder.
Taken together, our results illuminate some subtleties in training QNNs in QML models, and show that more work needs to be done to guarantee that QML schemes will be trainable, and thus useful for practical applications.
For the result on the reverse KL divergence, we note that the loss function is a sum of functions f(P(x; θ), x) = −P(x; θ) log[Q(x)/P(x; θ)]. Substituting this into Eq. (15) of Theorem 1 yields a bound in which |X| is the cardinality of X. Under the assumption that Q(x), P(x; θ) ∈ [b, 1] for all x and θ, where b ∈ Ω(1/poly(n)), and that Eq. (14) is satisfied for P(x; θ) for all x, we obtain Eq. (D7) as required.
We remark that, similar to the negative log-likelihood case, the assumption that Q(x) and P(x; θ) are clipped to values strictly greater than zero is common in practice when using loss functions based on the KL divergence [65].

E. Fisher information matrix results
In this section we show that, under the conditions for which the logarithmic loss function exhibits a BP, the matrix elements of the FI matrix vanish exponentially with high probability.

Proposition 1. Under the assumptions of Corollary 1, for which the negative log-likelihood loss function has a BP according to Eq. (14), and assuming that the number of trainable parameters in the QNN is in O(poly(n)), the matrix entries Fµν(θ) of the empirical FI matrix F(θ), as defined in Eq. (12), vanish exponentially with high probability.
Proof. From the proof of Theorem 1 we have a bound that holds for all i. We note that the expectation value is similarly bounded, where the first equality is an application of the chain rule, the second equality is due to the definition of the covariance, the first inequality is due to the Cauchy-Schwarz inequality, and in the final equality we have used the definition of g_i in Eq. (B2). This enables us to write a further bound where, again, the first two inequalities come from the definition of the covariance and an application of the Cauchy-Schwarz inequality, and in order to obtain the third inequality we have used Eqs. (E2) and (E6). We now consider the negative log-likelihood loss function, for which the ℓ_i = p_i are probabilities and f(ℓ_i, y_i) = log p_i for all i. Then, if our assumptions are satisfied, namely that Eq. (14) holds for all p_i(θ), and that p_i(θ) ∈ [b, 1] for all i and θ with b ∈ Ω(1/poly(n)), we obtain the claimed bounds for all θµ and θν. By inspecting Eq. (E10) and Eq. (E21) and using Chebyshev's inequality, the result follows.

1. Assuming the log-likelihood loss has a BP is insufficient to show exponentially small FI matrix elements

To prove the above results, we made the assumption that the linear expectation values have exponentially vanishing variances of their partial derivatives (i.e., they display a barren plateau). Here we show that relaxing this assumption, so that only the negative log-likelihood loss function has a plateau, is insufficient to guarantee exponentially vanishing elements of the FI matrix. We remark that this is an artefact of the fact that in QML, the loss functions are constructed from many data points. Consider the case where the dataset consists of two data points {(x_1, y_1), (x_2, y_2)}. The log-likelihood in this case reads

L_log(θ) = −(1/2)[log p(y_1|x_1; θ) + log p(y_2|x_2; θ)].

In the presence of barren plateaus in the log-likelihood loss landscape, we have

Var[∂ν L_log(θ)] = E[(∂ν L_log(θ))²] = (1/4) E[(∂ν log p(y_1|x_1; θ) + ∂ν log p(y_2|x_2; θ))²],   (E25)

where in the first equality we have used the fact that E[∂ν L_log] = 0 [36, 49, 50]. From Eq. (E25) it is clear that the barren plateau condition (Var[∂ν L_log(θ)] being exponentially vanishing) can be satisfied with the per-point gradients {∂ν log p(y_i|x_i; θ)}_{i∈{1,2}} having exponentially similar magnitudes but opposite signs across the landscape. Hence, in order to guarantee exponentially small FI matrix elements, it is not sufficient to assume only that the negative log-likelihood loss function has a barren plateau. Rather, extra assumptions have to be made, for instance that the linear expectation values have a barren plateau, as assumed in Corollary 1.
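This cancellation argument can be checked numerically. In the following toy sketch (our own illustration, not from the paper) we take the two per-point gradients to be exactly opposite across the landscape, so that the loss-gradient variance vanishes while the diagonal entry of the empirical FI matrix, (1/N) Σ_i (∂ν log p_i)², remains of order one:

```python
import numpy as np

rng = np.random.default_rng(4)
g = rng.normal(size=10000)           # ∂ν log p(y1|x1; θ) sampled across the landscape
g1, g2 = g, -g                       # per-point gradients with opposite signs

loss_grad = 0.5 * (g1 + g2)          # ∂ν L_log (up to an overall sign)
fisher_diag = 0.5 * (g1**2 + g2**2)  # diagonal empirical FI entry (1/N) Σ_i (∂ν log p_i)²

assert np.isclose(np.var(loss_grad), 0.0)  # loss gradient is identically flat: a "BP"
assert np.mean(fisher_diag) > 0.5          # yet the FI matrix element stays Θ(1)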

F. Numerical implementations
In this section we provide technical details for our numerical results.

Dimensional reduction of features on the MNIST dataset
Here we describe explicitly how the images in the MNIST dataset are reduced to length-n real-valued vectors using principal component analysis (PCA) [71]. Consider a dataset {x_i, y_i}_{i=1}^N with N data points, such that each input x_i is a vector of length 784 (obtained by vectorizing the gray-scaled 28 × 28 image). We perform PCA on the dataset with the following steps.
1. Compute the average of the input data points x_avg = Σ_i x_i / N.

2. Normalize each individual input data point as x̃_i = x_i − x_avg for all i.

3. Construct a matrix of the normalized dataset X = (x̃_1, x̃_2, ..., x̃_N)^T of size N × 784, where the i-th row of X is the i-th normalized input data point, x̃_i^T.

4. From the matrix X, construct the square matrix X^T X = Σ_i x̃_i x̃_i^T of size 784 × 784 and perform an eigenvalue decomposition of X^T X.

5. Keep the n eigenvectors corresponding to the largest n eigenvalues of X^T X. These are the principal components of the dataset. Construct a matrix M of size 784 × n whose column vectors are these n eigenvectors.

6. Compute XM, leading to a new data matrix of size N × n. This matrix multiplication projects each individual input data point onto the subspace spanned by the n principal components.
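The steps above can be sketched in a few lines of numpy (a minimal illustration, with random data standing in for the vectorized MNIST images):

```python
import numpy as np

def pca_reduce(X, n_components):
    """PCA reduction following the steps above: center the data, eigendecompose
    X^T X, keep the top-n eigenvectors, and project."""
    X_centered = X - X.mean(axis=0)                 # steps 1-3: build centered X
    cov = X_centered.T @ X_centered                 # step 4: 784 x 784 matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    M = eigvecs[:, -n_components:][:, ::-1]         # step 5: top-n components, 784 x n
    return X_centered @ M                           # step 6: N x n projection

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 784))   # stand-in for 100 vectorized 28x28 images
X_reduced = pca_reduce(X, 8)
assert X_reduced.shape == (100, 8)
```

Since `np.linalg.eigh` returns eigenvalues in ascending order, the slice-and-reverse on its eigenvectors yields the principal components sorted by decreasing variance.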

FIG. 9. CHE architecture. One layer of the embedding comprises an RZ rotation on each qubit, where the rotation angle on the j-th qubit corresponds to the j-th component of xi, followed by a series of two-qubit ZZ gates on all pairs of qubits. For each pair of qubits j and k, the two-qubit ZZ gate encodes the product of the j-th and k-th components of xi.

CHE architecture
Here, we describe in detail the architecture of the CHE in Fig. 3(c). The embedding, originally proposed in [12], is based on the Instantaneous Quantum Polynomial (IQP) architecture, which takes the form U_IQP = H^⊗n U_Z H^⊗n, where H is the Hadamard gate and U_Z is an arbitrary random diagonal unitary acting on all qubits. Sampling from the output distribution of this IQP circuit has been analytically shown to be classically hard, and was proposed as one of the early approaches for demonstrating quantum supremacy [74]. In a similar fashion, the CHE scheme has been conjectured to be classically hard to simulate for more than two layers. As shown in Fig. 9, one layer consists of Hadamard gates on all qubits followed by the data-encoding unitary W(x_i), which consists of single- and two-qubit gates diagonal in the computational basis. Specifically, given that x_i has length n, W(x_i) is defined as

W(x_i) = Π_{j=1}^{n} e^{−i x_i^{(j)} Z_j} Π_{j<k} e^{−i x_i^{(j)} x_i^{(k)} Z_j Z_k},

where x_i^{(j)} is the j-th component of x_i, and Z_j is the Pauli-Z operator on the j-th qubit. We note that e^{−i x_i^{(j)} Z_j} is a single-qubit rotation on the j-th qubit encoding the j-th component of x_i, and e^{−i x_i^{(j)} x_i^{(k)} Z_j Z_k} is a two-qubit ZZ gate that encodes the product of the j-th and k-th components of the data. The two-qubit ZZ gates act on all possible pairs of qubits.
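Under the definition above, one layer of the CHE can be sketched as a dense matrix for small n (our own toy construction, exploiting the fact that all factors of W(x_i) are diagonal in the computational basis and hence commute):

```python
import numpy as np

def che_layer_diagonal(x):
    """Diagonal of W(x) = Π_j e^{-i x_j Z_j} Π_{j<k} e^{-i x_j x_k Z_j Z_k},
    evaluated entrywise on computational-basis states."""
    n = len(x)
    phases = np.zeros(2**n)
    for idx in range(2**n):
        bits = [(idx >> (n - 1 - j)) & 1 for j in range(n)]
        z = np.array([1 - 2 * b for b in bits])     # Z eigenvalues ±1
        phase = np.dot(x, z)                        # single-qubit terms
        for j in range(n):
            for k in range(j + 1, n):
                phase += x[j] * x[k] * z[j] * z[k]  # two-qubit ZZ terms
        phases[idx] = phase
    return np.exp(-1j * phases)

def che_layer(x):
    """One CHE layer: Hadamards on all qubits followed by W(x)."""
    n = len(x)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    Hn = H
    for _ in range(n - 1):
        Hn = np.kron(Hn, H)
    return np.diag(che_layer_diagonal(x)) @ Hn

U = che_layer(np.array([0.3, 0.7, 0.1]))
assert np.allclose(U.conj().T @ U, np.eye(8))  # unitarity check
```

Because every factor of W(x) is diagonal, only the 2^n phases need to be accumulated; the construction is exponential in n and meant purely to make the definition concrete.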

QCNN architecture
We now provide details on the QCNN architecture used in this work. QCNNs are motivated by the structure of classical convolutional neural networks (which in turn are motivated by the structure of the visual cortex). Thus, in a QCNN, a series of convolutional layers are interleaved with pooling layers, which reduce the number of degrees of freedom while preserving the relevant features of the input state [6]. Effectively, after each pooling layer, the number of remaining qubits is reduced by (about) half. Thus, for an initial input of n qubits, the total depth of the QCNN is O(log(n)). As illustrated in Fig. 10, the convolutional layer is comprised of two layers of parametrized two-qubit unitary blocks acting on alternating pairs of nearest-neighbor qubits. In the pooling layer, we apply CNOT gates on pairs of nearest-neighbor qubits, where the controlled qubits are discarded for the later parts of the circuit. We note that, although one could consider more general measurements and controlled unitaries in the pooling layers, we do not consider such generalizations in our numerics. Finally, after reducing down to two qubits, another parametrized unitary block (as in the convolutional layer) together with parametrized single-qubit RX rotation gates is applied before the measurement.
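The halving structure of the pooling layers, and the resulting O(log n) depth, can be sketched as follows (a minimal bookkeeping illustration, not the paper's circuit code):

```python
import math

def qcnn_layer_widths(n):
    """Qubit counts after each pooling layer: each layer keeps (about) half
    of the qubits until two remain, giving O(log n) pooling layers."""
    widths = [n]
    while widths[-1] > 2:
        widths.append(math.ceil(widths[-1] / 2))
    return widths

assert qcnn_layer_widths(8) == [8, 4, 2]  # the n = 8 case illustrated in Fig. 10
```

For n = 1024 qubits the sequence has 9 pooling steps, consistent with the logarithmic depth stated above.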
FIG. 10. QCNN architecture. We illustrate how the QCNN circuit is constructed for 8 qubits. After all convolutional and pooling layers, the system is reduced down to 2 qubits. Then, one applies a final layer comprised of a parametrized 2-qubit unitary block and parametrized single-qubit RX rotations before performing measurements. We also present the structure of the parametrized 2-qubit unitary blocks utilized in this work.

FIG. 1 .
FIG. 1. Schematic diagram of a QML task. Consider a supervised learning task where the training dataset is either (a) classical or (b) quantum. The classical data points are of the form {xi, yi}, where xi ∈ X are input data (e.g., pictures) and yi ∈ Y are labels associated with each xi (e.g., cat/dog). An embedding channel E_xi maps the classical data onto quantum states ρi. Alternatively, quantum data points are of the form {ρi, yi}, where ρi are quantum states in a Hilbert space H, each with associated labels yi ∈ Y (e.g., ferromagnetic/paramagnetic phases). The quantum states (coming from classical or quantum datasets) are then sent through a parametrized quantum neural network (QNN), E_QNN.

FIG. 2 .
FIG. 2. Summary of results. (a) In this work we bridge the gap between trainability results for the VQA and QML settings. Namely, we show that the considered class of supervised QML models will exhibit a barren plateau (BP) in all settings where standard VQAs do. This means that features such as global measurements, deep circuits, or highly entangling circuits should be avoided in QML. (b) We present analytical and numerical evidence that aspects of the dataset can be an additional source of BPs. For instance, embedding schemes for classical data can lead to states that have highly mixed reduced states and thus display trainability issues. This is a novel source of BPs, which we call dataset-induced BPs.

FIG. 4 .
FIG. 4. Variance of the partial derivative versus number of qubits. Here we consider the linear expectation value, log-likelihood, and mean squared error loss functions with a global measurement. In (a)-(b) we consider an unstructured (random) dataset, while in (c)-(d) a structured (MNIST) dataset. The classical data is encoded via the TPE in (a) and (c), and via the CHE in (b) and (d). We plot the variance of the partial derivative versus the number of qubits for all loss functions. The partial derivative is taken with respect to the first parameter of the tensor product QNN.

FIG. 5 .
FIG. 5. Trace and spectrum of the Fisher Information matrix in a BP. The top panels correspond to an unstructured (random) dataset, and the bottom panels to a structured (MNIST) dataset. In both cases the data is encoded using the CHE scheme and then sent through the tensor product QNN with a global measurement. In (a) and (c), we plot the variance of the partial derivatives of the log-likelihood loss function, and the expectation value of the trace of the empirical FI matrix, versus the number of qubits n. Here, the expectation values are taken over 200 different sets of QNN parameters. In (b) and (d), we show the eigenvalue distributions of the empirical FI matrix for increasing numbers of qubits.

FIG. 6 .
FIG. 6. Effect of the CHE scheme with local measurement on trainability. The top panels correspond to an unstructured (random) dataset, and the bottom panels to a structured (MNIST) dataset. In all cases we used the CHE scheme with different numbers of layers. Panels (a) and (d) show the Hilbert-Schmidt distance D_HS(ρ_i^(2), 𝟙/4) versus the number of qubits, where ρ_i^(2) is the reduced state of the central two qubits. Panels (b) and (e) show the variances of the partial derivative of the log-likelihood loss function versus the number of qubits n for the tensor product QNN, with panels (c) and (f) for a QCNN. We also plot as reference the variances using the non-entangling TPE scheme when the loss is evaluated over the dataset and over a single data point.

FIG. 7 .
FIG. 7. Effect of the HEE scheme with local measurement on trainability. The top panels correspond to an unstructured (random) dataset, and the bottom panels to a structured (MNIST) dataset. In all cases we use the HEE scheme with increasing numbers of layers. Panels (a) and (d) show the Hilbert-Schmidt distance D_HS(ρ_k(x_i), 𝟙/4) versus the number of qubits, with the remaining panels showing the variance of the partial derivative of the log-likelihood loss function.

FIG. 8 .
FIG. 8. Loss function (a) and test accuracy (b) versus number of iterations for MNIST classification using a QCNN. We train the QML model with n = 8 qubits using two different embedding schemes (1 HEE layer and 2 CHE layers) for the task of binary classification between digits '0' and '1' from the MNIST dataset. Solid lines represent the average over 10 instances with different initial parameters, and shaded areas represent the range over these instances.
