Quantum Machine Learning for Particle Physics using a Variational Quantum Classifier

Quantum machine learning aims to harness the power of quantum computing to improve machine learning methods. By combining quantum computing methods with classical neural network techniques we aim to improve performance in solving classification problems. Our algorithm is designed for existing and near-term quantum devices. We propose a novel hybrid variational quantum classifier that combines the quantum gradient descent method with steepest gradient descent to optimise the parameters of the network. By applying this algorithm to a resonance search in di-top final states, we find that this method has a better learning outcome than a classical neural network or a quantum machine learning method trained with a non-quantum optimisation method. The classifier's ability to be trained on small amounts of data indicates its benefits in data-driven classification problems.


Introduction
To discover new physics at the LHC, highly complex rare signal events have to be separated from a large number of Standard Model background events. Novel reconstruction techniques often rely on machine learning algorithms which show an outstanding ability to find correlations in high-dimensional parameter spaces to discriminate signal from background processes. In collider phenomenology, the feature space on which the machine learning methods are trained to classify events into signal and background usually consists of physical observables of reconstructed objects, e.g. the transverse momentum of a jet $p_{T,j}$ or the total amount of missing transverse energy $\slashed{E}_T$. The most popular machine learning techniques in recent years are artificial neural networks (NN), which are built on three pillars: i. an adaptable complex system that allows approximating a complicated function, ii. the calculation of a loss function in the output layer, which defines the task the NN algorithm should perform by minimising this function, and iii. a way to update the network continuously while minimising the loss function, e.g. through backpropagation.
In NNs the adaptable system in (i) consists of a variable number of layers made of interconnected neurons. The neurons receive inputs from previous layers in terms of weights and a bias, which are then processed as arguments of an activation function. Due to its modular setup and variable complexity an NN can be trained to perform a large spectrum of tasks *. Quantum machine learning is an emergent research field which aims to harness the power of quantum computing to improve machine learning methods. At the moment a full quantum neural network, where all three pillars are combined in an algorithm that is entirely built on the principles of quantum information processing, is not attainable. However, with present or at least near-term quantum devices, dedicated quantum algorithms can support pillars (i)-(iii) individually in the form of a hybrid quantum machine learning approach.
In relation to NNs, novel techniques are being developed and applied in a beneficial way for each of the pillars (i)-(iii) above. For concreteness, to support (i), quantum nodes can be connected with each other to form a variational circuit [8][9][10] or added to a classical neural network in a hybrid approach [11,12]. For (ii), the loss function can be calculated using a quantum algorithm, e.g. in variational quantum algorithm approaches [13,14]. For (iii), one can minimise the loss function using a quantum annealer [15,16] or a quantum optimisation algorithm [17,18].
Thus, by combining quantum computing methods with an NN and applying this approach to tackle challenges in particle physics, we aim to foster an increase of performance associated with quantum computing algorithms †, which can then translate into an improved sensitivity in searches for novel physical phenomena. The most important task for a machine learning application in particle physics is classification. To pursue this task using quantum machine learning, we construct a novel hybrid neural network, based on a quantum variational classifier. Quantum variational classifiers are known to have an advantage in model size compared to classical neural networks [10]. This allows us to augment the optimisation process of our hybrid network using the quantum gradient descent (QGD) method, which is inspired by the natural gradient descent method. Such complex optimisation methods are often computationally prohibitive for deep neural networks. Variational quantum classifiers are structurally very similar to classical neural networks and therefore provide an instructive framework to discuss to what extent (i)-(iii) of classical NNs can be augmented using quantum computing elements.
Specifically, we use a quantum neural network approach, i.e. trainable quantum nodes connected in a circuit, and also include classical neural network elements, i.e. a bias term.

* Applications range from playing Go [1] or Chess [2], over classification and image recognition [3,4] to natural language processing [5] and generative algorithms [6]. Beyond classification and the regression of data points, NNs can also be used to find solutions to functionals and integro-differential equations [7].

† Relevant to particle physics, a recent surge in proposals has emerged in how quantum computing can provide benefits for a variety of tasks. Quantum annealers, for example, perform continuous time quantum computations and are therefore well-suited to study the dynamics of quantum systems, even quantum field theories [19,20], and to solve optimisation problems [21]. Quantum gate computers are in particular a popular choice to calculate multi-particle processes [22][23][24][25][26][27][28][29][30][31][32], often with field theories mapped onto a discrete quantum walk [33][34][35][36] or a combined hybrid classical/quantum approach [37][38][39][40].
During training, we use a modified quantum optimisation algorithm, based on quantum gradient descent [18], designed to account for the classical elements of our model. We apply this method to the search for a $Z'$ resonance which decays to a pair of top quarks [41][42][43]. This provides a timely and realistic playground for a phenomenologically relevant classification problem. Samples of top quark pairs where one top quark decays hadronically and the other leptonically can be purified to a very high degree, i.e. the confidence that one trains on a pure $t\bar{t}$ sample is very high. Although $t\bar{t}$ production results in jet and lepton-rich final states, for the purpose of a transparent discussion of how variational quantum classifiers can be used to support searches for new physics, we limit ourselves to only two feature variables as input to the NNs, i.e. the transverse momentum of the hardest bottom quark $p_{T,b_1}$ and the amount of missing transverse energy in the event $\slashed{E}_T$. Extending the feature space is conceptually straightforward and will improve the network's ability to discriminate between signal and background. However, it will increase the size of the network and the number of qubits, which would prevent us from running our hybrid quantum neural network on a real quantum device.
The paper is structured as follows: Section 2 is dedicated to a pedagogical overview of variational quantum classifiers (VQC) and of how VQCs contribute to pillars (i) and (ii) in the context of quantum machine learning algorithms. Section 3 addresses pillar (iii), where we introduce various optimisation methods applicable to the training of the NN during backpropagation. In Section 4 we outline the technical setup for the analysis on pseudodata. Subsequently, as described in Section 5, we train and test two different quantum machine learning models. One will use an entirely classical approach of gradient descent while the other is trained using a quantum optimisation method. To provide a baseline we compare to a classical neural network. Finally, we provide a summary and concluding remarks in Section 6.

Structure of a Variational Quantum Classifier
Variational quantum classifiers are a form of quantum neural network that can be used for supervised learning. This is achieved by designing a quantum circuit that behaves similarly to a traditional machine learning algorithm. The quantum machine learning algorithm contains a circuit which depends on a set of parameters that, through training, will be optimised to reduce the value of a loss function. This trained circuit is described in functional form by

$$ y = f(w, b, x), \tag{2.1} $$

where $f$ is the network, $y$ is the network output used to calculate the loss function $L$, $w$ and $b$ are the trainable parameters and $x$ is the input data. Thus, the structure of a VQC shares many similarities with a traditional neural network. In both cases, the network $f$ is built from discrete modular blocks (nodes in a classical neural network, quantum gates in a quantum circuit), and the two share the techniques used for training. Our classifier is designed as a circuit-centric quantum classifier [10]. It is structurally depicted in Figure 1 and consists of three parts: (1) the state preparation circuit, (2) the model circuit and (3) the measurement and postprocessing.

Figure 1: A variational classifier described by 3 parts. The state preparation circuit is designed to take our input, $x \in \mathbb{R}^n$, and encode it in an N-qubit quantum state. The model circuit will apply trainable and non-trainable gates to this state. In the final steps we measure the states and apply any postprocessing necessary. This model is inspired by circuit-centric quantum classifiers [10].

These three parts of our model can be related in turn to the three pillars of machine learning, discussed in Section 1. Our classifier corresponds to (i) a complex adaptable system that (ii) calculates the value of a loss function. The continuous adaptation of the parameters $w$ and $b$, after obtaining $y$ through a measurement with the aim to continually reduce the loss function $L$, directly relates to the network optimisation of (iii).
More specifically, the state preparation step, shown in Figure 2, encodes the input data into an N-qubit quantum state. In a classical algorithm data is carried by bits, whereas on a quantum computer it is carried by qubits. A qubit is a 2-state quantum system which can be parametrised by

$$ |\psi\rangle = \alpha_0 |0\rangle + \alpha_1 |1\rangle, \qquad |\alpha_0|^2 + |\alpha_1|^2 = 1. \tag{2.2} $$

The state of Eq. (2.2) can be visualised as a vector on the Bloch sphere. By performing operations on a qubit one rotates the vector on the Bloch sphere. Circuits can be constructed to act on several qubits, where a general 2-qubit state lives in the tensor product of two 1-qubit spaces, $|\psi\rangle = \alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle$.
The model circuit is constructed from gates that evolve the input state. The circuit is based on unitary operations and depends on external parameters which will be adjusted during training.
Finally, the postprocessing step measures the state. Traditionally, we measure the output of the first qubit. This step will also include any classical postprocessing we may wish to include.

State Preparation
Before applying the model circuit of our classifier, we use a state preparation circuit $S_x$ to encode the input data into a quantum state. $S_x$ acts on the initial state $|\phi\rangle = |0\rangle^{\otimes n}$. The number of qubits $n$ is defined by the number of features in our dataset.
The parametrisation of the encoding can affect the decision boundaries of the classifier and can therefore be chosen in a form that suits the problem at hand [44]. Here, we use the so-called angle encoding. Practically, this amounts to using the input data, $x$, as angles in a unitary quantum gate: the state preparation circuit applies such a gate to each qubit, with the corresponding feature $x_i$ as its angle.
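As an illustration of angle encoding, the following minimal NumPy sketch prepares the encoded state for a two-feature input. The choice of a rotation about the y axis and the half-angle convention are assumptions made for this example; the function names `ry` and `angle_encode` are hypothetical and not taken from the analysis code.

```python
import numpy as np

def ry(angle):
    """Single-qubit rotation about the y axis (half-angle convention, an assumption)."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]])

def angle_encode(x):
    """Return S_x |0...0> for a feature vector x, using one qubit per feature."""
    state = np.array([1.0])
    for xi in x:
        qubit = ry(xi) @ np.array([1.0, 0.0])  # rotate |0> by the feature value
        state = np.kron(state, qubit)           # tensor product over the qubits
    return state

x = np.array([0.0, np.pi])   # two features, already scaled to [0, pi]
encoded = angle_encode(x)    # amplitudes ordered as |00>, |01>, |10>, |11>
```

With this convention a feature value of 0 leaves the qubit in |0> and a value of π maps it to |1>, so the example input is encoded as the basis state |01>.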

Model Circuit
Given a prepared state, $|x\rangle$, the model circuit, $U(w)$, maps $|x\rangle$ to another vector $|\psi\rangle = U(w)|x\rangle$. In turn $U(w)$ consists of a series of unitary gates and can be decomposed as

$$ U(w) = U_{l_\text{max}}(w_{l_\text{max}}) \cdots U_2(w_2)\, U_1(w_1), $$

where every $U_l(w_l)$ is a layer in the circuit, with its corresponding weight parameters, and $l_\text{max}$ is the maximum number of layers. These are constructed from a set of single and two qubit gates which will evolve the state $|x\rangle$. The gates include parameters that will be trained during the optimisation of the network. A single qubit gate can be written as a $2 \times 2$ unitary matrix that depends on three angles $\alpha$, $\beta$, $\gamma$ and an overall phase factor $e^{i\phi}$. We can neglect $e^{i\phi}$ as it only gives rise to a global phase that has no measurable effect. Thus, the parameters $\alpha$, $\beta$ and $\gamma$ suffice to parametrise a single qubit gate. We use a rotation gate, $R$, and CNOT in our model. The rotation gate is a single qubit gate that is applied to both qubits in our system. This gate is designed to rotate our state based on a set of learnable parameters $w = (\alpha, \beta, \gamma)$, Eq. (2.9). The angles of Eq. (2.9) are a subset of the parameters in the weight vector $w \in \mathbb{R}^{n \times 3 \times l}$, where $n$ is the number of qubits and $l$ is the number of layers in our network. This object, $w$, will contain some of the parameters that will be learned during training. The number of qubits will mirror the number of features in our dataset whereas $l$ is a hyperparameter we can tune. In the circuit-centric design we are using, the number of qubits is held constant. However, the model could be extended, or other frameworks used, for a more flexible network design [9].
Each layer in our model contains two CNOT gates, a standard 2-qubit gate in quantum computing with no learnable parameters. A CNOT, if used alongside a Hadamard gate, could be used to introduce entanglement into our circuit. These gates flip the state of a target qubit based on the value of a control qubit ‡. Each gate in the layer uses a different qubit as the control qubit. The model circuit of the VQC used here is shown in Figure 2.
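The layered structure described above can be sketched as a small statevector simulation. The decomposition of the rotation gate as RZ(γ) RY(β) RZ(α) and the ordering of the two CNOT gates within a layer are assumptions made for this example; the helper names are hypothetical.

```python
import numpy as np

def rz(a):
    return np.array([[np.exp(-0.5j * a), 0], [0, np.exp(0.5j * a)]])

def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rot(alpha, beta, gamma):
    """General single-qubit rotation, assumed here to be RZ(gamma) RY(beta) RZ(alpha)."""
    return rz(gamma) @ ry(beta) @ rz(alpha)

# Basis ordering |q0 q1>: |00>, |01>, |10>, |11>
CNOT_01 = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                    [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)  # control q0, target q1
CNOT_10 = np.array([[1, 0, 0, 0], [0, 0, 0, 1],
                    [0, 0, 1, 0], [0, 1, 0, 0]], dtype=complex)  # control q1, target q0

def layer(w_l):
    """One layer: a rotation on each qubit, then two CNOTs with different controls."""
    rotations = np.kron(rot(*w_l[0]), rot(*w_l[1]))
    return CNOT_10 @ CNOT_01 @ rotations

def model_circuit(weights):
    """Compose the layers; weights has shape (n_layers, n_qubits, 3)."""
    U = np.eye(4, dtype=complex)
    for w_l in weights:
        U = layer(w_l) @ U   # later layers act after earlier ones
    return U

rng = np.random.default_rng(0)
weights = rng.uniform(0, 2 * np.pi, size=(2, 2, 3))  # 2 layers, 2 qubits, 3 angles
U = model_circuit(weights)
```

Because every building block is unitary, the composed circuit is unitary as well, which is a useful sanity check when changing the number of layers.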

Measurement and Postprocessing
After applying $U(w)$ to the initial state we need to measure its output. We do this by applying the Pauli Z operator on the first qubit and taking the expectation value

$$ \pi(w, x) = \langle x |\, U^\dagger(w)\, O\, U(w)\, | x \rangle, \tag{2.10} $$

where $O = \sigma_z \otimes \mathbb{I}^{\otimes(n-1)}$. To obtain an estimate we run the circuit repeatedly. The number of repetitions is known as the number of shots (S). Classical postprocessing is applied to the expectation value of the circuit before returning a final classifier output. As in a classical neural network approach, the postprocessing step gives the user a great deal of flexibility to tackle the problem as they see fit.

‡ The controlled NOT (CNOT) gate is a two-qubit gate that can be used to entangle and disentangle quantum states. The matrix representation of a CNOT gate is

$$ \text{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}. $$
Generally, it will include the addition of any bias terms, the drawing of a classification decision boundary, the calculation of a loss function and the optimisation procedure.
The bias term $b$ will be a trainable parameter. Its introduction increases model flexibility and ensures the classifier output is continuous. We can write the output of our model, before thresholding, by combining the expectation value of the model circuit $\pi(w, x)$ and the bias term $b$:

$$ f(w, b, x) = \pi(w, x) + b. \tag{2.11} $$

A decision boundary is drawn to separate the value of $f(w, b, x)$ into the two classes. The binary classification result, $\text{cls}(w, b, x)$, is obtained by assigning each event to the class on its side of this boundary. The final steps, the calculation of the loss function and the optimisation procedure, will be discussed in Section 3.
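The measurement and postprocessing chain can be sketched as follows. The threshold of 0 for the decision boundary is an assumption chosen to match target labels of ±1, and the function names are hypothetical. The sampled estimate mimics running the circuit for a finite number of shots.

```python
import numpy as np

Z = np.diag([1.0, -1.0])

def expectation_z0(state):
    """Exact expectation value of sigma_z on the first qubit."""
    n_qubits = state.size.bit_length() - 1
    O = np.kron(Z, np.eye(2 ** (n_qubits - 1)))
    return float(np.real(np.vdot(state, O @ state)))

def sampled_expectation_z0(state, shots, rng):
    """Estimate the same expectation value from a finite number of shots."""
    probs = np.abs(state) ** 2
    probs = probs / probs.sum()
    outcomes = rng.choice(state.size, size=shots, p=probs)
    first_qubit_is_one = outcomes >= state.size // 2   # first qubit is the leading bit
    return float(np.mean(np.where(first_qubit_is_one, -1.0, 1.0)))

def classifier_output(state, bias):
    """Model output before thresholding: circuit expectation value plus bias."""
    return expectation_z0(state) + bias

def classify(f_value, threshold=0.0):
    """Binary decision; the threshold value is an assumption of this sketch."""
    return 1 if f_value > threshold else -1

state = np.array([np.sqrt(0.8), 0.0, np.sqrt(0.2), 0.0])
exact = expectation_z0(state)
estimate = sampled_expectation_z0(state, shots=8192, rng=np.random.default_rng(1))
label = classify(classifier_output(state, bias=-0.1))
```

The statistical uncertainty of the sampled estimate falls off with the inverse square root of the number of shots, which is why a large shot count is helpful when the output sits close to the decision boundary.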

Optimisation
As alluded to above, during training we aim to find the values of w and b to optimise a given loss function. We can perform optimisation on a quantum neural network similar to how it is done on a classical neural network. In both cases, we perform a forward pass of the model and calculate a loss function. We can then backpropagate over the network and update our trainable parameters. This is the equivalent of the third pillar of machine learning, mentioned in Section 1.
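During training we use the mean squared error (MSE) as loss function §. This allows us to find a distance between our predictions and truth, represented by the value of the loss function

$$ L = \frac{1}{m} \sum_{i=1}^{m} \left[ y_i^\text{truth} - f(w, b, x_i) \right]^2, \tag{3.1} $$

where $y_i^\text{truth}$ is the true label of event $x_i$ and $m$ is the number of events. We will train our model using vanilla gradient descent and quantum gradient descent [18].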
The latter is a quantum optimisation algorithm designed to be performed on a hybrid network such as the one we have proposed above. Further, we will exploit the advantage in model size of a variational quantum classifier compared to a classical neural network to improve its backpropagation method.

§ Often, for classification tasks using classical neural networks, the cross entropy is a preferred measure for the loss function. We find the difference between cross entropy and MSE to be irrelevant for the application discussed in Sec. 5 and therefore follow the choice for the loss function of Refs. [10,45].

Backpropagation
To perform backpropagation for a network with adjustable parameters $\theta = (w, b)$ we compute the change of its output when varying $\theta$ as the gradient $\frac{\partial}{\partial\theta} f$. For quantum circuits the gradient of the network output is calculated using the parameter-shift rules [46,47]. Being able to calculate gradients for quantum circuit outputs opens up the possibility of using gradient descent methods to train our variational quantum circuit. The methodology is identical to how optimisation and training are performed on classical neural networks.
For the parameter-shift rules to be correctly applied to a quantum circuit certain conditions must be met. We can represent a unitary gate in the form

$$ U(\theta) = e^{-i \theta V}, \tag{3.2} $$

where $\theta$ are our network parameters and $V$ is the Hermitian generator of $U(\theta)$. For a circuit $f$ that includes gates that can be represented in the form of Eq. (3.2), if $V$ has two distinct eigenvalues, the parameter-shift rules provide the relation

$$ \frac{\partial}{\partial\theta} f = r \left[ f(\theta + s) - f(\theta - s) \right], \tag{3.3} $$

where the shift $s = \pi/(4r)$. The value of $r$ is fixed by the eigenvalues $\pm r$ of the generator $V$; for the rotation gates in our implementation $r = 1/2$. Following Eq. (3.3) we can calculate gradients over quantum gates by shifting the parameters. As the difficulty of calculating $\frac{\partial}{\partial\theta} f$ has been reduced to probing the quantum circuit at different parameter points, it is now possible to evaluate the gradient quickly and efficiently on a quantum device.
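A minimal check of the parameter-shift rule can be made on a one-qubit toy circuit, chosen here purely for illustration: a y rotation acting on |0>, followed by a measurement of σ_z. Its output is cos θ, so the shifted evaluations should reproduce the analytic derivative −sin θ. The function names are hypothetical.

```python
import numpy as np

Z = np.diag([1.0, -1.0])

def ry(theta):
    """exp(-i theta Y / 2): the generator Y/2 has eigenvalues +-1/2, so r = 1/2."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def circuit(theta):
    """Expectation value of sigma_z after RY(theta) acts on |0>; equals cos(theta)."""
    state = ry(theta) @ np.array([1.0, 0.0])
    return float(state @ Z @ state)

def parameter_shift_gradient(f, theta, r=0.5):
    """Gradient from two circuit evaluations at shifted parameter values."""
    s = np.pi / (4 * r)
    return r * (f(theta + s) - f(theta - s))

theta = 0.7
grad = parameter_shift_gradient(circuit, theta)
```

Unlike a finite-difference estimate, the shift here is large (π/2) and the result is exact up to the statistical noise of the circuit evaluations.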

From Gradient Descent to Quantum Gradient Descent
The geometry of the parameter space has a direct impact on the reliability and efficiency of an optimisation algorithm [48]. Thus, a suitable choice of optimisation strategy is a key performance factor for a variational quantum circuit. It is an open question as to what is the best form of parameter space to use and whether $\ell_2$ Euclidean geometry is an appropriate choice for variational models [49].
For our problem at hand, we propose to augment the vanilla gradient descent method, often used in classical neural networks, by the quantum gradient descent method [18].
In the vanilla gradient descent method, the network parameter vector $\theta_t$ at iteration step $t$ is updated with the goal that $\theta_{t+1}$ results in a smaller loss function $L(\theta)$. Thus, one approach is to update $\theta_t$ in the direction of the steepest decline, $-\nabla L(\theta)$, weighted by the learning rate $\eta$,

$$ \theta_{t+1} = \theta_t - \eta \nabla L(\theta). \tag{3.4} $$
However, this optimisation is performed on the geometry of an $\ell_2$ vector space, which influences the performance and how new parameters are found. While all parameters are updated with the same step size, the rate at which the loss function changes for each model parameter can vary greatly. By using this form of gradient descent it is possible to miss the global minimum in the space of the loss function. An improvement would be to change the coordinate system to ensure the loss function changes consistently with each step for each parameter, or to find an optimisation method that is invariant under re-parametrisation. One way to address this problem is the use of natural gradient descent, which makes use of the Fisher Information Matrix [50,51] and is a classical extension of the vanilla gradient descent method. The parameters of a network (the weights and biases) exist on a parameter space that has a Riemannian geometry. The Fisher Information Matrix is the metric that defines this space. Since this metric includes information on the geometric structure of the Riemannian space of the network parameters, its inclusion into the gradient descent optimisation leads the network to learn more effectively. In addition, it is invariant under re-parametrisation, and thus advantageous in finding an effective parametrisation. Algorithmically, natural gradient descent can be written as

$$ \theta_{t+1} = \theta_t - \eta F^{-1} \nabla L(\theta), \tag{3.5} $$

where $F$ is the Fisher Information Matrix. In each optimisation step, the parameters are updated in the direction of steepest descent of the information geometry rather than the Euclidean $\ell_2$ geometry. Although the inclusion of $F^{-1}$ in Eq. (3.5) in general improves the performance of the optimisation algorithm, in most classical deep neural networks calculating the inverse of a large matrix becomes computationally prohibitive. However, in our hybrid network, which benefits from a small model size, the parameter space is rather small.
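The effect of the metric can be illustrated on a toy problem that is not part of the analysis: a linear Gaussian model with two features of very different scale. For a unit-variance Gaussian likelihood the Fisher Information Matrix is $X^T X / n$, and a single natural gradient step with $\eta = 1$ lands on the least-squares solution, whereas vanilla gradient descent is limited by the steepest direction. All data below are synthetic and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2)) * np.array([1.0, 10.0])  # second feature is 10x larger
theta_true = np.array([2.0, -0.3])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def loss(theta):
    """Negative log-likelihood of the unit-variance Gaussian model, up to a constant."""
    return 0.5 * np.mean((y - X @ theta) ** 2)

def grad(theta):
    return -X.T @ (y - X @ theta) / n

F = X.T @ X / n   # Fisher Information Matrix of this toy model

# Vanilla gradient descent: one common step size for both parameters
theta_gd = np.zeros(2)
for _ in range(20):
    theta_gd = theta_gd - 0.01 * grad(theta_gd)

# Natural gradient descent: a single step with eta = 1
theta_ngd = np.zeros(2) - np.linalg.solve(F, grad(np.zeros(2)))
```

The learning rate of the vanilla update has to stay small enough for the steep direction to remain stable, which leaves the shallow direction far from converged after the same number of steps.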
Thus, our aim is to use a quantum optimisation equivalent of this method that we can use on variational circuits. The parameter space of quantum states does indeed have a geometry that can be described by an invariant metric. Similar to how the Fisher Information Matrix is used to promote the gradient descent method to the natural gradient descent method, the Fubini-Study metric $g$, derived and elaborated on in Appendix A, exploits the geometric structure of the variational quantum classifier's parameter space to establish the quantum gradient descent method. Here, the optimisation algorithm reads [18]

$$ \theta_{t+1} = \theta_t - \eta\, g^{+} \nabla L(\theta), \tag{3.6} $$

where $g^{+}$ is the pseudo-inverse of the Fubini-Study metric. We implement this algorithm using the PennyLane package [52], which allows us to find the steepest descent in the parameter space of the quantum states. The approach of Eq. (3.6) is designed to optimise the parameters of the quantum variational circuit only, i.e. the quantum gates with trainable parameters $w = (\alpha, \beta, \gamma)$. To perform a full optimisation of our hybrid model we need to consider the classical component of our model, the bias. Thus, we propose to optimise our weights using quantum gradient descent

$$ w_{t+1} = w_t - \eta\, g^{+} \nabla_w L(w, b), \tag{3.7} $$

while using vanilla gradient descent for the classical bias term $b$. Calculating both gradients at each optimisation step ensures that our entire range of parameters is optimised simultaneously.
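The hybrid update can be sketched on a one-parameter toy model, which is an assumption made for illustration and not the two-qubit classifier of this paper: a single y rotation on |0> followed by a σ_z measurement, plus a classical bias. For this circuit the Fubini-Study metric is the constant 1/4. The weight is updated with the metric-weighted gradient and the bias with the plain gradient; all names are hypothetical.

```python
import numpy as np

def model(w, b):
    """<sigma_z> after RY(w[0]) acts on |0>, plus the classical bias b."""
    return np.cos(w[0]) + b

def loss(w, b, y):
    """Squared error for a single event with target label y."""
    return (model(w, b) - y) ** 2

def gradients(w, b, y):
    shift = np.array([np.pi / 2])
    # parameter-shift rule with r = 1/2 for the circuit parameter
    dmodel_dw = 0.5 * (model(w + shift, b) - model(w - shift, b))
    residual = 2.0 * (model(w, b) - y)
    return residual * np.array([dmodel_dw]), residual

def hybrid_step(w, b, y, g, eta):
    """Metric-weighted update for the weights, vanilla update for the bias."""
    grad_w, grad_b = gradients(w, b, y)
    w_new = w - eta * np.linalg.pinv(g) @ grad_w
    b_new = b - eta * grad_b
    return w_new, b_new

g = np.array([[0.25]])                    # Fubini-Study metric of RY acting on |0>
w, b, y = np.array([1.0]), 0.0, -1.0      # toy starting point and a background label
initial_loss = loss(w, b, y)
for _ in range(30):
    w, b = hybrid_step(w, b, y, g, eta=0.05)
final_loss = loss(w, b, y)
```

In a realistic circuit the metric depends on the quantum state at each layer and has to be re-evaluated at every step, while the bias update needs no quantum resources at all.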

Analysis Setup
The background and signal samples used here consist of $pp \to t\bar{t}$ events and $pp \to Z' \to t\bar{t}$ events, respectively. The background events have been generated with a centre-of-mass energy of 14 TeV. When the top quarks are decayed we force one top quark to decay hadronically and the other leptonically. A heavy new boson, $Z'$ [54], is used as signal, with a mass of 2 TeV and a width chosen to be 89.6 GeV [55]. As for the background, one top quark decays hadronically and the other leptonically. For all events, a cut of $p_T > 500$ GeV is placed on the transverse momentum of the top quarks. All events are generated using MadGraph5_aMC@NLO [53] while the parton showering and hadronisation are performed with Pythia 8.2.
Using the Cambridge-Aachen algorithm [60] the hadrons and the non-isolated leptons are clustered into jets with radius $R = 1.0$. This is based on work using fat jets to reconstruct highly boosted top quarks [56][57][58][59]. Using FastJet [61] the $k_T$ algorithm is implemented to recluster the hardest two fat jets into jets with radius $R = 0.2$. Based on proximity to a B-meson, jets are b-tagged while requiring them to have a transverse momentum $p_T > 30$ GeV. We also demand any isolated leptons to have a transverse momentum $p_T > 10$ GeV.
The selection of these events is then based on numerous criteria. For the two fat jets in an event, one must contain at least one b-jet while the other must contain at least two light jets and one b-jet. The events must also contain a minimum of one isolated lepton and are required to have a scalar-summed transverse momentum of H T > 1 TeV.
In the following, the analysis is based exclusively on the transverse momentum of one b-jet ($p_{T,b_1}$) and the event's missing transverse energy ($\slashed{E}_T$). We show the distributions of these observables in Figure 3 and heatmaps in Figure 4.
Our data $x$ is normalised using min-max scaling such that $x_\text{scaled} \in [0, \pi]$. This allows our features to be encoded as an angle in a qubit rotation when we begin training. The target labels are defined as $-1$ for the background set and $1$ for the signal set.
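The scaling step can be sketched as below. The feature values are made-up illustrative numbers, not events from the analysis, and the function name is hypothetical. The optional `lo` and `hi` arguments allow the minimum and maximum found on one sample to be reused on another.

```python
import numpy as np

def scale_to_angle(features, lo=None, hi=None):
    """Min-max scale each feature column to the interval [0, pi]."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0) if lo is None else lo
    hi = features.max(axis=0) if hi is None else hi
    return np.pi * (features - lo) / (hi - lo)

# Columns: b-jet transverse momentum and missing transverse energy, in GeV (illustrative)
events = np.array([[50.0, 20.0],
                   [300.0, 120.0],
                   [175.0, 70.0]])
labels = np.array([-1, 1, -1])   # -1 for background, 1 for signal
angles = scale_to_angle(events)
```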

Network Performance
We compare three models: a classical neural network trained with standard gradient descent (NN-GD), a VQC trained with standard gradient descent (VQC-GD) and a VQC trained with our quantum gradient descent method (VQC-QGD) of Sec. 3.2.
The VQC model consists of two qubits, corresponding to the two features $p_{T,b_1}$ and $\slashed{E}_T$, and two layers. Each layer has a rotation gate for each qubit followed by two CNOT gates. We implement this model, depicted in Fig. 2, using PennyLane [52] and train it for 30 epochs with a batch size of 32 events and an initial learning rate of $\eta = 0.01$. During training, for all models, we reduce the learning rate whenever the loss plateaus. However, learning rate reduction, in this instance, appears to have little effect on the performance of the network during training. The network's poor capacity to discriminate signal from background reflects the similarity between the two. Figure 4 shows the probability density for the events to populate areas in the feature space $(p_T, \slashed{E}_T)$. The similarity between signal and background prevents the networks from benefiting from a continuous learning rate reduction, for classical NNs and our hybrid method alike.
We anticipate that a significant advantage of the variational quantum classifier lies in its smaller network structure, which allows us to employ computationally more expensive optimisation algorithms, as detailed in Sec. 3, which in turn give rise to faster learning. Such a method would be particularly advantageous in cases where one has to train directly on a limited amount of data, e.g. rare decays or processes with a small production cross section.
Thus, to compare the networks' ability to learn quickly, we limit ourselves to a total of 2500 events for the signal and background samples respectively. We impose a 60-20-20 split between training, validation and test sets, i.e. we train on 1500 events. To get an understanding of the effect the size of the training sample has on the model performance, we train a second set of models using only 500 events each. While we carry out the training on PennyLane's built-in simulator throughout, we test the models' performance on the PennyLane simulator, the IBM Q simulator ¶ and IBM Q Yorktown ‖ . Accessing the IBM hardware was done through PennyLane's Qiskit plugin [62,63]. For all backends, in training and testing, we use a total of 8192 shots.
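The 60-20-20 split can be sketched with a random permutation of the event indices. The function name and the fixed seed are assumptions of this example.

```python
import numpy as np

def split_indices(n_events, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle event indices and cut them into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_events)
    n_train = int(round(fractions[0] * n_events))
    n_val = int(round(fractions[1] * n_events))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(2500)
```

For 2500 events this gives 1500 training events and 500 events each for validation and testing, with no event appearing in more than one set.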
To provide a baseline we trained a classical neural network with a vanilla gradient descent optimiser. To provide a fair and instructive comparison the network has been constructed to have a similar number of trainable parameters as the variational classifier model. The network consists of one hidden layer with 3 nodes and a ReLU activation function. The rest of the hyperparameters match those used to train the variational classifier. To implement the network we used Keras [64] with a TensorFlow backend [65].
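The parameter counts of the two models can be compared with a short calculation. The single output node with its own bias in the classical network is an assumption of this sketch, since the text only specifies the hidden layer; the function names are hypothetical.

```python
def vqc_parameter_count(n_qubits, n_layers, angles_per_gate=3, n_bias=1):
    """Three rotation angles per qubit and layer, plus the classical bias term."""
    return n_qubits * angles_per_gate * n_layers + n_bias

def dense_parameter_count(layer_sizes):
    """Weights and biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

vqc_params = vqc_parameter_count(n_qubits=2, n_layers=2)   # 2 * 3 * 2 + 1
nn_params = dense_parameter_count([2, 3, 1])                # (2*3 + 3) + (3*1 + 1)
```

Under these assumptions both models have 13 trainable parameters.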
We found that training a classical network of this size was unstable, sometimes resulting in the loss plateauing around 1 and the network being unable to classify the samples. To account for the instability we saw during training of the classifier we ran each model 15 times. The results presented in Figure 5 show the average loss from these runs, for each model. We see from Figure 5 that optimisation using the quantum gradient leads to a faster convergence than using the traditional gradient descent optimisation and the classical neural network.
Out of each of the three sets of 15 trained models, we chose one whose loss had converged during training to a value similar to the average. These models were used for testing. Figure 6 shows the ROC curve for the chosen VQC-QGD, VQC-GD and NN-GD models. Table 1 shows the performance of the quantum gradient descent method when the test data is applied to it. We see that the model, trained on the simulator, still performs well on the real hardware. In Figure 6 we see an example of the variational classifier output before the decision boundary is applied and the ROC curve for each model.
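A ROC curve of the kind referred to above can be built from the classifier output before thresholding by sweeping the decision boundary. The scores below are made-up illustrative numbers, not results from the analysis; the function names are hypothetical and ties between scores are not treated specially in this sketch.

```python
import numpy as np

def roc_curve(scores, labels):
    """Signal efficiency against background efficiency as the threshold is lowered."""
    order = np.argsort(-scores)                  # highest score first
    labels_sorted = labels[order]
    tp = np.cumsum(labels_sorted == 1)           # signal events above the threshold
    fp = np.cumsum(labels_sorted == -1)          # background events above the threshold
    tpr = np.concatenate([[0.0], tp / tp[-1]])
    fpr = np.concatenate([[0.0], fp / fp[-1]])
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal integration of the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

scores = np.array([0.9, 0.4, 0.1, -0.3])   # illustrative classifier outputs
labels = np.array([1, -1, 1, -1])          # 1 for signal, -1 for background
fpr, tpr = roc_curve(scores, labels)
auc = area_under_curve(fpr, tpr)
```

In this example three of the four signal-background pairs are ordered correctly, which corresponds to an area under the curve of 0.75.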

Conclusions
One of the tasks with paramount importance for searches of new physics at collider experiments is the design of methods to distinguish rare signal events from large Standard Model backgrounds. In recent years increasing effort was dedicated to developing novel machine learning methods to help find correlations in high-dimensional parameter spaces.
Harnessing the advantages found in quantum computing and combining them with classical neural networks to form a hybrid approach would provide another way to continue the improvement of these algorithms, possibly already accessible on near-term devices. Quantum machine learning is an emergent research field that aims to apply these benefits to machine learning. To explore the potential quantum advantage that could come along with quantum machine learning we propose a novel hybrid neural network, based on a variational quantum classifier. Variational quantum classifier models are in many ways analogous to classical neural networks. An advantage that a VQC provides over a classical neural network is its small model size. The model proposed uses a quantum algorithm equivalent of natural gradient descent. Typically, due to the need to invert large matrices, natural gradient descent is computationally prohibitive on deep neural networks. However, thanks to the model-size advantage of the VQC we can make use of quantum gradient descent to optimise our network.
Thus, we combine the use of quantum gradient descent to optimise the quantum gate parameters in the model while using classical gradient descent to optimise the classical bias term. This model was used to perform a $Z'$ resonance search. We compared the performance against a purely classical neural network and a VQC optimised with standard gradient descent. The hybrid approach proved successful in maximising the learning outcome: it learns faster than an equivalent classical neural network or the classically trained VQC. Even on small data samples the hybrid VQC still retains a high classification ability. While we applied this methodology to generated data, we believe this approach can prove useful in data-driven classification problems where only a small amount of data is available.

A The Fubini-Study metric and the Quantum Geometric Tensor
Geometric quantum mechanics states that the traits of a quantum system can be described by geometric features on a complex projective space. In this space, there is an invariant metric tensor, the Fubini-Study tensor (FST), that can be used to describe distances between quantum states [66][67][68]. The FST can be found by taking the real part of the Quantum Geometric Tensor (QGT).
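We will give a general introduction to this tensor, before briefly discussing how it can be approximated on real hardware and how it relates to our VQC. We will construct the QGT by investigating the distance between the two states $|\psi_0(\theta)\rangle$ and $|\psi_0(\theta + d\theta)\rangle$, where $\psi$ is a general wave function. We can write the probability of an excitation when the parameter is changed from $\theta$ to $\theta + d\theta$ as

$$ ds^2 \equiv 1 - |\langle \psi_0(\theta) | \psi_0(\theta + d\theta) \rangle|^2. \tag{A.1} $$

The amplitude of a state being excited from $|\psi_0(\theta)\rangle$ to $|\psi_n(\theta)\rangle$ can be written as

$$ a_n = \langle \psi_n(\theta + d\theta) | \psi_0(\theta) \rangle, \tag{A.2} $$

whereas the probability for a transition between states to occur can be found, to leading order in $d\theta$, by evaluating

$$ ds^2 = \sum_{n \neq 0} |a_n|^2 = \sum_{i,j} G_{ij}\, d\theta_i\, d\theta_j, \tag{A.3} $$

where $G_{ij}$ is the Quantum Geometric Tensor, defined as

$$ G_{ij} = \langle \partial_i \psi_0 | \partial_j \psi_0 \rangle - \langle \partial_i \psi_0 | \psi_0 \rangle \langle \psi_0 | \partial_j \psi_0 \rangle. \tag{A.4} $$

This tensor therefore signifies the distance between the two quantum states [69]. The Fubini-Study metric is the real part of this tensor, $g_{ij}(\theta) = \text{Re}[G_{ij}(\theta)]$. We can view the Fubini-Study metric as a distance measure between the wave functions, or as a transition probability between the states [67].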
The tensor $G_{ij}$ can be calculated on quantum hardware [18]. We consider a variational circuit where each layer $l$ is parametrised by $\theta_l$ and includes gates $U(\theta_l)$. These gates $U$ and their functional relation to the Hermitian generator matrix $V$ are described in Section 2.2 and Eq. (3.2), resulting in the relations

$$ \partial_i U(\theta_l) = -i V_i U(\theta_l), \tag{A.5} $$

$$ \langle \partial_i \psi_\theta | \partial_j \psi_\theta \rangle = \langle \psi_l | V_i V_j | \psi_l \rangle, \tag{A.6} $$

where $V_i$ and $V_j$ are Hermitian generator matrices. From Eq. (A.5) we can find

$$ i \langle \psi_\theta | \partial_j \psi_\theta \rangle = \langle \psi_l | V_j | \psi_l \rangle. \tag{A.7} $$

By considering both Eq. (A.6) and Eq. (A.7), a representation of the Quantum Geometric Tensor can be formed for a block of parameters that exist in layer $l$,

$$ G^l_{ij} = \langle \psi_l | V_i V_j | \psi_l \rangle - \langle \psi_l | V_i | \psi_l \rangle \langle \psi_l | V_j | \psi_l \rangle. \tag{A.8} $$

The quantum states $\psi_l$ can be determined experimentally from the variational quantum classifier. Importantly, this approximation of the QGT also allows us to find the Fubini-Study metric by taking the real part, such that $g^l_{ij} = \text{Re}[G^l_{ij}]$. To calculate the inverse, we use the Moore-Penrose pseudo-inverse

$$ g^{+} = (g^T g)^{-1} g^T. \tag{A.9} $$

The pseudo-inverse allows us to find a generalised inverse even if the matrix cannot be inverted. In cases where the matrix is invertible the pseudo-inverse and the matrix inverse are identical.
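The block formula of Eq. (A.8) can be evaluated numerically for a small example. The state and generators below are chosen for illustration only: a Bell state with the generators $\sigma_z/2$ on each qubit, for which the metric turns out to be singular. Note that the closed form $(g^T g)^{-1} g^T$ requires $g^T g$ to be invertible, so this sketch instead relies on `numpy.linalg.pinv`, which computes the pseudo-inverse from a singular value decomposition. The function name is hypothetical.

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0]).astype(complex)

def quantum_geometric_tensor(state, generators):
    """G_ij = <V_i V_j> - <V_i><V_j> for commuting generators within one layer."""
    expect = [np.vdot(state, V @ state) for V in generators]
    n = len(generators)
    G = np.zeros((n, n), dtype=complex)
    for i, Vi in enumerate(generators):
        for j, Vj in enumerate(generators):
            G[i, j] = np.vdot(state, Vi @ Vj @ state) - expect[i] * expect[j]
    return G

# Generators of z rotations on qubit 0 and qubit 1 (illustrative choice)
generators = [np.kron(Z, I2) / 2, np.kron(I2, Z) / 2]
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)

g = np.real(quantum_geometric_tensor(bell, generators))   # Fubini-Study metric block
g_pinv = np.linalg.pinv(g)                                # Moore-Penrose pseudo-inverse
```

Here every entry of `g` equals 1/4, so the matrix has rank one and no ordinary inverse, while the pseudo-inverse still exists and has every entry equal to 1.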