
A kernel-based quantum random forest for improved classification

  • Research
  • Published in Quantum Machine Intelligence

Abstract

The emergence of quantum machine learning (QML) to enhance traditional classical learning methods has seen various limitations to its realisation. There is therefore an imperative to develop quantum models with unique model hypotheses to attain expressional and computational advantage. In this work, we extend the linear quantum support vector machine (QSVM) with kernel function computed through quantum kernel estimation (QKE), to form a decision tree classifier constructed from a decision-directed acyclic graph of QSVM nodes—the ensemble of which we term the quantum random forest (QRF). To limit overfitting, we further extend the model to employ a low-rank Nyström approximation to the kernel matrix. We provide generalisation error bounds on the model and theoretical guarantees to limit errors due to finite sampling on the Nyström-QKE strategy. In doing so, we show that we can achieve lower sampling complexity when compared to QKE. We numerically illustrate the effect of varying model hyperparameters and finally demonstrate that the QRF is able to obtain superior performance over QSVMs, while also requiring fewer kernel estimations.


Data Availability

Data is available upon reasonable request. The QRF model Python code used to obtain results can be accessed from the following repository (Srikumar 2022).


Acknowledgements

The authors would like to thank Mario Kieburg for useful discussions. The large number of simulations required for this work were made feasible through access to the University of Melbourne’s High Performance Computer, Spartan (Meade et al. 2017). The code was written using Google Quantum AI’s open-source framework Cirq, and numerical results were obtained using its associated simulator.

Funding

MS is supported by the Australian Government Research Training Program (RTP) Scholarship. CDH is partially supported by the Laby Foundation research grant. We acknowledge the support provided by the University of Melbourne through the establishment of an IBM Network Quantum Hub.

Author information


Contributions

MS conceived the QRF approach and carried out the mathematical development, computations and analysis under the supervision of CH and LH. MS wrote the paper with input from all authors.

Corresponding author

Correspondence to Maiyuren Srikumar.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1.   Background in machine learning

This work incorporates two well-known machine learning (ML) algorithms: the support vector machine (SVM) and the random forest (RF). However, before we introduce these methods, we first identify some of the terminology that is used throughout this work and is common in the ML literature.

We start with the concepts of supervised and unsupervised learning. Supervised models are those that take in for training a set of pairs, \(\{(\textbf{x}_i, y_i)\}_{i=1}^{N}\), where \(\textbf{x}_i \in \mathbb {R}^D\) is a D-dimensional data vector (also referred to as an instance) and \(y_i \in \mathcal {Y}\) its associated class label. A binary classification model, with \(\mathcal {Y} = \{-1, 1 \}\) labelling the two possible classes, is an example of a supervised method where previously labelled data is used to make predictions about unlabelled instances. Unsupervised models, on the other hand, involve finding patterns in data sets that do not have associated labels, \(\{\textbf{x}_i\}_{i=1}^{N}\). An example is clustering, which seeks underlying patterns that group subsets of instances. This work, however, primarily focuses on the former, and hence the remainder of this supplementary document refers only to supervised algorithms.

The training stage of a model concerns the optimisation of internal model parameters that are obtained algorithmically. However, there are—in many cases—hyperparameters that must be selected manually prior to training. A common example is regularisation terms that force the model to behave in a certain way; they are used to ensure that the model does not overfit to the training data. The terms under- and over-fitting are often used to describe the ways in which a model can fall short of making optimal predictions. Under-fitting occurs when the model does not have the capacity to fit the underlying pattern, which may occur, for example, if there are not enough parameters in the model. Overfitting, on the other hand, is often associated with having an excess of parameters, where the model optimises too closely towards the training data set and is thereby unable to generalise its predictions to unseen instances.

1.1 1.   Decision trees and random forests

In this section, we give a brief overview of the classical random forest (RF) model and its associated hyperparameters. The classical method closely resembles the QRF proposed in the paper, diverging at the implementation of the split function. Hence, it is important that one is familiar with the classical form in order to understand the implementation of the QRF. The section starts with an introduction to the decision tree (DT) before constructing the RF as an ensemble of DTs. We subsequently discuss the ways in which the RF is enhanced to ensure an uncorrelated ensemble, and the implications this has for the QRF algorithm.

1.1.1 a.   Decision trees

Decision trees are supervised ML algorithms that can be employed for both classification and regression purposes. In this work, we focus on the former and give an overview of the implementation. As discussed in the main text, the DT has a directed tree structure that is traversed to obtain a classification. A data point begins at the root node and chooses a particular branch based on the outcome of a split function, f. At this point, it is important to identify the differences between the phases of training and prediction. Training starts with a group of instances at the root node and, while traversing down the nodes, trains their associated split functions to return the splitting of the data that best distinguishes instances of different classes. Prediction, on the other hand, simply involves the traversal of the tree that was constructed at training. The number of splits at each node of a DT is usually selected manually and is therefore a hyperparameter of the model. However, a binary split is more natural for the QRF, and therefore we only consider the case where each node of the DT branches into two. This splitting process continues until a leaf condition is met. These are the three conditions that were defined in the main paper and are repeated here for convenience. A node is a leaf node if any of the following apply: (i) instances supplied to the node are of the same class, (ii) the number of instances supplied is less than some user-defined value, \(m_s\), or (iii) the node is at the maximum depth of the tree, d. Condition (i) is clear as further splitting is unnecessary. Condition (ii) ensures that the splitting does not result in ultra-fine splitting with only a few instances at the leaf nodes; such a result can often indicate that a model has overfitted to the training set. Condition (iii) is in fact a hyperparameter that we will discuss shortly.
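To make these three leaf conditions concrete, the following minimal sketch expresses them as a single predicate; the parameter names min_samples (for \(m_s\)) and max_depth (for d) are illustrative stand-ins rather than names from the authors' implementation.

```python
import numpy as np

def is_leaf(labels: np.ndarray, depth: int, min_samples: int, max_depth: int) -> bool:
    """Check the three leaf conditions for a decision-tree node.

    (i)   all instances at the node share the same class,
    (ii)  fewer than `min_samples` instances were supplied, or
    (iii) the node sits at the maximum tree depth.
    """
    same_class = np.unique(labels).size <= 1      # condition (i)
    too_few = labels.size < min_samples           # condition (ii)
    at_max_depth = depth >= max_depth             # condition (iii)
    return same_class or too_few or at_max_depth
```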

In the main text, it was seen that the goal of the tree was to isolate classes down different branches. Hence, after each step down the tree, the model becomes more and more certain of the class probability distribution for a particular instance. This is seen in Fig. 4a, which shows an amplification of certain class probabilities down any particular branch of the tree. As an aside, for many readers this will be reminiscent of Grover's algorithm, and it is clear why some of the very first quantum analogues of the DT involved Grover's algorithm (Lu and Braunstein 2014).

We saw that each node amplifies certain classes down certain branches, but not how this is carried out. There are various methods used in practice to form a split function, resulting in various tree algorithms such as CART, ID3, CHAID and C4.5. In this paper, we use CART to compare against our quantum model.

Fig. 4

a Amplification of class probabilities is shown along a particular path of nodes down a decision tree. Three classes are depicted, with their probability distributions shown as a graph next to each node in the path. During the prediction stage of the model, an instance that concludes at the bottom orange node will be predicted to be of the maroon class. b The support vector machine (SVM) is illustrated, with the model optimising for a separating hyperplane with maximum margin. It should be noted that though the SVM, as illustrated above, appears to require linearly separable data, SVMs can employ kernels and slack terms to circumvent this problem

CART algorithm

Classification and regression tree (CART) (Li et al. 1984) is one of the most commonly used DT algorithms and is able to support both numerical and categorical target variables. CART constructs a binary tree using the specific feature and threshold that attain the split with the largest information gain at each node. Mathematically, given data \(\mathcal {S}_j = \{(x_i, y_i) \}_{i=1}^{N_j}\) at a particular node j, the algorithm chooses the optimal parameter \(\theta = (l, t)\), where l is a feature and t is the threshold variable, such that the information gain (as defined in Eq. 2) is maximised:

$$\begin{aligned} \theta ^*_j = \underset{\theta }{\mathop {\mathrm {arg\,max}}\limits }\ \text {IG}\Big (\mathcal {S}_j; \mathcal {S}_j^L (\theta ), \mathcal {S}_j^R(\theta ) \Big ) \end{aligned}$$
(A1)

where \(\mathcal {S}_j^L(\theta ) \) and \(\mathcal {S}_j^R(\theta )\) are the partitioned data sets defined as

$$\begin{aligned} \mathcal {S}_j^L(\theta )&= \{(x,y) \in \mathcal {S}_j | x^{(l)} \le t\} \end{aligned}$$
(A2)
$$\begin{aligned} \mathcal {S}_j^R(\theta )&= \mathcal {S}_j \backslash \mathcal {S}_j^L(\theta ). \end{aligned}$$
(A3)

where \(x^{(l)}\) is the lth component of vector x. This maximisation is then carried out recursively for \(\mathcal {S}_j^L(\theta ) \) and \(\mathcal {S}_j^R(\theta )\) until a leaf condition is met. Clearly, the CART algorithm is most natural for continuous feature spaces, with the geometric interpretation of slicing the feature space with hyperplanes perpendicular to feature axes. This can in fact be generalised to oblique hyperplanes—employed by perceptron decision trees (PDTs)—where the condition \(x^{(l)} \le t\) becomes \( x\cdot w\le t\), giving an optimisation over \(\theta = (w, t)\). This optimisation is generally carried out using gradient descent, while it is also possible to use a support vector machine (SVM). This is in fact the inspiration for the quantum decision tree proposed in this work: the QDT-NQKE developed in Sect. 2 and elaborated further in Appendix 2 obtains an optimal hyperplane with a kernel-SVM employing a quantum kernel.
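As an illustration of the split search in Eqs. A1-A3, the sketch below performs an exhaustive axis-aligned search over \(\theta = (l, t)\) using the Gini impurity as the split criterion; Eq. 2 of the main text may define the information gain with a different impurity measure, and the function names here are purely illustrative.

```python
import numpy as np

def gini(y: np.ndarray) -> float:
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, y_left, y_right) -> float:
    """Impurity decrease of a candidate split S_j -> (S_j^L, S_j^R)."""
    n, n_l, n_r = len(y), len(y_left), len(y_right)
    return gini(y) - (n_l / n) * gini(y_left) - (n_r / n) * gini(y_right)

def best_axis_aligned_split(X: np.ndarray, y: np.ndarray):
    """Search over features l and thresholds t for theta* = (l, t) maximising the gain."""
    best = (None, None, -np.inf)                  # (feature l, threshold t, gain)
    for l in range(X.shape[1]):
        for t in np.unique(X[:, l]):
            left = X[:, l] <= t                   # membership of S_j^L(theta)
            if left.all() or not left.any():      # skip degenerate splits
                continue
            gain = information_gain(y, y[left], y[~left])
            if gain > best[2]:
                best = (l, t, gain)
    return best
```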

Aside from the main approach of the model, there are consequential hyperparameters that greatly affect the structure of the tree at training. The first has already been set: the number of splits at a node. Since we are not dealing with categorical attributes in this work, setting the number of splits to two can be compensated for by increasing the allowed depth of the tree. This brings us to the next hyperparameter, the maximum depth, d, of the tree, which regulates the activation of criterion (iii) for identifying a leaf node. Allowing a larger depth means that the DT becomes more expressive while at the same time becoming more prone to overfitting. This is a classic example of the bias-variance trade-off in statistical learning theory that one must consider.

1.1.2 b.   Random forests

The decision tree alone is in fact a weak learner, as it has the tendency to overfit to the training data. In other words, it struggles to generalise predictions to instances that were not supplied during training. Understanding the problem of overfitting is crucial to all ML algorithms, and the use of regularisation techniques is required for the high performance of many models. In the case of the decision tree, overfitting is addressed by taking an ensemble (forest) of trees. An ensemble alone, however, is unlikely to provide relief from this problem: many trees making identical decisions—as they are trained from the same data—will not decrease overfitting. When using an ensemble, one must ensure that each tree classifier is trained uniquely. This is done through injecting randomness.

It is fundamental that randomness is incorporated into the model during its training. This ensures that the classifiers are uncorrelated and hence provide the most benefit from the ensemble structure. The ways in which randomness is commonly introduced are bagging, boosting and randomised node optimisation (RNO). Bagging attempts to reduce the variance of a model by generating random subsets of the training data set for each decision tree in the forest. Boosting is essentially an extension to this; however, the trees are learned sequentially, with instances that are poorly predicted occurring with greater frequency in subsequent trees. The idea is that instances that are harder to learn are sampled more often—hence boosted. Finally, RNO makes random restrictions on the way that a split function can be optimised. In practice, this could be the selection of only a subset of features over which to train the split function. A forest of trees with these random additions is therefore referred to as a random forest (RF). Randomness is at the heart of RFs, and hence it is crucial that the QRF is able to inject randomness into its structure to profit from the ensemble created. We will see that this arises naturally for the QRF in Section B.
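The two most common injections of randomness described above can be sketched in a few lines; this is a generic illustration of bagging and RNO, not the specific randomisation scheme used by the QRF.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_sample(X: np.ndarray, y: np.ndarray):
    """Bagging: sample N instances with replacement for one tree."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def random_feature_subset(n_features: int, k: int) -> np.ndarray:
    """RNO: restrict a split node to k randomly chosen features."""
    return rng.choice(n_features, size=k, replace=False)
```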

1.2 2.   Support vector machines

A support vector machine (SVM) is a supervised learning algorithm that aims to find a separating hyperplane (decision boundary) with maximum margin between two classes of instances. Compared to other ML models, SVMs tend to perform well with comparatively small numbers of training instances but become impractical for data sets of more than a few thousand. This is the regime in which we employ an SVM when constructing the QRF.

1.2.1 a.   Linear SVM

We now introduce the SVM starting with the most natural linear case before then showing that the method can be extended to non-linear decision boundaries with the use of kernels. A more in-depth discussion of SVMs can be found in Schölkopf and Smola (2018). Given a set of training instances from a binary concept class, \(\{(x_i, y_i) \}_{i=1}^K\) where \((x_i, y_i)\in \mathbb {R}^D \times \{-1, 1\}\), the SVM attempts to find a separating hyperplane that is defined by a vector perpendicular to it, \(w\in \mathbb {R}^D\), and a bias term, \(b\in \mathbb {R}\). Assuming that such a plane exists (requiring the data to be linearly separable), we have the condition

$$\begin{aligned} y_i (x_i \cdot w + b) > 0, \forall i=1, ...., K \end{aligned}$$
(A4)

However, here, we may observe two details. Firstly, there may exist many such hyperplanes for which this condition is satisfied, and secondly, w remains under-determined in this form, specifically its norm, ||w||. Both these problems are addressed by further requiring the following inequality:

$$\begin{aligned} y_i (x_i \cdot w + b) \ge 1, \forall i=1, ...., K \end{aligned}$$
(A5)

Clearly, Eq. A5 implies Eq. A4; however, we also have the interpretation of introducing a region on either side of the hyperplane where no data points lie. The plane-perpendicular width of this region is referred to as the margin (shown in Fig. 4b) where, from Eq. A5, it can be shown to be \(2/||w||_2\). Intuitively, we strive to have the largest margin possible, and hence, we formulate the problem as a constrained convex optimisation problem,

$$\begin{aligned} \text {minimise \ \ }&\frac{1}{2} ||w||^2_2 \nonumber \\ \text {subject to: \ \ }&y_i (x_i \cdot w + b) \ge 1, \forall i=1, ...., K \end{aligned}$$
(A6)

We can further introduce slack terms to allow the SVM to be optimised in cases with non-separable data sets,

$$\begin{aligned} \text {minimise \ \ }&\frac{1}{2} ||w||^2_2 + \frac{\lambda }{p}\sum _{i=1}^{K} \xi _i^{p} \nonumber \\ \text {subject to: \ \ }&y_i (x_i \cdot w + b) \ge 1-\xi _i, \forall i=1, ...., K \nonumber \\&\xi _i \ge 0, \forall i=1, ...., K \end{aligned}$$
(A7)

where \(\lambda > 0\) is a regularisation term that adjusts the willingness of the model to accept slack, and \(p\in \mathbb {N}\) is a constant. Most often, the L2 soft margin problem is solved with \(p=2\) resulting in a quadratic program. Hence, we can write the primal Lagrangian for the L2 soft margin program as

$$\begin{aligned} L(w, b, \xi ; \alpha , \mu ) =\ &\frac{1}{2} ||w||^2_2 + \frac{\lambda }{2}\sum _{i=1}^{K} \xi _i^{2} \nonumber \\ &+ \sum _{i=1}^{K} \alpha _i \Big [ 1 - \xi _i - y_i (x_i \cdot w + b)\Big ] - \sum _{i=1}^{K} \mu _i \xi _i \end{aligned}$$
(A8)

where \(\alpha , \mu \in \mathbb {R}^K\) are dual variables of the optimisation problem, also known as Lagrange multiplier vectors. In practice, this optimisation problem is reformulated into the Lagrange dual problem, with the Lagrange dual function,

$$\begin{aligned} \mathcal {L}(\alpha , \mu )&= \inf _{w, b, \xi } L(w, b, \xi ; \alpha , \mu )\end{aligned}$$
(A9)
$$\begin{aligned}&= \sum _{i=1}^K \alpha _i - \frac{1}{2}\sum _{i, j}^K \alpha _i \alpha _j y_i y_j x_i \cdot x_j - \frac{1}{2\lambda }\sum _{i=1}^K (\alpha _i + \mu _i)^2 \end{aligned}$$
(A10)
$$\begin{aligned}&\text {\ \ \ s.t. \ \ } \sum _{i=1}^K \alpha _i y_i = 0\nonumber \end{aligned}$$
(A11)

where Eq. A10 is obtained from setting the partial derivatives of the primal Lagrangian (in Eq. A8), with respect to \(w,b,\xi \), to zero. The dual problem is subsequently defined as

$$\begin{aligned} \text {maximise \ \ }&\mathcal {L}(\alpha , \mu ) \nonumber \\ \text {subject to: \ \ }&\alpha _i \ge 0, \forall i=1, ...., K \nonumber \\&\mu _i \ge 0, \forall i=1, ...., K \end{aligned}$$
(A12)

The dual program is convex (this is in fact independent of the convexity of the primal problem) with saddle point optimal solutions. However, with a convex primal problem in Eq. A7, the variables \((w^*, b^*, \xi ^*, \alpha ^*, \mu ^*)\) that satisfy the Karush-Kuhn-Tucker (KKT) conditions are both primal and dual optimal. Here, the KKT conditions are

$$\begin{aligned} w - \sum _{i=1}^K \alpha _i y_i x_i&= 0 \end{aligned}$$
(A13)
$$\begin{aligned} \sum _{i=1}^K \alpha _i y_i&= 0 \end{aligned}$$
(A14)
$$\begin{aligned} \lambda \xi _i - \alpha _i - \mu _i&= 0 \end{aligned}$$
(A15)
$$\begin{aligned} \alpha _i \big [ 1 - \xi _i - y_i (x_i \cdot w + b)\big ]&= 0 \end{aligned}$$
(A16)
$$\begin{aligned} y_i (x_i \cdot w + b) - 1 + \xi _i&\ge 0 \end{aligned}$$
(A17)
$$\begin{aligned} \mu _i \xi _i&= 0 \end{aligned}$$
(A18)
$$\begin{aligned} \mu _i&\ge 0 \end{aligned}$$
(A19)
$$\begin{aligned} \xi _i&\ge 0 \end{aligned}$$
(A20)

Solving the KKT conditions amounts to solving the problem of obtaining an optimal hyperplane. It is clear that the expression for the hyperplane normal w in Eq. A13 allows us to write the classification function for the linear SVM as

$$\begin{aligned} h(x) = \text {sign} \left( \sum _{i=1}^K \alpha _i y_i x\cdot x_i + b \right) \end{aligned}$$
(A21)

We can, however, make a further simplification by noting that the complementary slackness condition in Eq. A16 implies that only data points that lie on the margin have non-zero \(\alpha \). These points are referred to as support vectors and are illustrated in Fig. 4b. Hence, defining the index set of support vectors, \(\mathcal {S} \subseteq \{1,..., K\}\), we have

$$\begin{aligned} h(x) = \text {sign} \left( \sum _{s\in \mathcal {S}} \alpha _s y_s x\cdot x_s + b \right) \end{aligned}$$
(A22)

As a final note, the dual formulation allowed us to express the problem with training instances appearing only through pairwise similarity comparisons, i.e. the dot product. This allows one to generalise linear SVMs to non-linear SVMs by instead supplying an inner product on a transformed space.

1.2.2 b.   Non-linear SVM using kernels

Non-linear decision boundaries can be trained using an SVM by drawing a linear hyperplane through data embedded in a non-linearly transformed space. Let \(\phi :\mathcal {X}\rightarrow \mathcal {H}\) be a feature map such that \(x\rightarrow \phi (x)\), where \(\mathcal {H}\) is a Hilbert space with inner product \(\langle \cdot , \cdot \rangle _{\mathcal {H}}\). Hence, we have a kernel function defined by the following.

Definition 1

Let \(\mathcal {X}\) be a non-empty set, a function \(k:\mathcal {X} \times \mathcal {X} \rightarrow \mathbb {C}\) is called a kernel if there exists a \(\mathbb {C}\)-Hilbert space and a map \(\phi :\mathcal {X}\rightarrow \mathcal {H}\) such that \(\forall x_i, x_j \in \mathcal {X}\),

$$\begin{aligned} k(x_i, x_j)=\langle \phi (x_i), \phi (x_j) \rangle _{\mathcal {H}} \end{aligned}$$
(A23)

Though we intend to use quantum states that live in a complex space, the kernels used in this work will in fact map onto the real numbers. This will be made clear through the following elementary theorems (Paulsen and Raghupathi 2016).

Theorem 1

(Sum of kernels are kernels). Let \(k_1, k_2:\mathcal {X} \times \mathcal {X} \rightarrow \mathbb {C}\) be kernels. Then

$$k(x_i, x_j):= k_1(x_i, x_j) + k_2(x_i, x_j)$$

for \(x_i, x_j \in \mathcal {X}\) defines a kernel.

Theorem 2

(Product of kernels are kernels). Let \(k_1, k_2:\mathcal {X} \times \mathcal {X} \rightarrow \mathbb {C}\) be kernels. Then

$$k(x_i, x_j):= k_1(x_i, x_j) \cdot k_2(x_i, x_j)$$

for \(x_i, x_j \in \mathcal {X}\) defines a kernel.

The proofs for these theorems are quite straightforward and can be found in Christmann and Steinwart (2008). These theorems allow one to construct more complex kernels from simpler ones and, as alluded to earlier, they allow us to understand the transformation from an inner product on the complex field to one on the real. Let \(k':\mathcal {X}\times \mathcal {X}\rightarrow \mathbb {C}\) and \(k'':=k'^*\) be kernels, where \(^*\) denotes the complex conjugate. From Theorem 2, we are able to define a kernel, \(k: \mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}\), such that \(k(x_i, x_j) = k'(x_i, x_j) \cdot k''(x_i, x_j) = |k'(x_i, x_j)|^2\). In Section B, we will see that this is how we construct the quantum kernel: as the fidelity between two feature-embedded quantum states, \(k(x_i, x_j) = |\langle \phi (x_i) | \phi (x_j) \rangle |^2\).

The kernel function is clearly conjugate symmetric due to the axioms of an inner product. It is important to note that, given a feature map, the kernel is unique. However, the reverse does not hold: a kernel does not have a unique feature map. Furthermore, we will see that one does not need to define a feature map to be able to claim having a kernel. For this, we need the concept of positive definiteness.

Definition 2

A symmetric function, \(k: \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}\) is positive definite if for all \(a_i \in \mathbb {R}\), \(x_i \in \mathcal {X}\) and \(N\ge 1\),

$$\begin{aligned} \sum _{i=1}^N \sum _{j=1}^N a_i a_j k(x_i, x_j) \ge 0 \end{aligned}$$
(A24)

All inner products are positive definite, which clearly implies the positive definiteness of the kernel in Eq. A23. Interestingly, this is in fact an equivalence, with a positive definite function guaranteed to be an inner product on a Hilbert space \(\mathcal {H}\) with an appropriate map \(\phi :\mathcal {X} \rightarrow \mathcal {H}\) (Christmann and Steinwart 2008).

Theorem 3

(Symmetric, positive definite functions are kernels). A function \(k:\mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}\) is a kernel if and only if it is symmetric and positive definite.

This theorem allows us to generate kernels without requiring the specific feature map that generated the kernel. Furthermore, we are able to compute a kernel that may have a feature map that is computationally infeasible. This concept is crucial to the emergence of kernel methods and will be the main inspiration for quantum kernel estimation.
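Theorem 3 suggests a simple numerical sanity check: on any finite sample, a candidate similarity function is a valid kernel only if its Gram matrix is symmetric and has no negative eigenvalues (up to numerical tolerance). The sketch below performs this check; the tolerance value is an arbitrary choice.

```python
import numpy as np

def is_valid_kernel_on_sample(k, X: np.ndarray, tol: float = 1e-10) -> bool:
    """Check symmetry and positive semi-definiteness of the Gram matrix K_ij = k(x_i, x_j)."""
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    symmetric = np.allclose(K, K.T)
    psd = np.linalg.eigvalsh(K).min() >= -tol   # eigvalsh assumes a symmetric input
    return symmetric and psd
```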

Returning to the issue of non-linear decision boundaries, we observe that the linear SVM, from Section A 2 a, can be extended by simply generalising the inner product on Euclidean space with a kernel function. This is known as the kernel trick, where data is now represented as pairwise similarity comparisons that may be associated with an embedding in a higher dimensional (or even infinite dimensional) space. We therefore have the following dual program:

$$\begin{aligned} \text {maximise \ \ } \sum _{i=1}^K \alpha _i&- \frac{1}{2}\sum _{i, j}^K \alpha _i \alpha _j y_i y_j k(x_i, x_j) \nonumber \\&- \frac{1}{2\lambda }\sum _{i=1}^K (\alpha _i + \mu _i)^2 \end{aligned}$$
(A25)
$$\begin{aligned} \text {subject to: \ \ }&\alpha _i \ge 0, \forall i=1, ...., K \end{aligned}$$
(A26)
$$\begin{aligned}&\mu _i \ge 0, \forall i=1, ...., K \end{aligned}$$
(A27)
$$\begin{aligned}&\sum _{i=1}^K \alpha _i y_i = 0 \end{aligned}$$
(A28)

The classification function is also quite similar with the replacement of the kernel for the dot product. However, we will obtain its form from the discussion of reproducing kernels and their associated reproducing kernel Hilbert spaces.

1.2.3 c.   Reproducing Kernel Hilbert spaces and the representer theorem

The concept of a reproducing kernel Hilbert space (RKHS) is invaluable to the field of statistical learning theory, as it accommodates the well-known representer theorem. This will give us a new perspective on the derivation of the SVM optimisation. Furthermore, it provides a guarantee that an optimal classification function can be composed of a weighted sum of kernel functionals over the training set.

Definition 3

Let \(\mathcal {F}\subset \mathbb {C}^{\mathcal {X}}\) be a set of functions forming a Hilbert space with inner product \(\langle \cdot , \cdot \rangle _{\mathcal {F}}\) and norm \(||f||=\langle f, f\rangle _{\mathcal {F}}^{\frac{1}{2}}\) where \(f\in \mathcal {F}\). A function \(\kappa :\mathcal {X} \times \mathcal {X} \rightarrow \mathbb {F}\), for some field \(\mathbb {F}\), is called a reproducing kernel of \(\mathcal {F}\) provided that,

  1. (i)

    \(\forall x \in \mathcal {X}\), \(\kappa _x (\cdot ):= \kappa (\cdot , x) \in \mathcal {F}\), and,

  2. (ii)

    \(\forall x \in \mathcal {X}\), \(\forall f\in \mathcal {F}\), \(\langle f, \kappa (\cdot , x) \rangle _{\mathcal {F}} = f(x)\) (reproducing property)

are satisfied. The Hilbert space, \(\mathcal {F}\)—for which there exists such a reproducing kernel—is therefore referred to as a reproducing kernel Hilbert space.

It is important to note that the definition of a kernel is not explicitly stated in Definition 3. Rather, we see that for any \(x_i,x_j\in \mathcal {X}\), we have the following property of the reproducing kernel:

$$\begin{aligned} \kappa (x_i, x_j) = \langle \kappa (\cdot , x_i), \kappa (\cdot , x_j) \rangle _\mathcal {F} \end{aligned}$$
(A29)

When compared with Eq. A23, we see that we have a feature map of the form \(\phi (x) = \kappa (\cdot , x)\). Therefore, it is evident that the reproducing property implies that \(\kappa \) is a kernel as per Definition 1. We also have the reverse:

Theorem 4

(Moore-Aronszajn (Aronszajn 1950)). For every non-empty set \(\mathcal {X}\), a function \(k:\mathcal {X}\times \mathcal {X} \rightarrow \mathbb {R}\) is positive definite if and only if it is a reproducing kernel.

This implies that every positive definite kernel is associated with a unique RKHS. This has implications for the Nyström approximated quantum kernel. However, we leave reconstructing the associated RKHS for future work.

To understand the significance of the representer theorem, we state the general problem kernel methods attempt to solve—this includes the non-linear SVM. Given some loss function \(\mathscr {L}\) that quantifies the error of a learning model against labelled training data \(\{(\textbf{x}_i, y_i)\}_{i=1}^{N}\), we aim to find an optimal function \(f^*\) such that

$$\begin{aligned} f^* = \underset{f\in \mathcal {F}}{\mathop {\mathrm {arg\,min}}\limits } \frac{1}{N} \sum _{i=1}^N \mathscr {L}(y_i, f(x_i)) + \lambda ||f||^2_{\mathcal {F}} \end{aligned}$$
(A30)

where \(\lambda \ge 0\) and \(\mathcal {F}\) is the RKHS with reproducing kernel k. This optimisation problem is quite general and further encompasses algorithms such as kernel ridge regression (Vovk 2013). However, it is not clear that such an optimisation is efficiently computable. We will see that the representer theorem will allow us to simplify the problem so that the whole space of functions need not be searched in order to find the optimal \(f^*\).

Theorem 5

(The representer theorem (Kimeldorf and Wahba 1971; Schölkopf et al. 2001)). Let \(\mathcal {F}\) be a RKHS with an associated reproducing kernel, k. Given a set of labelled points \(\{(\textbf{x}_i, y_i)\}_{i=1}^{N}\) with loss function \(\mathscr {L}\) and a strictly monotonically increasing regularisation function, \(P:\mathbb {R}^{+}_0 \rightarrow \mathbb {R}\), we consider the following optimisation problem:

$$\begin{aligned} \min _{f\in \mathcal {F}} \frac{1}{N} \sum _{i=1}^N \mathscr {L}(y_i, f(x_i)) + P(||f||_{\mathcal {F}}) \end{aligned}$$
(A31)

Any function \(f^* \in \mathcal {F}\) that minimises Eq. A31 can be written as

$$\begin{aligned} f^* = \sum _{i=1}^N \alpha _i k(\cdot , x_i) \end{aligned}$$
(A32)

where \(\alpha _i \in \mathbb {R}\).

The significance of this theorem comes from the fact that solutions to kernel methods with high dimensional (or even infinite) functionals are restricted to a subspace spanned by the representers of the data, thereby reducing the optimisation of functionals to optimising scalar coefficients.

Now, coming back to SVMs, the optimal classification function is given by the sign of \(W^*\), the solution of the following:

$$\begin{aligned} W^* = \underset{W\in \mathcal {F}}{\mathop {\mathrm {arg\,min}}\limits } \frac{1}{N} \sum _{i=1}^N \max (0, 1 - y_i W(x_i)) + \frac{\lambda }{2} ||W||^2_{\mathcal {F}} \end{aligned}$$
(A33)

where we will see that \(W^*:\mathcal {X} \rightarrow \mathbb {R}\) corresponds to the optimal separating hyperplane. Now, using the representer theorem, we let \(W = \sum _{i=1}^N \alpha _i k(\cdot , x_i)\) and substitute into Eq. A33 to obtain the program derived in Eq. A25. Though the approach to the optimisation of the SVM in this section differs from the margin maximisation illustrated earlier, we are required to solve the same convex optimisation problem.

1.3 3.   Generalisation error bounds

The most important characteristic of a machine learning model is that it is able to generalise from current observations to predict outcome values for previously unseen data. This is quantified by what is known as the generalisation error of the model.

Definition 4

(Generalisation error) Given a hypothesis \(h\in \mathbb {H}\), a target concept \(c\in \mathcal {C}\) and an underlying distribution \(\mathcal {D}\), the generalisation error (or risk) of h is defined as

$$\begin{aligned} \mathcal {R}(h) = \Pr _{x\sim \mathcal {D}} [h(x) \ne c(x)] \end{aligned}$$
(A34)

However, both the underlying distribution \(\mathcal {D}\) of the data and the target concept c are unknown; hence, in practice we measure the empirical error over a finite set of samples.

Definition 5

(Empirical error) Given a hypothesis \(h\in \mathbb {H}\) and samples \(S=(z_1,..., z_N)\), where \(z_i=(x_i, y_i)\), the empirical error of h is defined as

$$\begin{aligned} \widehat{\mathcal {R}}(h) = \frac{1}{N}\sum _{i=1}^N \textbf{1}_{h(x_i)\ne y_i} \end{aligned}$$
(A35)

where \(\textbf{1}_a\) is the indicator function of the event a: 1 when a is true and 0 otherwise.

The aim of this section is to provide theoretical bounds on the generalisation error, with the empirical error being depicted in numerical results. To provide these bounds, we must first distinguish the strength of models. There exists a range of tools in statistical learning theory to quantify the richness of models. One such measure is the Rademacher complexity that measures the degree to which a model—defined by its hypothesis set—can fit random noise. Note that this is independent of the trainability of the model.

Definition 6

(Empirical Rademacher complexity) Given a loss function \(\Gamma :\mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R}\) and a hypothesis set \(\mathbb {H}\), let \(\mathcal {S} =((x_1, y_1),..., (x_N, y_N))\) be a fixed sample set of size N, and \(\mathcal {G}=\{g:\mathcal {X}\times \mathcal {Y}\rightarrow \mathbb {R}| g(x,y)=\Gamma (h(x), y), h\in \mathbb {H}\}\) be a family of functions. The empirical Rademacher complexity of \(\mathcal {G}\) with respect to the sample set \(\mathcal {S}\) is defined as

$$\begin{aligned} \widetilde{\mathfrak {R}}_{\mathcal {S}} (\mathcal {G}) = \mathop {\mathbb {E}}_{\sigma } \Bigg [ \mathop {\text {sup}}_{g\in \mathcal {G}}\frac{1}{N} \sum _{i=1}^N \sigma _i g(x_i, y_i)\Bigg ] \end{aligned}$$
(A36)

where \(\sigma _i \in \{-1, +1\}\) are independent uniform random variables known as Rademacher variables.
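For a finite family of functions, the supremum in Eq. A36 reduces to a maximum, and the empirical Rademacher complexity can then be estimated by Monte Carlo over the Rademacher variables. The sketch below is only an illustration of the definition (for a finite, user-supplied family gs), not a bound used elsewhere in the paper.

```python
import numpy as np

def empirical_rademacher(gs, X, y, n_draws: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the empirical Rademacher complexity of a finite family gs.

    Each g in gs maps (x_i, y_i) to a real loss value, so the supremum in
    Eq. A36 reduces to a maximum over the finite family.
    """
    rng = np.random.default_rng(seed)
    # Pre-evaluate g(x_i, y_i) for every function and instance: shape (|gs|, N).
    G = np.array([[g(xi, yi) for xi, yi in zip(X, y)] for g in gs])
    N = G.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)   # Rademacher variables
        total += np.max(G @ sigma) / N            # sup over gs of (1/N) sum_i sigma_i g_i
    return total / n_draws
```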

Definition 7

(Rademacher complexity) Let samples be drawn from some underlying distribution \(\mathcal {D}\), such that \(\mathcal {S}\sim \mathcal {D}^N\). For \(N\ge 1\), the Rademacher complexity of \(\mathcal {G}\) is defined as the expectation of the empirical Rademacher complexity over all samples of size N, each drawn from \(\mathcal {D}\), i.e.

$$\begin{aligned} \mathfrak {R}_N (\mathcal {G}) = \mathop {\mathbb {E}}_{\mathcal {S}\sim \mathcal {D}^N} \big [\widetilde{\mathfrak {R}}_{\mathcal {S}} (\mathcal {G}) \big ] \end{aligned}$$
(A37)

The Rademacher complexity now allows us to state a well-known theorem in statistical learning theory that provides an upper bound to the expectation value of function \(g\in \mathcal {G}\).

Theorem 6

(Theorem 3.1, (Mohri and Rostamizadeh 2019)) Let \(\mathcal {G}\) be a family of functions mapping from \(\mathcal {X}\times \mathcal {Y}\) to [0, 1]. Then for any \(\delta >0\) and for all \(g\in \mathcal {G}\), with probability of at least \(1-\delta \) we have,

$$\begin{aligned} \mathbb {E}_{(x,y)\sim \mathcal {D}}[g(x, y)] \le \frac{1}{N} \sum _{i=1}^N g(x_i, y_i) + 2 \widetilde{\mathfrak {R}}_{\mathcal {S}} (\mathcal {G}) + 3 \sqrt{\frac{\log (1/\delta )}{2N}} \end{aligned}$$
(A38)

We see from Theorem 6 that a model that is more expressive, i.e. has a greater \(\widetilde{\mathfrak {R}}_{\mathcal {S}}\), is in fact detrimental to bounding the loss on unseen data points. This supports the observation that overfitting to training instances most often occurs when a model is over-parameterised.

The theory of generalisation error discussed in this section will become important when we attempt to quantify the performance of the QRF and its constituent parts. It is however important to note that the Rademacher complexity is only one particular way of attaining a generalisation bound. We will later come across sharper bounds that can be obtained using margin-based methods in the context of separating hyperplanes.

1.4 4.   Nyström method for kernel-based learning

The elegance of kernel methods is greatly limited by the \(\mathcal {O}(N^2)\) operations required to compute and store the kernel matrix, where N is the number of training instances. In the age of big data, this greatly reduces the practicality of kernel methods for many large-scale applications. Furthermore, the number of kernel element computations becomes crucial in quantum kernel estimation, where quantum resources are far more limited. However, there have been solutions to reduce this complexity that have had varying degrees of success (Peters et al. 2021). In this work, we approximate the kernel matrix using the Nyström method, which requires only a subset of the matrix to be sampled. This approximation is predicated on the manifold hypothesis, which in statistical learning refers to the phenomenon of high-dimensional real-world data sets usually lying on low-dimensional manifolds. The result is that the rank of the kernel matrix is often far smaller than N, in which case an approximation using a subset of data points is a plausible, effective solution.

The Nyström method was introduced in the context of integral equations with the form of an eigen equation (Williams and Seeger 2001),

$$\begin{aligned} \int k(y, x) \phi _i (x) p(x) dx = \lambda _i \phi _i (y) \end{aligned}$$
(A39)

where p(x) is the probability density function of the input x, k is the symmetric positive semi-definite kernel function, and \(\{\lambda _1, \lambda _2,...\}\) denote the non-increasing eigenvalues associated with the p-orthogonal eigenfunctions \(\{\phi _1, \phi _2,...\}\), i.e. \(\int \phi _i(x)\phi _j(x)p(x)dx=\delta _{ij}\). Equation A39 can be approximated, given i.i.d. samples \(\mathcal {X} = \{x_j\}_{j=1}^N\) from p(x), by replacing the integral with the empirical average

$$\begin{aligned} \frac{1}{N} \sum _{j=1}^N k(y, x_j) \phi _i (x_j) \approx \lambda _i \phi _i (y) \end{aligned}$$
(A40)

This has the form of a matrix eigenvalue problem:

$$\begin{aligned} K^{(N)} \Phi ^{(N)} = \Phi ^{(N)} \Lambda ^{(N)} \end{aligned}$$
(A41)

where \(K_{ij}^{(N)}=k(x_i, x_j)\) for \(i,j=1,...,N\) is the kernel (Gram) matrix, \(\Phi _{ij}^{(N)} \approx \frac{1}{\sqrt{N}}\phi _j(x_i)\) contains the eigenvectors and the diagonal matrix \(\Lambda _{ii}^{(N)} \approx N\lambda _i\) contains the eigenvalues. Therefore, solving this matrix equation will subsequently give approximations to the initially sought-after eigenfunction \(\phi _i\) in Eq. A39,

$$\begin{aligned} \phi _i(y) \approx \frac{\sqrt{N}}{\Lambda _{ii}^{(N)}} \sum _{j=1}^N k(y, x_j) \Phi ^{(N)}_{ji} \end{aligned}$$
(A42)

The effectiveness of this approximation is determined by the number of samples, \(\mathcal {X}=\{x_i\}_{i=1}^N\)—with a greater number resulting in a better approximation. Applying similar reasoning, a subset \(\mathcal {Z}=\{z_i\}_{i=1}^{L} \subset \mathcal {X}\) with \(L< N\) points can be used to approximate the eigenvalue problem in Eq. A41. This approximation is precisely the Nyström method, with the points in the set \(\mathcal {Z}\) referred to as landmark points. More specifically, given the eigensystem of the full kernel matrix, \(K \Phi _{\mathcal {X}} = \Phi _{\mathcal {X}} \Lambda _{\mathcal {X}}\) with \(K_{ij}=k(x_i, x_j)\), and equivalently \(W \Phi _{\mathcal {Z}} = \Phi _{\mathcal {Z}} \Lambda _{\mathcal {Z}}\) with \(W_{ij}=k(z_i, z_j)\), we make the following approximation (Williams and Seeger 2001):

$$\begin{aligned} \Phi _{\mathcal {X}} \approx \sqrt{\frac{L}{N}} E \Phi _{\mathcal {Z}}\Lambda _{\mathcal {Z}}^{-1} \text { , \ \ } \Lambda _{\mathcal {X}} \approx \frac{N}{L} \Lambda _{\mathcal {Z}} \end{aligned}$$
(A43)

where \(E\in \mathbb {R}^{N\times L}\) with \(E_{ij} = k(x_i, z_j)\), assuming without loss of generality \(E:=[W,B]^\top \) with \(B\in \mathbb {R}^{(N-L)\times L}\). Combining \(K= \Phi _{\mathcal {X}} \Lambda _{\mathcal {X}} \Phi _{\mathcal {X}}^\top \) with the approximation in Eq. A43, we find

$$\begin{aligned} K&\approx \Bigg (\sqrt{\frac{L}{N}} E \Phi _{\mathcal {Z}}\Lambda _{\mathcal {Z}}^{-1}\Bigg ) \Big ( \frac{N}{L} \Lambda _{\mathcal {Z}} \Big ) \Bigg ( \sqrt{\frac{L}{N}} E \Phi _{\mathcal {Z}}\Lambda _{\mathcal {Z}}^{-1}\Bigg )^\top \end{aligned}$$
(A44)
$$\begin{aligned}&= EW^{-1}E^\top \end{aligned}$$
(A45)

where \(W^{-1}\) is the pseudo-inverse of W. In practice, one usually takes the best rank-r approximation of W with respect to the spectral or Frobenius norm prior to taking the pseudo-inverse, i.e. \(W^{-1}_r = \sum _{i=1}^r \lambda _i^{-1} u_i u_i^\top \) where \(r\le L\), with orthonormal eigenvectors \(\{u_i\}_{i=1}^L\) and associated non-increasing eigenvalues \(\{\lambda _i\}_{i=1}^L\) of the matrix W. This will later ensure that we can avoid \(\left\Vert W^{-1}\right\Vert _2\) becoming unbounded, as arbitrarily small eigenvalues of W are omitted. Now, expanding the matrix E in Eq. A45, we have the Nyström approximation given by

$$\begin{aligned} K \approx \widehat{K} := \begin{bmatrix} W &{} B \\ B^\top &{} B^\top W^{-1} B \end{bmatrix}. \end{aligned}$$
(A46)

In other words, by computing only the L columns \(E=[W, B]^\top \in \mathbb {R}^{N\times L}\), we are able to approximate the full \(K \in \mathbb {R}^{N\times N}\) Gram matrix required for use in kernel methods. In practice, for its application in SVMs, we are able to interpret this approximation as a map. To see this, we expand the approximation in Eq. A45:

$$\begin{aligned} \widehat{\Phi }^\top \widehat{\Phi } := \widehat{K}&= EW^{-1}E^\top \end{aligned}$$
(A47)
$$\begin{aligned}&= E W^{-\frac{1}{2}} W^{-\frac{1}{2}} E^\top \end{aligned}$$
(A48)
$$\begin{aligned}&=\Big (E W^{-\frac{1}{2}} \Big ) \Big (E W^{-\frac{1}{2}} \Big )^\top \end{aligned}$$
(A49)

where we use the fact that \((W^{-\frac{1}{2}})^\top = W^{-\frac{1}{2}}\). Hence, we have an approximated kernel space, \(\widehat{\Phi }=\Big (E W^{-\frac{1}{2}} \Big )^\top \), i.e. given a point x, the associated vector in kernel space is \(\widehat{x}=(vW^{-1/2})^\top \in \mathbb {R}^{L}\) where \(v=[k(x, z_1),..., k(x, z_{L})]\). To make this clear, we explicitly define the Nyström feature map (NFM), \(\textbf{N}_W:\mathcal {X}\rightarrow \mathbb {R}^{L}\), such that

$$\begin{aligned} \Big (\textbf{N}_W (x)\Big )_i = \sum _{j=1}^{L} k(x, z_j) (W^{-1/2})_{ij} \end{aligned}$$
(A50)

where \(i=1,...,L\) and \(\{z_i\}_{i=1}^{L}\) are the landmark points of the approximation. The set \(\widehat{\mathcal {X}}=\{ \textbf{N}_W (x): x \in \mathcal {X} \}\) is now used as the training set for fitting a linear SVM. Therefore, employing Nyström approximation for kernel SVMs is identical to transforming the training data to the kernel space prior to the training of the SVM.
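A minimal NumPy sketch of the Nyström feature map of Eq. A50 is given below, with landmark points drawn uniformly at random as in this work and the rank truncation of W described above; the variable names are illustrative and this is not the authors' implementation.

```python
import numpy as np

def nystrom_feature_map(kernel, X_train, n_landmarks: int, rank: int = None, seed: int = 0):
    """Return landmarks Z and a map N_W(x) approximating the kernel feature space."""
    rng = np.random.default_rng(seed)
    Z = X_train[rng.choice(len(X_train), size=n_landmarks, replace=False)]
    W = np.array([[kernel(zi, zj) for zj in Z] for zi in Z])

    # Best rank-r approximation of W before the pseudo-inverse square root,
    # discarding small eigenvalues so that ||W^{-1}|| stays bounded.
    vals, vecs = np.linalg.eigh(W)
    order = np.argsort(vals)[::-1][: (rank or n_landmarks)]
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-12
    W_inv_sqrt = vecs[:, keep] @ np.diag(vals[keep] ** -0.5) @ vecs[:, keep].T

    def feature_map(x):
        v = np.array([kernel(x, z) for z in Z])   # v = [k(x, z_1), ..., k(x, z_L)]
        return W_inv_sqrt @ v                     # N_W(x) = W^{-1/2} v

    return Z, feature_map
```

The transformed set \(\{\textbf{N}_W(x)\}\) can then be fed to any linear SVM, mirroring the statement above that the Nyström approximation amounts to a pre-transformation of the training data.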

It should be noted that the performance of the approximation is significantly influenced by the specific landmark points \(\{z_i\}_{i=1}^{L}\) selected—especially since we generally have \(L<<N\). Consequently, there has been a great deal of work exploring strategies to sample sets of optimal landmark points. Apart from uniform sampling from the data set, it was suggested in Drineas and Mahoney (2005) that the ith column be sampled with weight proportional to its diagonal element, \(k(x_i, x_i)\). In the case of quantum kernel estimation, where the diagonal entries are always 1, both techniques are therefore identical. Furthermore, there exist more complex schemes based on using k-means clustering to sample columns (Zhang and Kwok 2010) and greedy algorithms to sample based on the feature space distance between a potential column and the span of previously chosen columns. In this work, we randomly sample columns, and it is left for future work to determine whether there are any advantages to doing otherwise with regard to the quantum case.

Computationally, the Nyström method requires performing a singular value decomposition (SVD) on W that has a complexity of \(\mathcal {O}(L^3)\). In addition to the matrix multiplication required, the overall computational complexity is \(\mathcal {O}(L^3 + NL^2)\) while the space complexity is only \(\mathcal {O}(L^2)\). For the case where \(L<<N\), this is a significant improvement to the \(\mathcal {O}(N^2)\) space and time complexity required without Nyström. Nonetheless, this method is only an approximation, and hence, it is useful to provide bounds to its error.

1.4.1 a.   Error bounds to low-rank Nyström approximation

The Nyström method is most effective in cases where the full kernel matrix K has low rank. In such cases, the approximation (i) does not depend heavily on the specific columns chosen, and (ii) is able to faithfully project to a lower dimensional space to form a low-rank matrix that does not differ greatly from the real matrix. However, the rank of a specific kernel matrix is dependent on both the training set and the specific kernel function, meaning that a rank-dependent bound would be impractical. We therefore provide bounds on the error of the Nyström approximation through the spectral norm of the difference between the approximated and actual kernel matrices, \(\left\Vert K-\widehat{K}\right\Vert _2\). It should be noted that, empirically, it is generally observed that as the rank is increased, the relative prediction error decays far more quickly than the error in the matrix approximation (Bach 2013). This suggests that bounds based on matrix norms may not be optimal if the goal is maximising the overall performance of the model. Nonetheless, we provide the following bound on the Nyström approximation error that will become important for bounding the generalisation error of our quantum model.

Theorem 7

(Theorem 3, (Drineas and Mahoney 2005)) Let K be an \(N\times N\) symmetric positive semi-definite matrix, with Nyström approximation \(\widehat{K}=EW^{-1}E^\top \) constructed by sampling L columns of K with probabilities \(\{p_i\}_{i=1}^N\) such that \(p_i =K_{ii}/\sum _{j=1}^N K_{jj}\). In addition, let \(\epsilon >0\) and \(\eta =1+\sqrt{8\log (1/\delta )}\). If \(L\ge 4\eta ^2/\epsilon ^2\) then with probability at least \(1-\delta \),

$$\begin{aligned} \left\Vert K-\widehat{K}\right\Vert _2 \le \epsilon \sum _{i=1}^N K_{ii}^2 \end{aligned}$$
(A51)

Following this theorem, we have the subsequent corollary that bounds the error in the kernel matrix in the case where the diagonal matrix entries are equal to 1. This special case is true for quantum kernels.

Corollary 1

Given \(K_{ii}=1\) for all \(i=1,...,N\), with probability at least \(1-\delta \) we have

$$\begin{aligned} \left\Vert K-\widehat{K}\right\Vert _2 \le \frac{N}{\sqrt{L}}\Big ( 1+ \sqrt{8 \log (1/\delta )}\Big ) = \mathcal {O}\Bigg (\frac{N}{\sqrt{L}}\Bigg ). \end{aligned}$$
(B1)

Proof

This follows trivially from choosing \(L=4\eta ^2 / \epsilon ^2\). \(\square \)

The bounds in Theorem 7 and Corollary 1 are clearly not sharp. There have been subsequent works providing sharper bounds based on the spectral properties of the kernel matrix, such as Jin et al. (2013), where the authors improve the upper bound to \(\mathcal {O}(N/L^{1-p})\) when the spectrum follows a p-power law. In other words, they show that the bound can be improved by observing the decay of the eigenvalues, with a faster decay giving a stronger bound.

Appendix 2.   The quantum random forest

This section will elaborate upon the inner workings of the QRF beyond the overview discussed in the main text. This includes the algorithmic structure of the QRF training process, the quantum embedding, datasets used and relabelling strategies employed. We also supply proofs for the error bounds claimed in the main text.

The QRF is an ensemble method that averages the hypotheses of many DTs. To obtain an intuition for the performance of the averaged ensemble hypothesis, we consider the L2 risk in the binary learning case and define the ensemble hypothesis as \(H:= \mathbb {E}_{\textbf{Q}} [h_{\textbf{Q}}]\), where \(h_{\textbf{Q}}\) is the hypothesis of each DT. We have the following theorem.

Theorem 8

(L2-risk for averaged hypotheses). Consider \(\textbf{Q}\) as a random variable with which a hypothesis \(h_{\textbf{Q}}: \mathcal {X} \rightarrow \{-1,+1\}\) is associated. Defining the averaged hypothesis as \(H:= \mathbb {E}_{\textbf{Q}} [h_{\textbf{Q}}]\) and \(\text {Var}[h_{\textbf{Q}}]:= \mathbb {E}_{\textbf{Q}}[(h_{\textbf{Q}}(x) - H(x))^2]\), the L2 risk, R, satisfies

$$\begin{aligned} R(H) = \mathbb {E}_{\textbf{Q}} [R(h_{\textbf{Q}})] - \mathbb {E}_{X}[\text {Var}[h_{\textbf{Q}}(X)]] \end{aligned}$$
(B2)

Proof

Derivation can be found on pg. 62 of Wolf (2020).

The theorem highlights the fact that highly correlated trees do not take advantage of the other trees in the forest, with the overall model performing similarly to a single classifier. Alternatively, obtaining an ensemble of trees that are only slightly correlated—hence with larger \(\text {Var}[h_{\textbf{Q}}] \)—is seen to give a smaller error on unseen data (risk). Note that we are also not looking for classifiers that always contradict one another, as this would result in the first term on the RHS of Eq. B2 becoming large. Interestingly, one can also observe the phenomenon expressed in Theorem 8 by computing the variance of \(\mathbb {E}_{\textbf{Q}} [h_{\textbf{Q}}]=\frac{1}{T}\sum _{t=1}^T \textbf{Q}_t(\textbf{x}; c)\) for T trees and obtaining a result that is explicitly dependent on the correlation between classifiers.

1.1 1.   Algorithmic structure of training the QRF

Algorithm 1   Training of the \(i^{\text {th}}\) split node

In this section, we provide a step-by-step description of the QRF training algorithm that should complement the explanation given in the main text. For simplicity, we break the training algorithm into two parts: (i) the sub-algorithm of split function optimisation and (ii) the overall QRF training. We state the former in Algorithm 1.

The SVM was implemented using the sklearn Python library, which allows a custom kernel function to be supplied. Having laid out the steps for the optimisation of a split node, we can finally describe the QRF training algorithm (Algorithm 2), which incorporates Algorithm 1 as a sub-routine. Example code for the QRF model can be found at the following repository (Srikumar 2022).
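As an illustration of this interface, the sketch below passes a callable kernel to sklearn's SVC; the function estimated_kernel is a hypothetical placeholder (here a classical RBF kernel) standing in for the Nyström-QKE quantum kernel estimates used at each split node.

```python
import numpy as np
from sklearn.svm import SVC

def estimated_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: return the Gram matrix K_ij = k(a_i, b_j).

    In the QRF this would be filled with (Nystrom-approximated) quantum
    kernel estimates; an RBF kernel is used here purely as a placeholder.
    """
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists)

# sklearn's SVC accepts a callable kernel taking two data matrices and
# returning their Gram matrix, so the split-function SVM can be trained as:
X = np.random.default_rng(0).normal(size=(40, 4))
y = np.repeat([-1, 1], 20)
svm = SVC(C=1.0, kernel=estimated_kernel).fit(X, y)
predictions = svm.predict(X)
```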

1.2 2.   Embedding

In this section, we describe the specifics of the PQC architectures used for embedding classical data vectors into a quantum feature space. The importance of appropriate embeddings cannot be overstated when dealing with quantum kernels. This is evident when observing the bound on the generalisation error of quantum kernels seen in Huang et al. (2021). However, as we focus on relabelled instances in this work—discussed further in Section B 6 b—we do not compare embeddings and their relative performance on various datasets. Instead, the embeddings provide a check that the numerical observations persist even with a change in embedding.

This work looks at two commonly analysed embeddings:

  1. (i)

    Instantaneous-Quantum-Polynomial (IQP) inspired embedding, denoted as \(\Phi _{\text {IQP}}\). This embedding was proposed in Havlíček et al. (2019) and is commonly employed as an embedding that is expected to be classically intractable.

    $$\begin{aligned} |\Phi _{\text {IQP}} (x_i)\rangle = \mathcal {U}_Z (x_i) H^{\otimes n} \mathcal {U}_Z (x_i) H^{\otimes n} \, |0\rangle ^{\otimes n} \end{aligned}$$
    (B3)

    where \(H^{\otimes n}\) is the application of Hadamard gates on all qubits in parallel, and \(\mathcal {U}_Z (x_i)\) is defined as

    $$\begin{aligned} \mathcal {U}_Z (x_i) = \exp \left( i\sum _{j=1}^n x_{i,j}Z_j + i\sum _{j=1}^n \sum _{k=1}^n x_{i,j}x_{i,k} Z_j Z_k \right) \end{aligned}$$
    (B4)

    where \(x_{i,j}\) denotes the \(j^{\text {th}}\) element of the vector \(x_i \in \mathbb {R}^{D}\). This form of the IQP embedding is identical to that of Huang et al. (2021) and equivalent to the form presented in Havlíček et al. (2019). One should note that, in this work, each feature in the feature space is mean-centred and has a standard deviation of 1—see Section B 6 a for the pre-processing of datasets. A numerical sketch of this embedding and its induced kernel is given after this list.

  2. (ii)

    Hardware-efficient-ansatz (HEA) style embedding, denoted as \(\Phi _{\text {Eff}}\). The ansatz takes the form of alternating transverse rotation layers and entangling layers—where the entangling layers are CNOT gates applied on nearest-neighbour qubits. Mathematically, it has the form

    $$\begin{aligned} |\Phi _{\text {Eff}} (x_i)\rangle = \prod _{l=0}^{\mathbf {L}-1} \Bigg (\text {E}_n \mathcal {R}^Z_n\Big (\{x_{i,k} | k = 2nl + n + j \text { mod } D\}_{j=0}^{n-1}\Big ) \, \text {E}_n \mathcal {R}^Y_n\Big (\{x_{i,k} | k = 2nl + j \text { mod } D\}_{j=0}^{n-1}\Big ) \Bigg ) \, |0\rangle ^{\otimes n} \end{aligned}$$
    (B5)

    where \(\mathcal {R}^A_n (\{\theta _j\}_{j=0}^{n-1}) = \prod _{j=0}^{n-1}\exp (-i A_j \theta _j/2)\) with \(A_j\) defined as the Hermitian operator A applied to qubit j, the entanglement layer is defined as \(\text {E}_n = \prod _{j=1}^{\lfloor n/2 \rfloor } \text {CNOT}_{2j,2j+1} \prod _{j=1}^{\lfloor n/2 \rfloor } \text {CNOT}_{2j-1,2j} \), D is the dimension of the feature space and we define \(\prod _{i=1}^N A_i:= A_N... A_1\). In this work, we take \(\textbf{L}=n\) such that the number of layers scales with the number of qubits. Though this embedding may seem quite complex, the manner in which classical vectors are embedded into a quantum state is quite simple: we sequentially fill all rotational parameters in the ansatz with the entries of the vector \(x_i\)—repeating entries when required.
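To make Eqs. B3-B4 concrete, the following is a brute-force NumPy statevector sketch of the IQP embedding and its induced fidelity kernel \(k(x_i,x_j)=|\langle \Phi _{\text {IQP}}(x_i)|\Phi _{\text {IQP}}(x_j)\rangle |^2\), suitable only for small n; it is not the Cirq implementation used by the authors, and it assumes the data dimension equals the number of qubits.

```python
import numpy as np
from itertools import product

def iqp_state(x: np.ndarray, n_qubits: int) -> np.ndarray:
    """|Phi_IQP(x)> = U_Z(x) H^n U_Z(x) H^n |0...0> as a dense statevector (assumes len(x) == n_qubits)."""
    dim = 2 ** n_qubits
    # Z eigenvalues (+1/-1) of each qubit for every computational basis state.
    z = np.array([[1 - 2 * b for b in bits] for bits in product([0, 1], repeat=n_qubits)])
    xx = np.outer(x, x)
    # Diagonal of U_Z(x): exp(i [sum_j x_j Z_j + sum_{j,k} x_j x_k Z_j Z_k]) as in Eq. B4.
    phases = z @ x + np.einsum('bj,bk,jk->b', z, z, xx)
    u_z = np.exp(1j * phases)
    hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    h_n = hadamard
    for _ in range(n_qubits - 1):
        h_n = np.kron(h_n, hadamard)
    state = np.zeros(dim, dtype=complex)
    state[0] = 1.0
    return u_z * (h_n @ (u_z * (h_n @ state)))

def iqp_kernel(x1, x2, n_qubits: int = 4) -> float:
    """Fidelity kernel |<Phi(x1)|Phi(x2)>|^2 between two embedded data points."""
    s1 = iqp_state(np.asarray(x1, dtype=float), n_qubits)
    s2 = iqp_state(np.asarray(x2, dtype=float), n_qubits)
    return float(np.abs(np.vdot(s1, s2)) ** 2)

# Example usage on two standardised 4-dimensional instances:
k12 = iqp_kernel([0.2, -1.1, 0.5, 0.9], [0.3, -0.8, 0.4, 1.0], n_qubits=4)
```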

Fig. 5

The importance of selecting an appropriate penalty parameter, C, is shown above. One should note that \(C\propto 1/\lambda \) from Eq. A7, with a larger C preferring greater accuracy over larger margins. The original Fashion MNIST dataset was used with split function parameters \((T, M, L)=(1, 2048, 10)\). Training was done by sampling 180 instances and testing with a separate 120 instances. These numerical simulations indicate that the dimension of the dataset also affects where the optimal C occurs. It is therefore crucial that C is optimised for any given problem

Though both embeddings induce a kernel function that outputs a value in the range [0, 1], the regularisation term C can be quite sensitive to the embedding, as it is generally a function of the type of embedding (including the number of qubits) and the number of data points used for training. This is illustrated in Fig. 5, where we see that \(\Phi _{\text {Eff}}\) requires a higher C. We posit that increasing C down the tree allows for wider margins at the top of the tree—avoiding overfitting—with finer details distinguished towards the leaves. Demonstration of this is left for future work.

1.3 3.   Multiclass classification

A crucial decision must be made when dealing with multi-class problems. The complication arises in designating labels to train the binary split function at each node. More specifically, given the original set of classes \(\mathcal {C}\), we must determine an approach to find optimal partitions, \(\mathcal {C}^{(i)}_{-1}\cup \mathcal {C}^{(i)}_{+1} = \mathcal {C}^{(i)}\) where \(\mathcal {C}^{(i)}_{-1} \cap \mathcal {C}^{(i)}_{+1}=\emptyset \) and \(\mathcal {C}^{(i)}\) indicates the set of classes present at node i. These sets are referred to as pseudo class partitions as they define the class labels for a particular split function. The two pseudo class partitions will therefore define the branch down which instances will flow. We consider two strategies for obtaining partitions of classes given a set of classes \(\mathcal {C}^{(i)}\) at node i:

  1. (i)

    One-against-all (OAA): Randomly select \(c \in \mathcal {C}^{(i)}\) and define the partitions to be \(\mathcal {C}_{-1}^{(i)}= \{c\}\) and \(\mathcal {C}_{+1}^{(i)}= \mathcal {C}^{(i)}\backslash \{c\}\).

  2. (ii)

    Even-split (ES): Randomly construct pseudo class \(\mathcal {C}_{-1}^{(i)}\!=\!\{ c_i\}_{i=1}^{\lceil |\mathcal {C}^{(i)}|/2 \rceil } \subset \mathcal {C}^{(i)}\) and subsequently define \(\mathcal {C}_{+1}^{(i)}= \mathcal {C}^{(i)}\backslash \mathcal {C}_{-1}^{(i)}\). This has the interpretation of splitting the class set into two.

Though both methods are valid, the latter is numerically observed to have superior performance—seen in Fig. 6. The two strategies are identical when \(|\mathcal {C}|\le 3\) and the performance diverges as the size of the set of classes increases. We can therefore introduce associated class maps, \(F_{\text {OAA}}^{(i)}\) and \(F_{\text {ES}}^{(i)}\) for a node i, defined as

$$\begin{aligned} y \rightarrow \widehat{y} = F^{(i)}(y) := {\left\{ \begin{array}{ll} -1,&{} \text {if } y\in \mathcal {C}_{-1}^{(i)}\\ +1, &{} \text {if } y\in \mathcal {C}_{+1}^{(i)} \end{array}\right. } \end{aligned}$$
(B6)

where \(\widehat{y}\) is the pseudo class label for the true label y. The distinction between \(F_{\text {OAA}}^{(i)}\) and \(F_{\text {ES}}^{(i)}\) occurs with the definition of the pseudo class partitions.

Fig. 6  Illustrating the difference in performance between the two multiclass partitioning strategies: one-against-all (OAA) and even-split (ES). The original Fashion MNIST dataset was used with QRF parameters \((T, M)=(1, 1024)\). We see that by increasing the maximum depth of the tree, the QDT has the ability to improve the model on multiclass problems, with the ES partition strategy easily outperforming OAA. It should be stressed that these results are over a single tree—not an ensemble

1.4 4.   Complexity analysis

One of the motivations of the QRF is to overcome the quadratic complexity in the number of training instances faced by traditional quantum kernel estimation. In this section, we discuss the complexities of the model quoted in the main text of the paper. We start with the quantum circuit sampling complexity—or simply the sampling complexity. For kernel elements, \(K_{ij}\), that are sampled as Bernoulli trials, obtaining an error of \(\epsilon \) requires \(\mathcal {O}(\epsilon ^{-2})\) shots. We must then account for the \(N\times L\) kernel elements of the landmark columns that need to be estimated to obtain the Gram matrix for a single split node. Since each instance traverses at most \(d-1\) split nodes over T trees, we can give an upper bound on the training sampling complexity of \(\mathcal {O}\left( TL(d-1)N\epsilon ^{-2}\right) \). Testing a single instance has a complexity of \(\mathcal {O}\left( TL(d-1)\epsilon ^{-2}\right) \) as the Nyström transformation requires the inner product of the test instance with the L landmark points. Comparatively, the regular QKE algorithm requires sampling complexities of \(\mathcal {O}(N^2\epsilon ^{-2})\) and \(\mathcal {O}(S\epsilon ^{-2})\) for training and testing respectively, where S is the number of support vectors. In Peters et al. (2021), it was found that the majority of training points in SVMs with quantum kernels persisted as support vectors. Though this is largely dependent on the embedding and specific dataset, a similar characteristic was observed here, allowing us to approximate \(\mathcal {O}(N)=\mathcal {O}(S)\).
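As a rough, purely illustrative comparison of the two training bounds, they can be evaluated directly; the parameter values below are hypothetical and not taken from the paper's experiments.

```python
# Illustrative shot-count comparison (hypothetical parameters, not paper results).
N, L, T, d = 1000, 20, 5, 4       # training size, landmarks, trees, max depth
eps = 0.05                        # target kernel-element error

qrf_training_shots = T * L * (d - 1) * N / eps**2   # O(TL(d-1)N eps^-2) upper bound
qke_training_shots = N**2 / eps**2                  # O(N^2 eps^-2)

print(f"QRF upper bound: {qrf_training_shots:.2e}")
print(f"QKE:             {qke_training_shots:.2e}")
```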

It should be noted that the training complexity provided for the QRF is an upper bound. This is a result of being able to partition larger datasets between trees. For example, partitioning N points into T trees gives a training complexity of \(\mathcal {O}\left( L(d-1)N\epsilon ^{-2}\right) \)—thereby, in effect, removing the T dependence. Furthermore, it is not uncommon for leaves to appear higher up the tree, so that instances need not traverse the entire depth of the tree, giving an effective depth \(\mathbb {R}^{+}\ni d_\text {Eff}\le d\). Finally, since there are at most \(N^2\) inner products between the N instances, the sampling complexity of the QRF never exceeds the \(\mathcal {O}(N^2)\) scaling of QKE.

Though the quantum kernel estimations are currently the most expensive component, there is also the underlying classical optimisation required to train the SVM. Though there are improvements, solving the quadratic problem of Eq. A24 requires the inversion of the kernel matrix, which has complexity \(\mathcal {O}(N^3)\). In comparison, though not true of the implementation in this work (see Section B 1), the Nyström method has a complexity of \(\mathcal {O}(L^3 + L^2 N)\) (Li et al. 2015). This optimisation is then required at each split node in a QDT. For large datasets and a small number of landmark points, this is a substantial improvement—though this is a general observation, since standard SVMs are in any case infeasible for large-scale problems.

1.5 5.   Error bounds

In this section, we start by providing a proof of the generalisation error bound of the QRF model by drawing its similarity to the well-studied classical perceptron decision tree. We then walk through the proof of the error bound given in the main text for the Nyström-approximated kernel matrix. This finally allows us to bound the error of the SVM-NQKE due to finite sampling—giving an indication of the number of circuit samples required.

1.5.1 a.   Proof of Lemma 1: generalisation error bounds on the QDT model

To see the importance of margin maximisation, we draw a connection between the QRF and the perceptron decision tree (PDT) that instead employs perceptrons for split functions. The difference lies in the minimisation problem, with roughly \(||w||^2 /2 + \lambda \sum _i [1-y_i(x_i \cdot w + b)]\) minimised for SVMs as opposed to only \(\sum _i [1-y_i \sigma (x_i \cdot w + b)]\) (where \(\sigma \) is some activation function) for perceptrons. Regardless, the split function remains identical, \(f(x) = \text {sign}(x \cdot w + b)\), with optimised wb. As a result, we are able to state the following theorem that applies to both QRFs and PDTs.

Theorem 9

(Generalisation error of perceptron decision trees, Theorem 3.10 in Bennett et al. (2000)). Let \(H_J\) be the set of all PDTs composed of J decision nodes. Suppose we have a PDT, \(h\in H_J\), with geometric margins \(\gamma _1, \gamma _2,..., \gamma _J\) associated to each node. Given that m instances are correctly classified, with probability greater than \(1-\delta \), we bound the generalisation error

$$\begin{aligned} \mathcal {R}(h) \le \frac{130 r^2}{m} \Bigg ( D' \log (4me) \log (4m) + \log \frac{(4m)^{J+1} \binom{2J}{J}}{(J+1) \delta } \Bigg ) \end{aligned}$$
(B7)

where \(D'=\sum _{i=1}^J 1/\gamma _i^2\) and r is the radius of a sphere containing the support of the underlying distribution of x.

This theorem is obtained from an analysis of the fat-shattering dimension (V’yugin 2015) of hypotheses of the form \(\{f: f(x)=[\langle w, x\rangle + b>0]\}\). See Bennett et al. (2000) for a proof of this theorem. Now, using Stirling’s approximation on Eq. B7,

$$\begin{aligned} \mathcal {R}(h)\le \widetilde{\mathcal {O}}\left( \frac{r^2}{m} \left[ \log (4m)^2 \sum _{i=1}^J \gamma _i^{-2} + J \log \left( 4mJ^2\right) \right] \right) \end{aligned}$$
(B8)

where \(\widetilde{\mathcal {O}}(\cdot )\) hides the additional \(\log \) terms. In the context of the QRF model, both x and w lie in \(\mathcal {H}\), where the normalisation condition dictates \(||x|| = 1 =: r\). In Section A 2, it was seen that 1/||w|| is proportional to the geometric margin for an optimised SVM (note that this is not true for the PDT in Theorem 9), allowing us to write

$$\begin{aligned} \mathcal {R}(h) \le \widetilde{\mathcal {O}}\left( \frac{1}{m} \left[ \log (4m)^2 \sum _{i=1}^J ||w^{(i)}||^2 + J \log \left( 4mJ^2\right) \right] \right) \end{aligned}$$
(B9)

We are now required to obtain a bound for ||w|| that incorporates the kernel matrix. Using the definition of w from Eq. A12, we can write \(||w||^2 = \sum _{i,j}\alpha _i \alpha _j y_i y_j K_{ij}\). However, the dual parameters \(\alpha _i\) do not have a closed-form expression and will therefore not provide any further insight in bounding the generalisation error. The equivalent formulation of the SVM problem in terms of the hinge loss is also not helpful as it is non-differentiable. Instead, rather than following the optimisation strategy of Section A 2, we assume minimisation of the following mean squared error (MSE), which gives a kernel ridge regression formulation

$$\begin{aligned} \min _{\mathbf {w'}}\lambda ||\mathbf {w'}||^2 + \frac{1}{2}\sum _{i=1}^N \left( \mathbf {w'}^\top \cdot \phi (x_i) - y_i\right) ^2 \end{aligned}$$
(B10)

where we redefine \(w\cdot \phi '(x)+b\) as \(\textbf{w}\cdot \phi (x)\) with \(\textbf{w}:= [w^\top , b]^\top \) and \(\phi (x):=[\phi '(x)^\top ,1]^\top \) to simplify expressions. Importantly, the optimisation problem of Eq. B10 is of the form presented in Eq. A29, with the representer theorem still applicable. Taking the gradient with respect to \(\textbf{w}\) and solving for zero using the identity \((\lambda I + BA)^{-1}B = B(\lambda I + AB)^{-1}\), we get \(\textbf{w}_{\text {MSE}} = \Phi ^\top (K + \lambda I)^{-1} Y\). This gives \(||\textbf{w}_{\text {MSE}}||^2 = Y^\top (K + \lambda I)^{-1} K (K + \lambda I)^{-1} Y\), which is simplified in Huang et al. (2021) and Wang et al. (2021) by taking the regularisation parameter \(\lambda \rightarrow 0\), placing greater importance on instances being classified correctly. However, this approximation, which gives \(||\textbf{w}_{\text {MSE}}||^2 = Y^\top K^{-1} Y\), has problems of invertibility given a singular kernel matrix—which in practice is common. Furthermore, in the case of the Nyström approximation, we have \(\text {det}[\widetilde{K}]=0\) by construction, making a non-zero regularisation parameter crucial to the expression. We therefore replace the inverse with the Moore-Penrose inverse, giving \(||\textbf{w}_{\text {MSE}}||^2 = |Y^\top K^+ Y|\) as \(\lambda \rightarrow 0\). Since 1/||w|| is proportional to the geometric margin and an SVM optimises for a maximum-margin hyperplane, we can state that \(||\textbf{w}_{\text {SVM}}|| \le ||\textbf{w}_{\text {MSE}}||\) as \(\lambda \rightarrow 0\) and subsequently conclude the proof with

$$\begin{aligned} \mathcal {R}(h) \!\le \! \widetilde{\mathcal {O}}\left( \frac{1}{m} \left[ \!\log (4m)^2 \sum _{i=1}^J \left| Y^{(i)} (K^{(i)})^+ Y^{(i)\top }\right| \!+\! J \log \left( 4mJ^2\right) \!\right] \right) , \end{aligned}$$
(B11)

where \(K^{(i)}\) and \(Y^{(i)}\) are respectively the kernel matrix and labels given to node i in the tree. We therefore arrive at Lemma 1 by expanding the matrix multiplication as a sum over matrix elements \((K^{(i)})^+_{jl}\) and labels \(\{y^{(i)}_j\}_{j=1}^{N^{(i)}}\).
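The per-node quantity \(|Y^{(i)\top } (K^{(i)})^+ Y^{(i)}|\) entering this bound is straightforward to evaluate numerically. The following is a small sketch; the random kernel matrix and labels are purely illustrative stand-ins for a node's Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative PSD kernel matrix and +/-1 labels for a single split node.
X = rng.normal(size=(50, 4))
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
Y = rng.choice([-1.0, 1.0], size=50)

# Margin-based model complexity |Y^T K^+ Y| using the Moore-Penrose inverse.
complexity = abs(Y @ np.linalg.pinv(K) @ Y)
print(complexity)
```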

One should note that the possibility of a large margin is dependent on the distribution of data instances—hence losing the distribution-independent bound seen in results such as Theorem 6. Nonetheless, this theorem is remarkable in that the generalisation bound does not depend on the dimension of the feature space but rather on the margins produced. This is in fact a common strategy for bounding kernel methods, as there are cases in which kernels represent inner products of vectors in an infinite-dimensional space. In the case of quantum kernels, bounds based on the Vapnik-Chervonenkis (VC) dimension would grow exponentially with the number of qubits. The result of Lemma 1 therefore suggests that large margins are analogous to working in a lower VC class.

1.5.2 b.   Proof of Lemma 2: error bounds on the difference between optimal and estimated kernel matrices

In the NQKE case, we only have noisy estimates of the kernel elements \(W_{ij}\rightarrow \widetilde{W}_{ij}\) and \(B_{ij}\rightarrow \widetilde{B}_{ij}\). Furthermore, the Nyström method approximates a large section of the matrix with the final kernel having the form

$$\begin{aligned} \widetilde{K} := \begin{bmatrix} \widetilde{W} &{} \widetilde{B} \\ \widetilde{B}^\top &{} \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B} \end{bmatrix}. \end{aligned}$$
(B12)

To bound the error between exact and estimated matrices, we start by explicitly writing out \(|| K - \widetilde{K} ||_2\) and using the triangle inequality to get the following:

$$\begin{aligned} || K - \widetilde{K} ||_2&= \left\| \begin{bmatrix} W - \widetilde{W} &{} B - \widetilde{B} \\ B^\top - \widetilde{B}^\top &{} C - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B} \end{bmatrix}\right\| _2 \end{aligned}$$
(B13)
$$\begin{aligned}&\le \left\| \begin{bmatrix} W - \widetilde{W} &{} 0 \\ 0 &{} 0 \end{bmatrix}\right\| _2 + \left\| \begin{bmatrix} 0 &{} B - \widetilde{B} \\ B^\top - \widetilde{B}^\top &{} 0 \end{bmatrix}\right\| _2 \nonumber \\&+ \left\| \begin{bmatrix} 0 &{} 0 \\ 0 &{} C - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B} \end{bmatrix}\right\| _2 \end{aligned}$$
(B14)

where C is the exact \((N-L)\times (N-L)\) kernel matrix of the set of data points not selected as landmark points, i.e. \(C_{ij} = k(w_i, w_j )\) for \(w_i \in \mathcal {L}'\). The first and second terms reduce to \(\left\Vert W - \widetilde{W}\right\Vert _2\) and \(\left\Vert B - \widetilde{B}\right\Vert _2\) respectively. The last term reduces to \(\left\Vert C - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2\) and can be bounded further as

$$\begin{aligned} \left\Vert C - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2&\!=\! \left\Vert (C \!-\! B^\top W^{-1} B) \!+\! (B^\top W^{-1} B \!-\! \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B})\right\Vert _2 \end{aligned}$$
(B15)
$$\begin{aligned}&\le \left\Vert C - B^\top W^{-1} B\right\Vert _2 \nonumber \\&+ \left\Vert B^\top W^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2 \end{aligned}$$
(B16)

The first term is precisely the classical bound on the Nyström approximation provided in Eq. A51. We now obtain probabilistic bounds on the three terms \(\left\Vert W - \widetilde{W}\right\Vert _2\), \(\left\Vert B - \widetilde{B}\right\Vert _2\) and \(\left\Vert B^\top W^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2\) as a result of the quantum kernel estimations. In this work, the estimation error occurs as a result of finite sampling error for each element in the kernel matrix. We can subsequently derive a bound on the normed difference between expected and estimated kernels by bounding the Bernoulli distribution of individual matrix element estimations. However, first, we construct a bound on the spectral norm of the error in the kernel matrix through the following lemma.

Lemma 4

Given a matrix \(A\in \mathbb {R}^{d_1\times d_2}\) and the operator 2-norm (spectral norm) \(||\cdot ||_2\), we have,

$$\begin{aligned} \Pr \big \{\left\Vert A\right\Vert _2 \ge \delta \big \} \le \sum _{i=1}^{d_1} \sum _{j=1}^{d_2} \Pr \Bigg \{|A_{ij}| \ge \frac{\delta }{\sqrt{d_1 d_2}} \Bigg \} \end{aligned}$$
(B17)

where the matrix elements \(A_{ij}\) of A are independently distributed random variables.

Proof

We start by using the inequality \(\left\Vert A\right\Vert _2 \le \left\Vert A\right\Vert _\text {F}\) where \(\left\Vert \cdot \right\Vert _\text {F}\) refers to the Frobenius norm

$$\begin{aligned} \Pr \big \{\left\Vert A\right\Vert _2 \ge \delta \big \}&\le \Pr \big \{\left\Vert A\right\Vert _\text {F} \ge \delta \big \} \end{aligned}$$
(B18)
$$\begin{aligned}&= \Pr \Bigg \{\sum _{i=1}^{d_1} \sum _{j=1}^{d_2} |A_{ij}|^2 \ge \delta ^2\Bigg \} \end{aligned}$$
(B19)
$$\begin{aligned}&\le \Pr \Bigg \{\bigcup _{i=1}^{d_1}\bigcup _{j=1}^{d_2} \Big ( |A_{ij}|^2 \ge \frac{\delta ^2}{d_1 d_2} \Big )\Bigg \} \end{aligned}$$
(B20)
$$\begin{aligned}&\le \sum _{i=1}^{d_1} \sum _{j=1}^{d_2} \Pr \Bigg \{|A_{ij}| \ge \frac{\delta }{\sqrt{d_1 d_2}} \Bigg \} \end{aligned}$$
(B21)

where the second line uses the definition of the Frobenius norm, and the third line uses the fact that of the \(d_1 d_2\) summed terms, at least one must be greater than \(\delta ^2 / d_1 d_2\). The last inequality utilises the union bound. \(\square \)

To bound the RHS of Eq. B17, we use the well-known Hoeffding inequality.

Theorem 10

(Hoeffding’s inequality) Let \(X_1,..., X_M\) be real bounded independent random variables such that \(a_i \le X_i \le b_i\) for all \(i=1,...,M\). Then, for any \(t>0\), we have

$$\begin{aligned} \Pr \Big \{ \Big |\sum _{i=1}^M X_i - \mu \Big |\ge t\Big \} \le 2 \exp \Bigg ( -\frac{2t^2}{\sum _{i=1}^M (b_i - a_i)^2} \Bigg ) \end{aligned}$$
(B22)

where \(\mu =\mathbb {E} \sum _{i=1}^M X_i\).

Utilising Lemma 4 and Theorem 10, with probability at least \(1-\Lambda \), we have the following bounds:

$$\begin{aligned} \left\Vert W - \widetilde{W}\right\Vert _2&\le L \sqrt{\frac{1}{2M} \log \Big (\frac{2L^2}{\Lambda }\Big )} \end{aligned}$$
(B23)
$$\begin{aligned} \left\Vert B - \widetilde{B}\right\Vert _2&\le \sqrt{\frac{NL}{2M} \log \Big (\frac{2NL}{\Lambda }\Big )} \end{aligned}$$
(B24)

The last term, \(\left\Vert B^\top W^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2\), is a bit more involved as we have matrix multiplication of random matrices. Furthermore, the existence of inverse matrices renders Hoeffding’s inequality invalid, as the elements of an inverse random matrix are not, in general, independent. Hence, we expand this term to give

$$\begin{aligned}&\left\Vert B^\top W^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2\nonumber \\&= \left\Vert B^\top \left( W^{-1} - \widetilde{W}^{-1}\right) B +B^\top \widetilde{W}^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2 \end{aligned}$$
(B25)
$$\begin{aligned}&\le \left\Vert B^\top \left( W^{-1} - \widetilde{W}^{-1}\right) B\right\Vert _2 +\left\Vert B^\top \widetilde{W}^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2 \end{aligned}$$
(B26)
$$\begin{aligned}&\le \left\Vert W^{-1} - \widetilde{W}^{-1}\right\Vert _2\left\Vert B\right\Vert _2^2 +\left\Vert B^\top \widetilde{W}^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2 \end{aligned}$$
(B27)

where the last inequality uses the fact that for symmetric Y, we have the inequality \(\left\Vert Z^\top YZ\right\Vert _2 \le \left\Vert Y\right\Vert _2 \left\Vert Z\right\Vert _2^2\), which can be shown by diagonalising Y and using the fact that \(\left\Vert ZZ^\top \right\Vert _2 = \left\Vert Z\right\Vert _2^2\). We first look at bounding the second term, which will also help illustrate the process by which Eqs. B23 and B24 were obtained. We can expand the matrix multiplication such that we aim to bound the following:

$$\begin{aligned}&\Pr \Big \{ \Big |\Big (B^\top \widetilde{W}^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B} \Big )_{ij}\Big |\ge \delta '\Big \} \nonumber \\&= \Pr \Bigg \{ \Bigg |\mu _{ij}^{\widetilde{W}} - \sum _{l, k =1}^L \widetilde{B}_{ki} \widetilde{W}^{-1}_{kl}\widetilde{B}_{lj} \Bigg |\ge \delta '\Bigg \} \end{aligned}$$
(B28)

where for a specific \(\widetilde{W}\) we have \(\mu _{ij}^{\widetilde{W}}= \mathbb {E} \sum _{l, k =1}^L \widetilde{B}_{ki} \widetilde{W}^{-1}_{kl}\widetilde{B}_{lj}\) \(= (B^\top \widetilde{W}^{-1} B)_{ij}\). Since \(\widetilde{B}_{ij}\) is estimated from M Bernoulli trials, we can write \(\widetilde{B}_{ij} = (1/M) \sum _{m=1}^M \widetilde{B}_{ij}^{(m)}\) where \(\widetilde{B}_{ij}^{(m)}\in \{0,1\}\). It is crucial to note that the inverse matrix \(\widetilde{W}^{-1}_{ij}\) is fixed over the expression above—as intended from Eq. B27. We therefore take the elements \(\{\widetilde{W}^{-1}_{ij}\}_{i,j=1}^{L}\) to be constant variables bounded by \(0\le \widetilde{W}^{-1}_{ij}\le r_{\widetilde{W}}\) for all \(i,j=1,..., L\) and now expand the expression in Eq. B28:

$$\begin{aligned}&\Pr \Bigg \{ \Bigg |\mu _{ij} - \sum _{l, k =1}^L \widetilde{B}_{ki} \widetilde{W}^{-1}_{kl}\widetilde{B}_{lj} \Bigg |\ge \delta '\Bigg \} \nonumber \\&= \Pr \Bigg \{ \Bigg |\mu _{ij} - \sum _{l, k =1}^L \frac{1}{M^2}\Bigg (\sum _{m_1 =1}^M \widetilde{B}_{ki}^{(m_1)}\Bigg ) \widetilde{W}^{-1}_{kl}\Bigg (\sum _{m_2 =1}^M \widetilde{B}_{lj}^{(m_2)}\Bigg ) \Bigg |\ge \delta '\Bigg \} \end{aligned}$$
(B29)
$$\begin{aligned}&:= \Pr \Bigg \{ \Bigg |\mu _{ij} - \frac{1}{M^2}\sum _{l, k =1}^L \sum _{m=1}^{M^2} \omega ^{(m)}_{iklj} \Bigg |\ge \delta '\Bigg \}, \end{aligned}$$
(B30)

where in the second line we have expanded the sums over \(m_1\) and \(m_2\) into a single sum over \(M^2\) terms. The random variables \(\omega ^{(m)}_{iklj}\) are bounded by \(r_{\widetilde{W}}\), and we utilise the concentration inequality in Theorem 10 to bound the term above:

$$\begin{aligned}&\Pr \Bigg \{ \Bigg |\mu _{ij} - \sum _{l, k =1}^L \sum _{m=1}^{M^2} \frac{\omega ^{(m)}_{iklj}}{M^2} \Bigg |\ge \delta '\Bigg \} \nonumber \\&\le 2 \exp \Bigg (-\frac{2 \delta '^2}{\sum _{l, k =1}^L \sum _{m=1}^{M^2} (r_{\widetilde{W}}/M^2)^2} \Bigg ) \end{aligned}$$
(B31)
$$\begin{aligned}&= 2 \exp \Bigg (-\frac{2 M^2 \delta '^2}{r_{\widetilde{W}}^2 L^2} \Bigg ) . \end{aligned}$$
(B32)

We therefore have a bound on \(\left\Vert B^\top \widetilde{W}^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2\) by utilising Lemma 4 and Eqs. B28–B32, to give

$$\begin{aligned}&\Pr \Big ( \left\Vert B^\top \widetilde{W}^{-1} B - \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2 \ge \delta \Big ) \nonumber \\&\le 2(N-L)^2 \exp \Bigg ( -\frac{2M^2 \delta ^2}{r_{\widetilde{W}}^2 L^2 (N-L)^2}\Bigg ). \end{aligned}$$
(B33)

Letting the RHS be equal to \(\Lambda \) and solving for \(\delta \), one can show with probability \(1-\Lambda \),

$$\begin{aligned} \left\Vert B^\top \widetilde{W}^{-1} B \!-\! \widetilde{B}^\top \widetilde{W}^{-1} \widetilde{B}\right\Vert _2 \!\le \! \frac{r_{\widetilde{W}}L(N-L)}{M}\sqrt{\frac{1}{2} \log \Bigg ( \frac{2(N-L)^2}{\Lambda }\Bigg )} . \end{aligned}$$
(B34)

This bounds the second term in Eq. B27, leaving us with \(\left\Vert W^{-1} - \widetilde{W}^{-1}\right\Vert _2\), which can be bounded as

$$\begin{aligned} \left\Vert W^{-1} - \widetilde{W}^{-1}\right\Vert _2&\le c_{W^{-1}} c_{\widetilde{W}^{-1}}\left\Vert W-\widetilde{W}\right\Vert _2 \\&\le c_{W^{-1}} c_{\widetilde{W}^{-1}}\sqrt{\frac{L^2}{2M} \log \left( \frac{2L^2}{\Lambda }\right) } \end{aligned}$$
(B35)

where we define \(c_A:=\left\Vert A\right\Vert _2\) for a given matrix A, with \(A^{-1}\) being the pseudo-inverse when A is singular. The first inequality uses the fact that \(\left\Vert W^{-1} - \widetilde{W}^{-1}\right\Vert = \left\Vert W^{-1} (\widetilde{W}-W) \widetilde{W}^{-1}\right\Vert \le \left\Vert W^{-1}\right\Vert \left\Vert \widetilde{W}^{-1}\right\Vert \left\Vert \widetilde{W}-W\right\Vert \) and the second inequality uses the bound in Eq. B23. Now, putting together the bounds in Eqs. B13, B16, B23, B24, B27, B34, B35 and A51, we have

$$\begin{aligned} || K - \widetilde{K} ||_2 \le L \sqrt{\frac{1}{2M} \log \Bigg (\frac{2L^2}{\Lambda }\Bigg )}&+ \sqrt{\frac{NL}{2M} \log \Bigg (\frac{2NL}{\Lambda }\Bigg )} \nonumber \\&+ \frac{r_{\widetilde{W}}L(N-L)}{M}\sqrt{\frac{1}{2} \log \Bigg ( \frac{2(N-L)^2}{\Lambda }\Bigg )} \end{aligned}$$
(B36)
$$\begin{aligned}&+c_B^2 c_{W^{-1}} c_{\widetilde{W}^{-1}}\sqrt{\frac{L^2}{2M} \log \left( \frac{2L^2}{\Lambda }\right) } + \mathcal {O}\Bigg (\frac{N}{\sqrt{L}}\Bigg ) \nonumber \\ \implies || K - \widetilde{K} ||_2 \le \widetilde{\mathcal {O}} \Bigg (\frac{NL}{M} +&\frac{N}{\sqrt{L}}\Bigg ) \end{aligned}$$
(B37)

where in the second line we use the fact that \(N\gg L\) and the notation \(\widetilde{\mathcal {O}}\) to hide \(\log \) dependencies. Furthermore, Eq. B37 assumes that the matrix norms \(c_B, c_{W^{-1}},c_{\widetilde{W}^{-1}}\) do not depend on the system dimension—i.e. L and N. In the cases of \(c_{W^{-1}}\) and \(c_{\widetilde{W}^{-1}}\), since we are really taking the pseudo-inverse by selecting a rank-k approximation of W with the k largest positive eigenvalues, it is possible, in some sense, to retain control over these terms. The norm of B, on the other hand, is entirely dependent on the dataset. The worst-case scenario suggests a bound using the Frobenius norm, \(c_B\le ||B||_F =\sqrt{\sum _{i}^N\sum _{j}^L|B_{ij}|^2} \approx \sqrt{L(N-L)\overline{k}^2}\), where \(\overline{k}\) is the average kernel element. This is however a loose upper bound that provides no further insight, and it is often the case that \(\overline{k}\ll 1/N\) for quantum kernels.

One of the main objectives of Eq. B37 is to understand the sampling complexity required to bound the error between estimated and exact kernels. We find that to ensure this error is bounded, we require \(M\sim NL\). However, this does not account for the potentially vanishing kernel elements that we discuss further in Section B 8. Equation B37 also highlights the competing interests of increasing L to obtain a better Nyström approximation, while at the same time benefiting from decreasing L to reduce the error introduced by finitely sampling many elements.
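As a rough numerical sanity check of this behaviour, one can inject binomial shot noise into the blocks W and B of a toy kernel matrix and measure \(\Vert K - \widetilde{K}\Vert _2\) directly. The sketch below uses an illustrative classical Gaussian kernel in place of quantum estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

def toy_kernel(X):
    """Gaussian kernel standing in for a quantum kernel with values in [0, 1]."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2)

N, L, M = 200, 20, 2000                      # instances, landmarks, shots per element
X = rng.normal(size=(N, 2))
K = toy_kernel(X)

W, B = K[:L, :L], K[:L, L:]                  # landmark block and cross block
W_hat = rng.binomial(M, W) / M               # finite-sampling (shot-noise) estimates
B_hat = rng.binomial(M, B) / M

W_pinv = np.linalg.pinv(W_hat)
K_tilde = np.block([[W_hat, B_hat],
                    [B_hat.T, B_hat.T @ W_pinv @ B_hat]])

print(np.linalg.norm(K - K_tilde, 2))        # compare against ~O(NL/M + N/sqrt(L))
```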

1.5.3 c.   Proof of Lemma 3: prediction error of SVM-NQKE due to finite sampling

In the previous section, we obtained a bound on \(\left\Vert K-\widetilde{K}\right\Vert \) that suggested \(M\sim NL\) is required to bound this term with respect to the error introduced by finite sampling. This however does not identify the samples required in order to bound the error on the output of the SVM. In the noiseless case with Nyström approximation, given that we have a hypothesis of the form \(h(x) = \text {sign}\left( \sum _{i=1}^N \alpha _i k(x, x_i)\right) \), where \(\alpha \) is obtained at training, let \(f(x)=\sum _{i=1}^N \alpha _i k(x, x_i)\) so that \(h(x) = \text {sign}\left( f(x)\right) \). We equivalently let the SVM constructed through the estimated kernel be denoted as \(\widetilde{h}(x) = \text {sign}\left( \widetilde{f}(x)\right) \) where \(\widetilde{f}(x)=\sum _{i=1}^N \alpha _i' k'(x, x_i)\). In this section, closely following Liu et al. (2021), we obtain a bound on \(|f(x)-\widetilde{f}(x)|\) that will elucidate the samples required to suppress the shot noise of estimation. We start by expanding the following:

$$\begin{aligned} |f(x)-\widetilde{f}(x)|= & {} \left|\sum _{i=1}^N \alpha _i^{\prime } k^{\prime }(x, x_i) - \sum _{i=1}^N \alpha _i k(x, x_i) \right|\end{aligned}$$
(B38)
$$\begin{aligned}\le & {} \sum _{i=1}^N \left|\alpha _i^{\prime } k^{\prime }(x, x_i) - \alpha _i k(x, x_i) \right|\end{aligned}$$
(B39)
$$\begin{aligned}= & {} \sum _{i=1}^N |(\alpha _i^{\prime }-\alpha _i) \left[ k^{\prime }(x, x_i)-k(x, x_i)\right] \nonumber \\+ & {} \alpha _i \left[ k^{\prime }(x, x_i)\!-\!k(x, x_i)\right] \!+\! (\alpha _i^{\prime }\!-\!\alpha _i) k(x, x_i)|\end{aligned}$$
(B40)
$$\begin{aligned}\le & {} \left\Vert \alpha \right\Vert _2 \cdot \left\Vert \nu (x)\right\Vert _2 + \left\Vert \alpha ^{\prime }-\alpha \right\Vert _2 \cdot \left\Vert \nu (x)\right\Vert _2\nonumber \\+ & {} \sqrt{N}\left\Vert \alpha ^{\prime }-\alpha \right\Vert _2 , \end{aligned}$$
(B41)

where we let \(\nu _i(x) =\nu _i= k'(x, x_i)-k(x, x_i)\) and where the last line uses the Cauchy-Schwarz inequality and the fact that \(k(\cdot ,\cdot )\le 1 \). The term \(\left\Vert \nu \right\Vert _2\) is bounded in a similar fashion to Eq. B26, as the kernel function is passed through the Nyström feature map (NFM). This gives \(\left\Vert \nu \right\Vert _2 \le \mathcal {O}(\sqrt{L^2/2M}+ NL/M)\) where we assume \(c_{k}, c_{W^{-1}},c_{\widetilde{W}^{-1}}\) to be bounded independent of N. Both terms, \(\left\Vert \alpha \right\Vert _2\) and \(\left\Vert \alpha '-\alpha \right\Vert _2 \), need slightly more work.

The SVM dual problem in Eq. A10 can be written in the following quadratic form for \(\mu =0\) with regularisation \(\lambda \):

$$\begin{aligned} \text {minimise \ \ }&\frac{1}{2}\alpha ^\top \left( Q + \frac{1}{\lambda }\mathbb {I}\right) \alpha - 1^\top \alpha \nonumber \\ \text {subject to: \ \ }&\alpha _i \ge 0, \ \forall i=1, \ldots , N. \end{aligned}$$
(B42)

where \(Q_{ij}=y_i y_j K_{ij}\). This allows us to use the following robustness Lemma on quadratic programs.

Lemma 5

(Daniel (1973), Theorem 2.1) Given a quadratic program of the form in Eq. B42, let \(Q'\) be a perturbation of Q, with solutions \(\alpha '\) and \(\alpha \) respectively. Given that \(\left\Vert Q'-Q\right\Vert _\text {F}\le \epsilon < \lambda _{\text {min}}\), where \(\lambda _{\text {min}}\) is the minimum eigenvalue of \(Q+\frac{1}{\lambda }\mathbb {I}\), then

$$\begin{aligned} \left\Vert \alpha '-\alpha \right\Vert _2 \le \epsilon (\lambda _{\text {min}}- \epsilon )^{-1}\left\Vert \alpha \right\Vert _2. \end{aligned}$$
(B43)

Note that this lemma assumes small perturbations and breaks down when \(\epsilon \ge \lambda _{\text {min}}\). It can be shown from analysing the solutions of the KKT conditions that \(\mathbb {E}[\left\Vert \alpha \right\Vert _2^2] = \mathcal {O}(N^{2/3})\)—see Liu et al. (2021) (Lemma 16) for further details. This finally leaves us with bounding \(\left\Vert Q' - Q\right\Vert _\text {F}\). We know

$$\begin{aligned} \left\Vert Q' - Q\right\Vert _\text {F}^2&= \sum _{ij} |Q'_{ij} - Q_{ij}|^2 = \sum _{ij} |y_i y_j K'_{ij} - y_i y_j K_{ij}|^2 \end{aligned}$$
(B44)
$$\begin{aligned}&= \sum _{ij} |K'_{ij} - K_{ij}|^2 = \left\Vert K' - K\right\Vert _\text {F}^2. \end{aligned}$$
(B45)

Since we are bounding the error between estimated and exact matrices (both including the Nyström approximation), we in fact require the bound on \(\left\Vert \widetilde{K} - \widehat{K}\right\Vert _\text {F}\), with \(\widehat{K}\) and \(\widetilde{K}\) from Eqs. A45 and B12, i.e. we have \(K=\widehat{K}\) and \(K'=\widetilde{K}\). We compute this in Section B 5 b to be \(\left\Vert K' - K\right\Vert _\text {F}\le \mathcal {O}\left( \frac{NL}{M} + \sqrt{\frac{NL}{M}}\right) \)—noting that we always weakly bound the spectral norm with the Frobenius norm, \(\left\Vert \cdot \right\Vert _2\le \left\Vert \cdot \right\Vert _\text {F}\). Therefore, using Lemma 5 with the fact that \(\lambda _\text {min}\ge 1/\lambda \) for constant \(\lambda \), we have

$$\begin{aligned} \left\Vert \alpha '-\alpha \right\Vert _2 \le \mathcal {O}\left( \frac{N^{5/6}\sqrt{L}}{\sqrt{M}}\right) . \end{aligned}$$
(B46)

Substituting these results into Eq. B41, we obtain the bound on the error of the model output

$$\begin{aligned} |f(x)-\widetilde{f}(x)| \le \mathcal {O}\left( \frac{N^{4/3}\sqrt{L}}{\sqrt{M}} \right) = \mathcal {O}\left( \sqrt{\frac{N^{8/3}L}{M}} \right) , \end{aligned}$$
(B47)

as required. Therefore, we see that \(M\sim N^3 L\) is sufficient to suppress error in the model output. Note that Eq. B47 only shows the term most difficult to suppress with respect to M.

1.6 6.   Processing datasets for numeric analysis

This work observes the performance of the QRF on five publicly available datasets. In certain cases, a continuous feature was selected as the class label by introducing a separating point in \(\mathbb {R}\) so as to split the number of data points evenly in two. A summary of the original datasets is presented below.

Datasets

  • Fashion MNIST (\(\mathcal {D}_{\text {FM}}\)): 70,000 points, \(|\mathcal {X}|=28\times 28\). Originally 10 classes with 7000 per class; the binary class transformation uses class 0 (top) and 3 (dress). Source: Zalando’s article images (Xiao et al. 2017).

  • Breast cancer (\(\mathcal {D}_{\text {BC}}\)): 569 points, \(|\mathcal {X}|=30\). This is a binary class dataset. Source: UCI ML Repository (Dua and Graff 2017).

  • Heart disease (\(\mathcal {D}_{\text {H}}\)): 303 points, \(|\mathcal {X}|=13\). The class is defined as 0 and 1 for when there is respectively a less than or greater than \(50\%\) chance of disease. Source: UCI ML Repository (Dua and Graff 2017).

  • Ionosphere (\(\mathcal {D}_{\text {Io}}\)): 351 points, \(|\mathcal {X}|=34\). This is a binary class dataset. Source: UCI ML Repository (Dua and Graff 2017).

Note: \(|\mathcal {X}|\) indicates the dimension of the feature space after accounting for any removal of categorical features and the possible use of a feature to create class labels. Finally, the analysis of various QRF parameters uses the Fashion MNIST dataset. To ensure the analysis has tractable computation, 300 points are sampled for each of training and testing (unless otherwise stated)—which partially accounts for the variance in accuracies observed

1.6.1 a.   Data pre-processing

Given the original dataset \(\mathcal {S}_{\text {tot}}'' = \{ (x_i'', y_i)\}_{i=1}^N\), where \(x_i'' \in \mathbb {R}^{D'}\) and \(y_i \in \{-1, +1\}\), in this section we show the pre-processing steps undertaken prior to the QRF learning process. The first step mean-centres each of the features of the dataset and employs principal component analysis (PCA) (Pearson 1901) to obtain \(D<D'\) dimensional points, giving the dataset \(\mathcal {S}_{\text {tot}}'= \{ (x_i', y_i)\}_{i=1}^N\) where \(x' \in \mathbb {R}^D\). In practice, this dimensionality reduction technique is a common pre-processing step for both classical and quantum ML. The technique selects the D most important directions in which the data is spread and subsequently projects the data points onto the subspace spanned by those directions—see Mohri and Rostamizadeh (2019) for more details. This gives us control over the dimensionality of the data given to the QRF—this being especially important with the \(\Phi _{\text {IQP}}\) embedding over n qubits requiring an exact dimension of n. Furthermore, PCA relies on the principle that data often lies in a lower-dimensional manifold (Goodfellow et al. 2016), and hence a PCA reduction often aids the learning model by removing noise. It therefore becomes crucial to include the pre-processing step of a PCA reduction when comparing learning models. This underlies the observation in the main text where we see a degradation in performance when increasing the number of qubits for the Fashion MNIST data. Though it may seem as though we can make a statement about the ineffectiveness of quantum kernels with increasing qubit numbers, this misses the effect of PCA in the pre-processing step. This is why both the classical and quantum models struggle when D is increased.

The second step in pre-processing is unique to quantum models. Before data can be passed to the QRF algorithm, we are required to process the classical vectors \(x' \in \mathbb {R}^D\) so that features can be effectively mapped to single-qubit rotations. This is done by observing each dimension (feature), \(\Omega _i\), of the space consumed by the set of all data points (including training and testing) and normalising so that the range of values lies between 0 and \(\pi \). Mathematically, we make the transformation \(x_i' \rightarrow x_i = \pi (x_i' - r^{\text {min}}_i) /(r^{\text {max}}_i - r^{\text {min}}_i)\), where \(r^{\text {min}}_i = \min \{x_i' | (x', y) \in \mathcal {S}_{\text {tot}}' \}\) and \(r^{\text {max}}_i\) is defined similarly. The final dataset is therefore given by \(\mathcal {S}_{\text {tot}} = \{ (x_i, y_i)\}_{i=1}^N\) where \(x_i \in [0, \pi ]^{D}\) and \(y_i \in \{-1, +1\}\). The embedding into half a rotation, \([0, \pi ]\), as opposed to full rotations of \( [0, 2\pi ]\) or \( [-\pi , \pi ]\), ensures that the two ends of the feature spectrum, \(r^{\text {min}}\) and \(r^{\text {max}}\), are not mapped to similar quantum states.
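A minimal sketch of this two-step pipeline is given below. The use of scikit-learn's PCA is an assumption of convenience here; any mean-centring PCA implementation would do.

```python
import numpy as np
from sklearn.decomposition import PCA


def preprocess(X, n_components):
    """Mean-centre, reduce to D dimensions with PCA, then rescale each
    feature to the interval [0, pi] for use as single-qubit rotation angles."""
    X_reduced = PCA(n_components=n_components).fit_transform(X)  # PCA mean-centres internally
    r_min = X_reduced.min(axis=0)
    r_max = X_reduced.max(axis=0)
    return np.pi * (X_reduced - r_min) / (r_max - r_min)


# Example with random stand-in data (illustrative only).
X = np.random.default_rng(0).normal(size=(300, 30))
X_angles = preprocess(X, n_components=4)
print(X_angles.min(), X_angles.max())   # 0.0 and pi
```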

1.6.2 b.   Relabelling datasets for optimal quantum kernel performance, \(\mathscr {R}_{\Phi }^{\text {QK}}\)

To demonstrate that QML is able to achieve an advantage over its classical counterpart, a crucial first step would be to show that there at least exist artificial datasets in which quantum models outperform classical methods. We generate such a dataset by relabelling a set of real data points so that a classical kernel struggles and a quantum kernel does not. This is done by maximising the model complexity—and hence generalisation error—of the classical kernel while at the same time minimising the model complexity of the quantum kernel. This is precisely the strategy followed in Huang et al. (2021), where the regime of potential quantum advantage was posited to be related to—what the authors referred to as—the geometric difference, G, between quantum and classical models. This quantity was shown to provide an upper bound to the ratio of model complexities—more specifically, \(G_{CQ} \ge \sqrt{s_C / s_Q}\). The task of relabelling is therefore to solve the following optimisation problem:

$$\begin{aligned} Y^* = \mathop {\mathrm {arg\,max}}\limits _{Y\in \{0,1\}^N} \frac{s_C}{s_Q} = \mathop {\mathrm {arg\,max}}\limits _{Y\in \{0,1\}^N} \frac{Y^\top (K^C)^{-1}Y}{Y^\top (K^Q)^{-1}Y} \end{aligned}$$
(B48)

where \(Y^* =[y_1,..., y_N]\) are the optimal relabelled instances and \(K^C\) and \(K^Q\) are Gram matrices for the classical and quantum kernels, respectively. This is related to the process of maximising the kernel target alignment measure (Cristianini et al. 2006), where in order for a kernel to have high performance, the targets, Y, must be well aligned with the transformed set of instances. The quantity is more intuitive and has the form

$$\begin{aligned} \mathcal {T}(K) = \frac{\sum _{i=1}^N\sum _{j=1}^NK_{ij}y_i y_j }{N\sqrt{\sum _{i=1}^N\sum _{j=1}^N K_{ij}^2}} = \frac{Y^\top K Y}{N\left\Vert K\right\Vert _F} \end{aligned}$$
(B49)

where \(||\cdot ||_F\) is the Frobenius norm. Therefore, the equivalent optimisation problem of Eq. B48 becomes the minimisation of the quantity \(\mathcal {T}(K^C)/\mathcal {T}(K^Q)\). Nevertheless, the optimisation problem in Eq. B48 can be transformed to a generalised eigenvalue problem of the following form by replacing \(Y\rightarrow \phi \in \mathbb {R}^N\):

$$\begin{aligned} (K^C)^{-1} \phi = \lambda (K^Q)^{-1} \phi \end{aligned}$$
(B50)

where \(\phi \) and \(\lambda \) are the eigenvectors and eigenvalues respectively. As Eq. B48 is a maximisation problem, the solution is therefore the eigenvector \(\phi ^*\) associated with the largest eigenvalue \(\lambda ^*\). To obtain an exact solution, we use the fact that the generalised eigenvalue decomposition is related to the regular decomposition of \(\sqrt{K^Q}(K^C)^{-1}\sqrt{K^Q}\) with identical eigenvalues \(\lambda _i\). Hence, supposing a decomposition \(\sqrt{K^Q}(K^C)^{-1}\sqrt{K^Q}=Q\Lambda Q^\top \), where \(\Lambda = \text {diag}(\lambda _1,..., \lambda _N)\), it can be shown that \(\phi ^* = \sqrt{K^Q}q^*\) (Boyd and Vandenberghe 2004), where \(q^*\) is the eigenvector associated with the largest eigenvalue, \(\lambda ^*\). The continuous vector \(\phi ^*\) can then be transformed into labels by simply taking the sign of each entry. Furthermore, following Huang et al. (2021), we also inject randomness, selecting the label for a given data point \(x_i\) as \(y_i = \text {sign}(\phi ^*_i)\) with probability 0.9 and randomly assigning \(y_i = \pm 1\) otherwise. This relabelling strategy is denoted as \((y_i)_i = \mathscr {R}_{\Phi }^{\text {QK}}(\{(x_i, y_i')\}_i)\).
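This relabelling can be implemented directly with standard linear algebra routines. The following is a small sketch: the scipy matrix square root is an assumed convenience and the small ridge term added for invertibility is our own illustrative choice.

```python
import numpy as np
from scipy.linalg import sqrtm


def relabel_qk(K_classical, K_quantum, flip_prob=0.1, seed=0):
    """Relabel data to favour the quantum kernel: take the top eigenvector of
    sqrt(K^Q) (K^C)^{-1} sqrt(K^Q), map back via phi* = sqrt(K^Q) q*, then
    take signs with a small amount of injected label noise."""
    rng = np.random.default_rng(seed)
    ridge = 1e-8 * np.eye(len(K_classical))          # illustrative regularisation
    KQ_sqrt = np.real(sqrtm(K_quantum))
    A = KQ_sqrt @ np.linalg.inv(K_classical + ridge) @ KQ_sqrt

    eigvals, eigvecs = np.linalg.eigh(A)             # A is symmetric
    phi_star = KQ_sqrt @ eigvecs[:, -1]              # eigenvector of largest eigenvalue

    labels = np.sign(phi_star)
    flip = rng.random(len(labels)) < flip_prob       # random +/-1 with probability 0.1
    labels[flip] = rng.choice([-1.0, 1.0], size=flip.sum())
    return labels
```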

It is important to note that the relabelling of datasets is dependent on the specific quantum and classical kernels employed. In this work, the radial basis function kernel is used as the classical kernel, and we obtain different labels depending on the quantum embedding used, \(\Phi _{\text {IQP}}\) or \(\Phi _{\text {Eff}}\).

1.6.3 c.   Relabelling datasets for optimal QRF performance, \(\mathscr {R}_{\Phi }^{\text {QRF}}\)

Classically, SVMs require kernels in cases where points in the original feature space are not linearly separable. A simple example is shown in Fig. 7a, where no single hyperplane in this space can separate the two classes. However, it is clear that a tree structure with multiple separating hyperplanes has the ability to separate the clusters of points. This inspires a relabelling strategy that attempts to separate instances into four regions using a projection onto a particular axis. Since we are only able to compute the inner products of instances and do not have direct access to the induced Hilbert space, we randomly select two points, \(x'\) and \(x''\), to form the axis of projection. The projection of a point z onto this axis is given by the elementary vector projection

$$\begin{aligned} \frac{\langle z - x', x'' - x'\rangle _{\mathcal {H}}}{\sqrt{|\langle x''-x', x''-x'\rangle _{\mathcal {H}}|}} \end{aligned}$$
(B51)

where we have specified the inner product over the reproducing Hilbert space \(\mathcal {H}\) associated with some quantum kernel k. Using the fact that points \(x'\) and \(x''\) are fixed, we define a projection measure of the form

$$\begin{aligned} P(x_i) = \langle x_i , x'' \rangle _{\mathcal {H}} - \langle x_i , x' \rangle _{\mathcal {H}} = k(x_i , x'') - k(x_i , x') := P_i \end{aligned}$$
(B52)

where \(P_i:= P(x_i)\) for \(x_i \in \mathcal {S}|_{\mathcal {X}}\). We now label the classes such that we have alternating regions

$$\begin{aligned} Y_i = {\left\{ \begin{array}{ll} 0 &{} \text {if } P_i< P^{(\text {q1})} \text { or } P^{(\text {q2})} \le P_i <P^{(\text {q3})}\\ 1 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(B53)

where \(P^{(\text {q1})}, P^{(\text {q2})}, P^{(\text {q3})}\) are the first, second and third quartiles of \(\{P_i\}_{i=1}^N\), respectively. An illustration of this relabelling process can be seen in Fig. 7a. One should note that, unlike the relabelling strategy of Section B 6 b, here we do not compare against the performance of the classical learner. The intuition—and what we observe numerically—is that a clear division in the quantum-induced Hilbert space does not translate to a well-defined division for classical kernels. Therefore, we have a labelling scheme that is in general hard for both classical learners and QSVMs. This relabelling process will be referred to as \((y_i)_i = \mathscr {R}_{\Phi }^{\text {QRF}}(\{(x_i, y_i')\}_i)\).
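A minimal sketch of this projection-and-quartile relabelling is given below; the helper name and the use of a precomputed Gram matrix are illustrative assumptions.

```python
import numpy as np


def relabel_qrf(K, seed=0):
    """Relabel instances into alternating regions along a random projection
    axis in the kernel-induced feature space, using only Gram-matrix entries.

    K is the N x N kernel (Gram) matrix, K[i, j] = k(x_i, x_j)."""
    rng = np.random.default_rng(seed)
    a, b = rng.choice(len(K), size=2, replace=False)   # the two anchor points x', x''

    P = K[:, b] - K[:, a]                              # P_i = k(x_i, x'') - k(x_i, x')
    q1, q2, q3 = np.quantile(P, [0.25, 0.5, 0.75])

    # Alternating regions: class 0 below q1 and between q2 and q3, class 1 otherwise.
    labels = np.where((P < q1) | ((q2 <= P) & (P < q3)), 0, 1)
    return labels
```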

Fig. 7  Both figures depict concept classes that are not learnable by simple hyperplanes. a Two points are randomly selected (green) for the relabelling process \(\mathscr {R}_\Phi ^{\text {QRF}}\) elaborated in Section B 6 c. This creates an alternating pattern in \(\mathcal {H}_\Phi \). Note that in high dimensions, points become sparsely separated, resulting in only a slight advantage of the QRF over other learners. b Here, we observe an extension of the DLP classification problem (Liu et al. 2021) into two dimensions. The torus is split into four alternating regions at \(s'\) and \(s''\). Since this separation exists in log-space, classical learners are presumed unable to solve this problem due to the hardness of solving the DLP. Furthermore, since multiple hyperplanes are required, a simple quantum kernel also fails

1.7 7.   Learning advantage with a QRF

In the main text, we asserted that the QRF can extend the linear hypotheses available to linear quantum models. In this section, we show that there are problems that are both difficult for quantum kernels (and by extension QNNs (Schuld 2021)) and classical learners. We claim advantages of the QRF model by observing the hypothesis set available as a result of the QDT structure. We assume that we have \(L=N\) landmark points. In other words, we look at QDTs that do not take a low-rank Nyström approximation.

To show there exist concept classes that are unlearnable by both classical learners and quantum linear models, we extend the DLP-based concept class given in Liu et al. (2021). The work looks at the class \(\mathscr {C}_{\text {QK}}= \{f_s^{\text {QK}}\}_{s\in \mathbb {Z}_p^*}\) where

$$\begin{aligned} f_s^{\text {QK}}(x) = {\left\{ \begin{array}{ll} +1 &{} \text {if } \log _g x \in [s, s+\frac{p-3}{2}] \\ -1 &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(B54)

and \(x\in \mathbb {Z}_p^*\), for generator g of \(\mathbb {Z}_p^* = \{1,2,...,p-1\}\) with large prime p. Each concept of the concept class divides the set \(\mathbb {Z}_p^*\) into two sets, giving rise to a binary class problem. It is important to note that the problem is a trivial 1D classification task on the assumption that one can efficiently solve the DLP problem. However, it can be shown that the DLP problem lies in BQP and is otherwise presumed to be classically intractable. This gives rise to a quantum advantage in learning the concept class \(\mathscr {C}_{\text {QK}}\) with a quantum kernel with the following quantum feature map:

$$\begin{aligned} |\Phi _{\text {DLP}}^q(x_i)\rangle = \frac{1}{\sqrt{2^q}} \sum _{j\in \{0,1\}^q} |x_i \cdot g^j\rangle \end{aligned}$$
(B55)

where \(x_i \in \{0,1\}^n\) over \(n=\lceil \log _2 p \rceil \) qubits, and \(q=n-t\log n\) for some fixed t. This state can be efficiently prepared with a fault-tolerant quantum computer using Shor’s algorithm (Shor 1997). Such embeddings can be interpreted as interval states, as they are a superposition over an interval in log-space, \([\log x, \log x + 2^q - 1]\). The subsequent kernel function, \(k(x_i, x_j) = |\langle \Phi _{\text {DLP}}^q(x_j) | \Phi _{\text {DLP}}^q(x_i)\rangle |^2\), can therefore be interpreted as the interval intersection of the two states in log-space. Guarantees of the inability to classically estimate these kernel elements up to additive error follow identically to proofs supplied in Liu et al. (2021).
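To build intuition for this interval-intersection picture, the following toy snippet computes the kernel value classically for a small prime, assuming the discrete logarithms are already known—which is precisely the step that is classically infeasible for cryptographically sized p.

```python
# Toy illustration of the DLP interval kernel for a small prime (the discrete
# logs are brute-forced here, which is exactly what is infeasible at scale).
p, g, q = 23, 5, 3          # small prime, generator of Z_p^*, interval length 2^q

dlog = {pow(g, j, p): j for j in range(p - 1)}   # log table, feasible only for tiny p

def interval_kernel(x1, x2):
    """|<Phi(x2)|Phi(x1)>|^2 = (overlap of the two log-space intervals / 2^q)^2."""
    I1 = {(dlog[x1] + j) % (p - 1) for j in range(2 ** q)}
    I2 = {(dlog[x2] + j) % (p - 1) for j in range(2 ** q)}
    return (len(I1 & I2) / 2 ** q) ** 2

print(interval_kernel(3, 6), interval_kernel(3, 3))   # partial and full overlap
```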

Now, to show that there exist advantages of using a QRF, we construct a different concept class, \(\mathscr {C}_{\text {QRF}}= \{f_{s',s''}^{\text {QRF}}\}_{s', s'' \in \mathbb {Z}_p^*}\) where we have

$$\begin{aligned} f_{s',s''}^{\text {QRF}}(x) = {\left\{ \begin{array}{ll} +1 &{} \text {if } \log _g x^{(0)} \in [s', s'+\frac{p-3}{2}] \text { and } \log _g x^{(1)} \notin [s'', s''+\frac{p-3}{2}]\\ +1 &{} \text {if } \log _g x^{(0)} \notin [s', s'+\frac{p-3}{2}] \text { and } \log _g x^{(1)} \in [s'', s''+\frac{p-3}{2}]\\ -1 &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
(B56)

and where \(x=(x^{(0)}, x^{(1)})\in \mathbb {Z}_p^*\times \mathbb {Z}_p^*\) for generator g of \(\mathbb {Z}_p^*\). This can be interpreted as a separation of a torus into four segments as shown in Fig. 7b. The associated quantum feature map can be constructed through the tensor product, \(|\Phi _{\text {DLP}^{\otimes 2}}^q(x_i)\rangle = |\Phi _{\text {DLP}}^q(x_i^{(0)})\rangle \, |\Phi _{\text {DLP}}^q(x_i^{(1)})\rangle \) over \(n=2\lceil \log _2 p \rceil \) qubits. This gives a kernel function of the form, \(k(x_i, x_j) = |\langle \Phi _{\text {DLP}}^q(x_j^{(0)}) | \Phi _{\text {DLP}}^q(x_i^{(0)})\rangle |^2\cdot |\langle \Phi _{\text {DLP}}^q(x_j^{(1)}) | \Phi _{\text {DLP}}^q(x_i^{(1)})\rangle |^2\) which by Theorem 2 is a valid product kernel of DLP kernels. The interpretation of this kernel function is now a 2D interval (area) in log-space.

Since concepts in \(\mathscr {C}_{\text {QRF}}\) cannot be learned with a single hyperplane in log-space, a quantum kernel with a \(\Phi _{\text {DLP}^{\otimes 2}}^q\) embedding will be unable to learn them. Classical methods also fail due to the assumed hardness of the DLP. In comparison, a QDT with its tree structure is able to construct multiple decision boundaries to precisely determine the regions of different classes. In practice, there is a high likelihood of any single QDT overfitting to the given samples—hence requiring an ensemble of QDTs to form a QRF model.

Finally, it is important to note that we are only claiming that the concept class \(\mathscr {C}_{\text {QRF}}\) is unlearnable by QSVMs and potentially learnable by a QRF. There clearly exists a hypothesis \(h_c \in \mathbb {H}_{\text {QDT}}\) that emulates a particular concept \(c\in \mathscr {C}_{\text {QRF}}\). However, this does not say anything about the ability of the QRF algorithm to obtain \(h_c\) with a particular training set sampled from c. Hence, there remains the question of whether there exists an evaluation algorithm \(\mathcal {A}\), such that given a training set \(\mathcal {S}_c = \{x_i, c(x_i)\}_{i=1}^N\), we have \(\mathcal {A}(\mathcal {S}_c)|_{x=x'}=h(x')\approx c(x')\) for some predefined \(c\in \mathscr {C}_{\text {QRF}}\).

1.8 8.   Limitations of quantum kernel methods

Recent work (Jäger and Krems 2022) has shown that quantum kernels can be constructed to present a quantum advantage using any BQP decision problem; however, there is a lack of evidence that they provide any benefit when dealing with real-world data. Though such constructions—including the advantage shown in Liu et al. (2021)—are crucial milestones in the development of QML, they embed the solution to the learning problem within the specific embedding employed. It is hard to see the set of model hypotheses arising from the DLP embedding (Liu et al. 2021) being useful in any problem other than the specific example it was designed for. Therefore, as was required in the development of effective classical kernels, further research is required to construct relevant embeddings for practical problems (Kübler et al. 2021).

In obtaining appropriate embeddings, one also needs to consider the curse of dimensionality (Bellman et al. 1957). With greater dimensions, a larger number of data points is required to understand patterns in the space. Though obtaining a separation between sparsely distributed points is generally easier (Cover’s theorem (Cover 1965)), such separations are far more prone to overfitting. Therefore, generalisability seems unlikely given the exponential growth of the Hilbert space with respect to the number of qubits. Furthermore, as the dimension increases, points are sent to the peripheries of the feature space, resulting in the inner product becoming meaningless in high dimensions. This results in kernel matrix elements being exponentially (with respect to the number of qubits) suppressed towards zero. Hence, an exponentially large number of samples would be required to well-approximate each vanishing kernel element. Though it was seen that the Nyström approximation allows a smaller shot complexity to bound the error in the kernel matrix, \(M\sim NL\) (see Section B 5), this does not take into account vanishing kernel elements. Bernoulli trials have variance \(\frac{(1-k)k}{M}\) for kernel element k. This means that for \(k\sim 2^{-n}\), we require \(M\sim \mathcal {O}(2^n)\) shots to distinguish elements from zero. This was discussed in Peters et al. (2021), and specific embeddings were constructed so that kernel elements did not vanish with larger numbers of qubits. To further aid with this problem, reducing the bandwidth parameter of quantum kernels has been seen to improve generalisation (Canatar et al. 2022; Shaydulin and Wild 2022). In essence, both these methods work to reduce the volume of Hilbert space explored, resulting in larger inner products.
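The shot requirement implied by this variance argument can be made concrete with a couple of lines of arithmetic; the numbers below are illustrative only, and the factor of three on the standard error is an arbitrary choice.

```python
# Shots needed so that the standard error of a Bernoulli estimate of a kernel
# element k ~ 2^-n is at least three times smaller than k itself.
for n in (4, 8, 12, 16):
    k = 2.0 ** -n
    M = 9 * (1 - k) * k / k**2          # require sqrt(k(1-k)/M) <= k/3
    print(f"n = {n:2d} qubits: k ~ {k:.1e}, M >= {M:.1e} shots")
```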

Finally, there are also the practical questions of the effect of noise and finite sampling error on the kernel machine optimisation. The SVM problem is convex; however, this is not true once errors are incorporated. In Liu et al. (2021), the effect of such errors on the optimal solution is explored. It is shown that the noisy halfspace learning problem is robust to noise with \(M\sim \mathcal {O}(N^4)\). However, this is an extremely large number of shots that very quickly becomes impractical for even medium-sized datasets. It is crucial that work is done to reduce this complexity if we are to see useful implementations of kernel methods. In the case of the QRF, we welcome weak classifiers, and there are arguments for such errors acting as a natural regulariser that reduces overfitting (Heyraud et al. 2022).

Appendix 3.   Summary of variables

There are many moving parts and hyperparameters to the QRF model. We therefore take the opportunity to list most of the parameters used throughout the paper. Hopefully, this saves the reader the headache of scavenging for definitions.

Parameter list

  • \(\mathscr {C}\): Concept class. See Section B7 for concept class based on DLP.

  • C: Penalty parameter for fitting SVM, \(C\propto 1/\lambda \).

  • \(\mathcal {C}\): The set of classes.

  • d: Maximum depth of QDT.

  • D: Dimension of the original classical feature space, \(x_i \in \mathbb {R}^D\) or equivalently \(D=\text {dim}(\mathcal {X})\).

  • \(\mathcal {D}\): Indicates a specific dataset with \(\mathcal {S}=\{(x_i , y_i)|(x_i , y_i)\sim \mathcal {D}\}_{i=1}^N \) forming the training set.

  • \(G_{KW}\): Geometric difference between kernels K and W—see discussion in Section B 6 b.

  • \(\mathcal {H}\): Hilbert space. \(\mathcal {H}_\Phi \) refers to the quantum feature space induced by the embedding \(\Phi \).

  • \(\mathbb {H}\): Hypothesis set - see Section A 3.

  • \(k(x',x'')\): Kernel function.

  • K: Gram matrix, also referred to as the kernel matrix, \(K_{ij}=k(x_i, x_j)\).

  • L: Number of landmark points chosen for Nyström approximation – see Section A 4.

  • M: Number of samples taken from each quantum circuit (shots).

  • n: Number of qubits.

  • N: Total number of training instances.

  • \(\mathcal {N}_{\Phi , L}^{(i)}\): Denotes \(i^\text {th}\) node with embedding \(\Phi \) and L landmark points.

  • \(\textbf{N}(\cdot )\): Nyström feature map. See definition in Eq. A49.

  • \(\mathcal {R}(h)\): Generalisation error for hypothesis h – see Def. 4. \(\widehat{\mathcal {R}}(h)\) refers to the empirical error.

  • \(\mathfrak {R}_{\mathcal {S}}\): Rademacher complexity over dataset \(\mathcal {S}\). The empirical Rademacher complexity is then given by \(\widetilde{\mathfrak {R}}_{\mathcal {S}}\), see Section A 3.

  • \(\mathscr {R}_\Phi \): Relabelling function for a given embedding \(\Phi \). Can either be \(\mathscr {R}^{\text {QK}}_\Phi \) or \(\mathscr {R}^{\text {QRF}}_\Phi \) with explanations in Sections B 6 b and B 6 c respectively.

  • \(s_K\): Model complexity for kernel K defined as \(s_K = Y^\top K^{-1}Y\) for given training labels Y.

  • \(\mathcal {S}^{(i)}\): Training set available at node i. \(\mathcal {S}_{\text {tot}}\) refers to all data points (including training and testing) from a given dataset. \(\mathcal {S}|_{\mathcal {X}}\) refers to the data vectors.

  • T: Number of trees in the QRF.

  • \(\mathcal {T}(K)\): Kernel target alignment measure given Gram matrix K. Quantity defined in Eq. B49.

  • \(\lambda \): Regularisation parameter for fitting SVM, \(\lambda \propto 1/C\).

  • \(\Phi \): Indicates the quantum embedding and takes the form \(\Phi _{\text {IQP}}\) or \(\Phi _{\text {Eff}}\) defined in Section B 2.

  • \(\gamma _i\): Margin at node labelled i.

  • \(F^{(i)}(\cdot )\): Pseudo class map that defines the binary class for the splitting at node i. Refer to Section B 3.

  • \(\mathcal {X}\): Feature space.

  • \(\mathcal {Y}\): The set of class labels.


Cite this article

Srikumar, M., Hill, C.D. & Hollenberg, L.C.L. A kernel-based quantum random forest for improved classification. Quantum Mach. Intell. 6, 10 (2024). https://doi.org/10.1007/s42484-023-00131-2
