Quantum Gaussian Process Regression for Bayesian Optimization

Gaussian process regression is a well-established Bayesian machine learning method. We propose a new approach to Gaussian process regression using quantum kernels based on parameterized quantum circuits. By employing a hardware-efficient feature map and careful regularization of the Gram matrix, we demonstrate that the variance information of the resulting quantum Gaussian process can be preserved. We also show that quantum Gaussian processes can be used as a surrogate model for Bayesian optimization, a task that critically relies on the variance of the surrogate model. To demonstrate the performance of this quantum Bayesian optimization algorithm, we apply it to the hyperparameter optimization of a machine learning model which performs regression on a real-world dataset. We benchmark the quantum Bayesian optimization against its classical counterpart and show that quantum version can match its performance.

Quantum computers are expected to have a profound impact on numerous areas in science and industry.The ongoing progress of quantum computing hardware [1][2][3] is accompanied by intense algorithmic research activities which explore avenues towards achieving a quantum advantage beyond proof-of-principles [4,5].Quantum machine learning combines quantum computing and machine learning and is often deemed as one of the fields that could benefit from quantum computing early [6].While some quantum machine learning methods rely on running quantum versions of linear algebra sub-routines for a speed-up [7][8][9], these methods usually require deep quantum circuits that are beyond the capabilities of currently accessible noisy intermediate-scale quantum (NISQ) hardware [10].
Recently, quantum kernel methods have received much attention.These methods are appealing because they can be studied using the well established toolbox of classical kernel theory [11,12].Furthermore, using a suitable feature map, they can be implemented on available NISQ devices [13].The general idea is to project the data into the Hilbert space of a quantum computer using a quantum feature map.By calculating pair-wise inner products of data points, a kernel matrix can be calculated which can then be used in classical methods such as support vector machines or kernel ridge regression [14,15].The expectation is that by encoding the data into a quantum Hilbert space, the feature map can be enriched with non-classical resources that provide an advantage compared to classical feature maps.This has already been demonstrated for tailored datasets [6,16].
While quantum versions of kernel machines like the support vector machine [8] have been the focus of recent studies, quantum variants of probabilistic kernel methods have not received as much attention.In this work, we use quantum kernels to create quantum Gaussian processes (QGP).Gaussian process (GP) models are popular machine learning methods based on Bayesian inference.GPs are specified by a covariance matrix which can be obtained by calculating the Gram matrix of a kernel function for a given dataset.Given their probabilistic nature, GPs have the desirable property of providing a variance for their predictions which allows uncertainty quantification.
Earlier investigation of QGPs have focused on using quantum approximations of classical kernels and have raised the question whether the variance information can be retained in noisy near-term devices [17].Here, we investigate QGP regression using a hardware-efficient, parameterized feature map.We demonstrate that careful regularization of the Gram matrix can help preserve the variance and show how the overall performance can be improved with an end-to-end optimization using log-likelihood optimization.We show the capabilities of the QGP model by using it as surrogate model for a Bayesian optimization (BO) [18], a task that critically relies on the variance information of the surrogate model.We benchmark the resulting quantum Bayesian optimization (QBO) against optimizations using a surrogate models based on conventional GPs and show that QBO can match their performance on the task of optimizing the multidimensional hyperparameters of a classical machine learning model.The hyperparamter optimization is performed on a regression task of a real-world dataset which evaluates the remaining value of used industrial machinery.Figure 1 gives an overview of the various components used in this work.
The manuscript is structured as follows.In Sec. 1, we provide an introduction to the fundamentals of QGPs by briefly discussing GP theory and exploring quantum kernels.Subsequently, we illustrate the concept of quantum BO using a QGP surrogate model.In Sec. 2, we demonstrate the versatility and effectiveness of QGP models through our analysis of a one-dimensional dataset, followed by their successful application in QBO for the purpose of minimizing a multidimensional function and identifying the optimal hyperparameters of a machine learning model.We present the results of our simulations, including those obtained from noiseless and sample-based experiments, as well as the outcomes from a real quantum computing backend.

Quantum Gaussian Process Regression
Gaussian process regression is a non-parametric Bayesian machine learning method [19].It can be used to solve a regression problem of the form where f (x) is a data generating function, with labels y ∈ R, observed data x ∈ X ⊂ R d and independent zero-mean Gaussian noise ∼ N (0, σ 2 ).If f is a random function with a Gaussian prior distribution then the function values can be taken as random variables that form a Gaussian process (GP).We denote the GP as GP(m, k) with a mean function m and a covariance function k.Note that k is mathematically equivalent to a kernel function, we will therefore refer to it as kernel in the following.A GP is a collection of random variables such that any finite subset is Gaussian distributed [19].Concretely, for collection of data points X := (x 1 , . . ., x n ) , x i ∈ X the variables f (x i ) n i=1 are jointly distributed by a multivariate Gaussian distribution such that GPs are thus distributions over functions specified by the covariance k [20].
To predict the values f * of new data points X * (test points), we can calculate the posterior distribution given X and GP regression thus not only yields a prediction for the mean µ * but also for the covariance Σ * .They are given by The elements of the Gram matrices k XX , k XX * and k X * X * are the pair-wise inner products of the training points, the training and test points, and the test points, respectively.Note that here we have assumed that we only have access to noisy labels as in Eq. ( 1).The variance of this noise can be explicitly taken into account in the calculation of the mean and the variance.This servers as an implicit regularization which often results in a better conditioned posterior covariance matrix.Equations ( 2)- (4) show that the outcome of the GP is fully governed by the choice of the kernel.In general, a kernel is a positive definite function k : χ × χ → R, which serves as a similarity measure between pairs of inputs x and x .Specifically, the kernel computes the inner product of the corresponding feature vectors φ(x) and φ(x in a potentially high-dimensional feature space F, where the feature map φ(x) is a non-linear map from the input space χ to the feature space F.

Quantum kernels
Kernels can be constructed by embedding data into the Hilbert space of a quantum system [13,21] [see Fig. 1(a)].The resulting quantum state is The unitary operator U (x; θ) implements the quantum feature map quantum feature map φ.It encodes the classical data point x into a quantum state.In principle, it can depend on additional parameters θ that can be trained variationally [22].
Using the feature map in Eq. ( 6), a quantum kernel can be defined in terms of the Hilbert-Schmidt inner product with the density matrix ρ It can be shown that this definition results in a positive definite kernel [12].For pure states, Eq. ( 7) reduces to the overlap between the states encoding the data points such that in practice the kernel elements can be calculated by applying the feature map and its inverse to x and x and measuring the occupation of the ground state (8) From this it becomes clear that the defining quantity for a quantum kernel k is the quantum feature map φ.The choice of an optimal embedding strategy is an open research question such that the feature maps are often chosen heuristically.Finally, to obtain a quantum GP, we substitute a quantum kernel Eq. ( 7) into the definition of the variance of a GP model Eq. ( 4).This is illustrated in Fig 1(a).
The variational parameters in Eq. ( 6) can be trained using various methods.Popular approaches for quantum kernel machines such as quantum support vector machines or quantum kernel ridge regression often optimize the kernel directly using, e.g., kernel alignment techniques [22][23][24].In this work we make use of the Bayesian framework of GPs and train the QGP model end-to-end by maximizing the marginal log-likelihood.Due to the Gaussian form of the posterior [cf.Eq. ( 2)], the marginal log-likelihood can be given in closed form [19] log p(y|X Here, k XX (θ) indicates the dependence of the kernel on the parameters θ through the parameterized feature map.The optimization workflow is sketched in Fig. 1(a).Optimizing parameterized quantum circuit is an active area of research with open questions such as how to avoid barren plateaus during training [25].
In practice, the kernel elements in Eq. ( 8) can only be computed approximately because any observable has to be be determined using a finite amount of measurements.The resulting statistical error scales as O(1/ √ N ) where N is the number of measurements.In addition, available NISQ devices suffer from a multitude of noise sources such as short coherence times, gate errors and cross-talk.As a result the estimated kernel k deviates from the true kernel k.To ensure that k is positive definite, we need to apply regularization techniques.Taking account of the variance for noisy objective functions as done in Eq. ( 3)-( 4) already serves as an inherent regularization.Nevertheless, for noiseless objective functions or for noisy estimates k, this might not be sufficient to ensure positive definiteness.Therefore, we employ an eigenvalue-cutoff strategy, where the spectrum of the full Gram matrix is truncated at zero [26].This requires a full eigenvalue decomposition of the Gram matrix followed by a reconstruction using the truncated spectrum and the original eigenvectors [22].This technique has already been shown to provide good results [27].Additionally, compared to other methods such as shifting the spectrum by the lowest eigenvalue, the truncation does not introduce a constant offset to the variance of the GP model which is desirable for applications where the quantification of uncertainty is required.In general, the regularization of Gram matrices used for GP regression is problem-specific and non-trivial, even for classical kernels [28].
In this work, we are interested in using QGP models as surrogate models in Bayesian optimization.This is explained in the next section and illustrated in Fig. 1(b).

Quantum Bayesian Optimization
Bayesian optimization [30] is a global optimization method that solves problems of the form The optimization is performed iteratively where the next sample is chosen using information obtained from previous iterations.Through this informed guidance, BO usually requires a modest amount of samples which makes it attractive for problems where the evaluation of g is expensive.BO treats g as a black-box such that there are no further restrictions regarding its functional form.The algorithm is initialized by drawing a random sample and fitting a surrogate model as a proxy for g.The next sample is then chosen by considering an exploitation-exploration trade-off which is quantified by an acquisition function.This procedure is then repeated such that the surrogate model approximates the true function increasingly well.Due to their posterior variance output GP models are popular choices for surrogates.A common choice for an acquisition function is the expected improvement (EI) [18] which measures the expectation of the improvement on the objective g(x) with respect to the predictive distribution of the surrogate model.The EI function is given by and EI = 0 for Σ(x) = 0.Here µ(x) and Σ(x) are the posterior mean prediction and the prediction uncertainty of the surrogate model at position x, and ϕ(Z), and Φ(Z) are the probability distribution and the cumulative distribution of Fig. 2 Example of the hardware efficient feature map with q = 4 qubits and l = 1 layers, inspired by a Chebychev quantum feature map design [29].The trainable parameters are denoted by θ i and the data points by x.For the results in this work, various values of q and l are used.
the standard normal distribution.The location of the best sample, i.e., the current observed minimum of the surrogate model, is indicated by x + .The standardized prediction error Z is given by The parameter λ in Eq. ( 11) is a hyperparameter that controls the exploitationexploration trade-off, where a high value of λ favours exploration.We obtain a quantum Bayesian optimization (QBO) algorithm by using a QGP model as a surrogate model.This has the potential to enhance BO for scenarios where quantum kernels have an advantage over classical kernels.A possible drawback is that the exploitation-exploration trade-off, which depends on the model variance is now influenced by quantum computing noise sources.To demonstrate the QBOs capabilities, we apply it to several test cases which is shown in the next section.

Results
We illustrate the capabilities of QGP models on a one dimensional regression problem.We then demonstrate the feasibility of using QBO with a QGP surrogate model on two multidimensional optimization tasks.The quantum circuits for the QGP models are implemented using Qiskit [31].The linear systems for the GPs are solved using a Cholesky decomposition of the Gram matrices.We validate the algorithm using numerical simulators provided by Qiskit.Results from real quantum computers are obtained from ibmq montreal [32].

Quantum Gaussian Process Regression
We apply QGP regression on a one dimensional dataset where the data generating function [cf.Eq.( 1)] is We assume that only noisy labels y can be observed with zero-mean Gaussian noise with a variance σ 2 = (0.1) 2 [cf.Eq. ( 1)] .We sample n training = 23 non-equidistant training-points in the interval [0, 2π], and n test = 50 equidistantly-spaced test points.
The quantum kernel is calculated using a hardware-efficient feature map with variational parameters θ as depicted in Fig. 2. 1 We encode the data using q = 4 qubits and l = 2 layers.To account for the limited domain of the non-linearity in the feature map, the labels y are scaled to the interval [−1, 1].
To gauge the performance of the model under ideal conditions, we perform statevector simulations from which we obtain completely noiseless quantum kernels.The regression result can be seen in Fig. 3(a) where the mean prediction of the model is shown as a solid line and the standard deviation is depicted as a shaded area.Overall the method is able to achieve a good fit a is visible in the figure.The standard deviation that is obtained from the QGP variance has a reasonable behaviour and is low in areas with high training point density and high in ares where training points are lacking.Although good results can already be achieved using a general feature map, e.g., by choosing the parameters θ randomly [33,34], we adapt the kernel to the dataset using maximum-likelihood optimization (cf.Eq. ( 9) and surrounding discussion).The marginal log-likelihood as a function of optimization iterations can be seen in Fig. 4. In this example, the optimization leads to a reduction of the mean squared error (MSE) by about an order of magnitude [from 0.3 (R 2 = 0.939) to 0.02 (R 2 = 0.996)] We observe a convergence of the marginal log-likelihood after ∼ 80 iterations.The specific optimization behavior is dependent on the chosen feature map design such as the number of qubits, layers and variational parameters.We use the optimal parameters obtained from these ideal simulation for subsequent noisy simulations and calculations on real quantum computers.
Any real quantum computation is ultimately affected by statistical errors.Figure 3(b) shows results of the same simulation as in Fig 3(a) with sample-based estimation of the wavefunctions with a modest amount of N = 10, 000 measurements per evaluation point.These kind of simulations are a good indicator of the future performance of the model in a regime with low hardware noise.Due to the statistical error in this simulation, the kernel is now only a noisy estimate k of the true kernel k.As can be seen in the figure, the performance of the model is only slightly worse compared to the ideal simulation (MSE = 0.024).Particularly, due to careful regularization of the Gram matrix (cf.Sec.1.1) the variance information can be retained reasonably well.
We conclude this example by running the QGP regression on real quantum hardware using the ibmq montreal device.The results are shown in Fig. 3(c).We use readout error mitigation [35], and dynamical decoupling [36] to mitigate the hardware errors.Compared to the simulations, the performance of the model slightly decreases with the method obtaining an erorr of MSE = 0.114 on the test data.Nevertheless, the mean prediction only marginally deviates from the true function.As expected, the regularization of the quantum kernel matrices has to be increased such that the overall standard deviation increases.Nevertheless, even on the real quantum computer the variance of the standard deviation of the prediction can still be retained such that one can clearly distinguish between areas of high and low uncertainty.This is a substantial improvement compared to previous results [17].
The quality of the solution and the posterior variance are dependent on the chosen quantum feature map.Appendix B shows results for the same dataset using a different feature map and a different quantum computer.

Quantum Bayesian Optimization
We asses the QBO routine introduced in Sec.1.2 by minimizing the two-dimensional Branin-Hoo function where a, b, c, s, t are real parameters and x 1 ∈ [−5, 10], x 2 ∈ [0, 15].We fix the parameters such that the function has three global minima (cf.caption of Fig. 5).We substitute Eq. ( 13) to into Eq.( 1) to generate data with zero mean Gaussian noise with a variance of σ 2 = (0.5) 2 .
The hardware efficient feature map illustrated in Fig. 2 is utilized for the QGP model which is used as a surrogate model for the QBO.We encode the two-dimensional input vector with q = 4 qubits which increases the model's expressibility compared to a single encoding [37].Every parameter θ in the feature map is sampled uniformly from the interval [0, 2π] and kept fixed for the duration of the optimization.
Figure 5(a) shows the results for statevector (red line) and sample-based simulations (blue line) where the optimization has been averaged over 25 runs.The resulting standard deviation of the respective simulations is depicted as shaded areas.It can be seen that both, the BO using kernels obtained from the noiseless and the noisy simulations converge to the true minimum of the function.Especially for the sample-based simulation, this requires thoughtful regularization of the quantum Gram matrices.We compare the performance of the QBO routines to a classical BO with a GP using an RBF kernel.The RBF kernel is optimized in each iteration using maximum likelihood estimation.Despite this optimization which is not used by the QBO it can be seen that the classical and the quantum models perform comparably well.
To demonstrate the applicability of QBO to a real-world scenario, we use the algorithm to optimize the hyperparameters ξ of a gradient boosting model h(x, ξ) [38] that is applied to a regression task as illustrated in Fig. 1(c).The gradient boosting model is used to predict the price of industrial machinery with respect to different machine types, specifications, and amount of working hours.In total, the dataset contains 2910 data points, and the one-hot encoding of the categorical features leads to 65 features in total.Further details are shown in Appendix A. For the optimization, we fix the categorical hyperparameters of the gradient boosting model and only optimize the five continuous hyperparameters (cf.Table 1).The objective function for the QBO is the cross validated MAE of the gradient boosting model on the training dataset for a given set of hyperparameters.
We encode the five dimensional hyperparameter vector with the feature map in Fig. 2 using q = 10 qubits and l = 2 layers.Figure 5(b) depcits the result for the different BO runs.Additionally, a random search is shown for comparison.As in the previous example, the QBO results are compared to a BO with a classical GP with an optimized RBF kernel.It can be seen that the results of the QBO are on par with the results of the classical BO.This is true for both, the statevector and the samplebased simulations.As expected, all BO approaches outperform the random search on average.

Discussion
In this study, we apply QGP models to one and multi-dimensional regression problems and show that they can be used as a surrogate model for BO to create a QBO.We demonstrate that QBO can  be used to solve real-world hyperparameter optimization problems.Our encoding strategy allows for effectively using the variational parameters of the data embedding circuit as hyperparameters for the quantum kernels.In our simulations, we observe that the posterior variance of the QGP remains intact under the influence of samplingnoise and even for the calculation NISQ devices, although the influence of the various error sources in the latter affect the result.Nevertheless, since the results from the sampling-based simulations can be seen as an upper-bound for future hardware capabilities, the outlook is optimistic.
Although we demonstrate the feasibility of using QBO to optimize hyperparameters of a machine learning model, the potential benefits of employing quantum kernels over classical machine learning methods in tasks using classical data remain uncertain [39].However, it is reasonable to expect that QBO may provide advantages in problems where quantum data can be leveraged to achieve a quantum advantage [16].Notably, QBO is potentially well-suited for active learning tasks in expensive molecular simulations, where the evaluation of the potential energy surface is based on quantum mechanics and is computationally expensive [40,41].
The performance of the QGP model remains unexplored in several avenues within this work.For example, the choice of feature map is a crucial aspect and it has been shown that choosing problem-specific feature map with an inductive bias that is tailored to the dataset has various advantages such as improved performance and trainability [23,42].It is also known that using parametrized feature maps require special care when scaling the number of qubits which can lead to exponential concentration [44].
Moreover, in this work, we use fidelity-based kernels for the QGP.These have an unfavorable quadratic scaling with the size of the dataset as the pair-wise inner product of the data points have to be calculated.An alternative approach would be to use projected quantum kernels as proposed in [43] which not only have a linear scaling but also are thought to have beneficial properties when the dimension of the feature space increases significantly.These alternative kernels could easily be integrated in the QGP and analyzed in future studies.
While the QGP models presented in this work feature a quantum calculation of the kernel, the majority of their operations are performed classically.However, there is potential for increased improvements by creating a fully quantum QGP with a quantum kernel and employing HHL-based inversion of the covariance matrix [7,9].Such an approach could leverage the benefits of both quantum kernels and quantum linear algebra subroutines, which would help overcome today's limitation of GP models which are currently affected by an unfavorable scaling with the size of the dataset.

2 𝑥Fig. 1
Fig. 1 Conceptual layout of the workflow used in this work.(a) The QGP model is constructed by calculating a quantum kernel and substituting the corresponding Gram matrix as covariance matrix into a classical GP.If the feature map used for the quantum kernel contains variational parameters, they can be optimized using maximum likelihood estimation [Eq.(9)].(b) By using a QGP model as a surrogate model for Bayesian optimization, a QBO can be obtained.(c) In Sec.2.2, the QBO algorithm is used to optimize the hyperparameters ξ of a gradient boosting model h(x, ξ) which performs regression on a dataset for remaining value estimation of industrial machines.

Fig. 3
Fig.3QGP regression on a dataset created using Eq.(12) (black line).The results are obtained using the feature map in Fig.2with q = 4 qubits and l = 2 layers for the encoding, n training = 23 training points, shown as the blue crosses.The test points are marked by the red dots.The posterior mean of the QGP is shown as the red-line and the standard deviation as the shaded area.(a) shows the result of the statevector simulation with optimized parameters, obtaining an R 2 score of 0.996 and an MSE = 0.022.(b) shows the result of the sample-based simulation.We use the optimal parameters obtained in the previous ideal run, resulting in an R 2 score of 0.996 and an MSE = 0.024.(c) shows the result of the real hardware run, using the ibmq montreal backend, leading to an R 2 score of 0.978 and an MSE = 0.114.All runs use the same parameters.

Fig. 4
Fig.4Convergence plot of the log-likelihood loss function [cf.Eq. (9)], the loss is entirely evaluated on the training data.The variable parameter of the optimization are the angles θ in the feature map.

Fig. 5
Fig.5BO results averaged over independent runs with the mean shown as solid lines and the variance as shades.The expected improvement [Eq.(11)] is used as acquisition function with an exploration-exploitation parameter of λ = 0.1 The classical BO uses a GP surrogate model with an optimized RBF kernel (black line).The QBO results are obtained with the feature map in Fig.2using statevector (red line) and sample-based simulations (blue line).The the initial samples for each individual run are the same for the quantum and classical QBO for better comparison.At each iteration, only the best current result is shown.(a) shows the result for the minimization Eq. (13) where the parameters are fixed at a = 1, b = 5.1/(4π 2 ), c = 5/π, r = 6, s = 10 and t = 1/8π.The feature map for the QBO uses q = 4 qubits and l = 2 layers.The results are averaged over 25 runs.(b) shows the result of the hyperparameter optimization of a gradient boosting model on a industrial dataset.The average result of ten iterations of random search runs is shown (green, solid).The kernel is calculated using q = 10 qubits and l = 2 layers.

Fig
Fig. B1 Optimized QGP regression results to regress Eq. (12) (black line).Using n training = 23 training points, shown as the blue crosses, ntest = 50 test points, marked by the red dots.The posterior mean of the QGP is shown as the red-line and the standard deviation as the shaded area.(a) shows the result of the ideal simulation with optimized parameters, obtaining an R 2 score of 0.986 and an MSE = 0.07.(b) shows the result with sampling noise.We use the optimal parameters obtained in the previous ideal run, resulting in an R 2 score of 0.987 and an MSE = 0.069.(c) shows the of the real hardware run, using the ibmq ehningen backend, leading to an R 2 score of 0.951 and an MSE = 0.261.All runs use the same parameters.
Fig.B2Convergence plot of the log-likelihood loss function [cf.Eq. (9)], the loss is entirely evaluated on the training data.The variable parameter of the optimization are the angles θ in the feature map.

Table 1
Hyperparameter space of the gradient boosting model.