1 Introduction

Quantum computers are expected to have a profound impact on numerous areas in science and industry. The ongoing progress of quantum computing hardware (Madsen et al. 2022; Arute et al. 2019; Bruzewicz et al. 2019) is accompanied by intense algorithmic research activities which explore avenues towards achieving a quantum advantage beyond proof-of-principle demonstrations (Bravyi et al. 2020; Biamonte et al. 2017). Quantum machine learning combines quantum computing and machine learning and is often considered one of the fields that could benefit from quantum computing early on (Liu et al. 2021). While some quantum machine learning methods rely on running quantum versions of linear algebra subroutines for a speed-up (Harrow et al. 2009; Rebentrost et al. 2014; Zhao et al. 2019), these methods usually require deep quantum circuits that are beyond the capabilities of currently accessible noisy intermediate-scale quantum (NISQ) hardware (Preskill 2018).

Fig. 1

Conceptual layout of the workflow used in this work. (a) The QGP model is constructed by calculating a quantum kernel and substituting the corresponding Gram matrix as covariance matrix into a classical GP. If the feature map used for the quantum kernel contains variational parameters, they can be optimized using maximum likelihood estimation (Eq. (9)). (b) Using a QGP model as the surrogate model for Bayesian optimization yields a quantum Bayesian optimization (QBO) algorithm. (c) In Section 3.2, the QBO algorithm is used to optimize the hyperparameters \(\xi \) of a gradient boosting model \(h(\varvec{x},\xi )\) which performs regression on a dataset for remaining value estimation of industrial machines

Recently, quantum kernel methods have received much attention. These methods are appealing because they can be studied using the well-established toolbox of classical kernel theory (Schölkopf et al. 2001; Schuld 2021). Furthermore, using a suitable feature map, they can be implemented on available NISQ devices (Havlíček et al. 2019). The general idea is to project the data into the Hilbert space of a quantum computer using a quantum feature map. By calculating pair-wise inner products of data points, a kernel matrix can be calculated which can then be used in classical methods such as support vector machines or kernel ridge regression (Schölkopf et al. 2002; Vovk 2013). The expectation is that by encoding the data into a quantum Hilbert space, the feature map can be enriched with non-classical resources that provide an advantage compared to classical feature maps. This has already been demonstrated for tailored datasets (Liu et al. 2021; Huang et al. 2022).

While quantum versions of kernel machines like the support vector machine (Rebentrost et al. 2014; Havlíček et al. 2019) have been the focus of recent studies, quantum variants of probabilistic kernel methods have not received as much attention in the NISQ context. In this work, we use quantum kernels to create quantum Gaussian processes (QGP). Gaussian process (GP) models are popular machine learning methods based on Bayesian inference. GPs are specified by a covariance matrix which can be obtained by calculating the Gram matrix of a kernel function for a given dataset. Given their probabilistic nature, GPs have the desirable property of providing a variance for their predictions which allows uncertainty quantification.

Earlier investigations of QGPs include HHL-based algorithms which require fault-tolerant quantum computers (Chen et al. 2022; Zhao et al. 2019) and methods that rely on classically evaluated quantum kernels (Smith et al. 2023). Additionally, quantum approximations of classical kernels have been studied and have raised the question whether the variance information can be retained on noisy near-term devices (Otten et al. 2020). In this work, we present a NISQ-ready QGP model which uses a hardware-efficient, parameterized feature map. We show that this model can be used in practically relevant scenarios. To this end, we demonstrate that careful regularization of the Gram matrix helps preserve the variance and show how the overall performance can be improved with an end-to-end optimization of the marginal log-likelihood. We show the capabilities of the QGP model by using it as a surrogate model for Bayesian optimization (BO) (Archetti and Candelieri 2019), a task that critically relies on the variance information of the surrogate model. We benchmark the resulting quantum Bayesian optimization (QBO) against optimizations using a surrogate model based on conventional GPs and show that QBO can match their performance on the task of optimizing the multidimensional hyperparameters of a classical machine learning model. The hyperparameter optimization is performed on a regression task for a real-world dataset which evaluates the remaining value of used industrial machinery. Figure 1 gives an overview of the various components used in this work.

The manuscript is structured as follows. In Section 2, we provide an introduction to the fundamentals of QGPs by briefly discussing GP theory and exploring quantum kernels. Subsequently, we illustrate the concept of quantum BO using a QGP surrogate model. In Section 3, we demonstrate the versatility and effectiveness of QGP models through our analysis of a one-dimensional dataset, followed by their successful application in QBO for the purpose of minimizing a multidimensional function and identifying the optimal hyperparameters of a machine learning model. We present the results of our simulations, including those obtained from noiseless and sample-based experiments, as well as the outcomes from a real quantum computing backend.

2 Quantum Gaussian process regression

Gaussian process regression is a non-parametric Bayesian machine learning method (Rasmussen and Williams 2005). It can be used to solve a regression problem of the form

$$\begin{aligned} y = f(\varvec{x}) + \epsilon \,, \end{aligned}$$
(1)

where \(f(\varvec{x})\) is a data-generating function, with labels \(y \in \mathbb {R}\), observed data \(\varvec{x} \in \mathcal {X} \subset \mathbb {R}^d\) and independent zero-mean Gaussian noise \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). If f is a random function with a Gaussian prior distribution, then the function values can be taken as random variables that form a GP. We denote the GP as \(\mathcal{G}\mathcal{P}(m,k)\) with a mean function m and a covariance function k. Note that k is mathematically equivalent to a kernel function; we will therefore refer to it as a kernel in the following. A GP is a collection of random variables such that any finite subset is Gaussian distributed (Rasmussen and Williams 2005). Concretely, for a collection of data points \(X:= (\varvec{x}_1,\dots ,\varvec{x}_n),\varvec{x}_i \in \mathcal {X}\), the variables \(f(\varvec{x}_i)_{i=1}^n\) are jointly distributed according to a multivariate Gaussian distribution such that

$$\begin{aligned} f(\varvec{x}) \sim \mathcal {N}(m(\varvec{x}),k(\varvec{x}, \varvec{x'}))\,. \end{aligned}$$

GPs are thus distributions over functions specified by the covariance k (Dudley 2002).

To predict the values \(f_*\) of new data points \(X_*\) (test points), we can calculate the posterior distribution given X and \(X_*\)

$$\begin{aligned} p(f_* \vert X_*, X, f) = \mathcal {N}(f_* ; \mu _*, \Sigma _*)\,. \end{aligned}$$
(2)

GP regression thus not only yields a prediction for the mean \(\mu _*\) but also for the covariance \(\Sigma _*\). They are given by

$$\begin{aligned} \mu _*&= k_{X X_*}^T (k_{XX}+ \sigma ^2 \varvec{I})^{-1} f, \end{aligned}$$
(3)
$$\begin{aligned} \Sigma _*&= k_{X_* X_*} - k_{X X_*}^T (k_{XX}+ \sigma ^2 \varvec{I})^{-1} k_{XX_*}. \end{aligned}$$
(4)

The elements of the Gram matrices \(k_{XX}\), \(k_{XX_*}\) and \(k_{X_* X_*}\) are the pair-wise inner products of the training points, the training and test points, and the test points, respectively. Note that here we have assumed that we only have access to noisy labels as in Eq. 1. The variance of this noise can be explicitly taken into account in the calculation of the mean and the variance. This serves as an implicit regularization which often results in a better conditioned posterior covariance matrix.
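The posterior in Eqs. 3–4 can be evaluated directly from the three Gram matrices. The following is a minimal sketch, assuming precomputed NumPy Gram matrices and using a Cholesky factorization of the regularized training matrix; it illustrates the computation rather than the exact implementation used for the results below.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(K_xx, K_xs, K_ss, y, sigma2):
    """Posterior mean and covariance of Eqs. 3-4, computed from the Gram
    matrices k_XX, k_XX*, k_X*X* via a Cholesky factorization of
    K_xx + sigma^2 I."""
    n = K_xx.shape[0]
    chol = cho_factor(K_xx + sigma2 * np.eye(n), lower=True)
    alpha = cho_solve(chol, y)    # (k_XX + sigma^2 I)^{-1} f
    v = cho_solve(chol, K_xs)     # (k_XX + sigma^2 I)^{-1} k_XX*
    mu = K_xs.T @ alpha           # Eq. 3
    cov = K_ss - K_xs.T @ v       # Eq. 4
    return mu, cov
```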

Equations 2–4 show that the outcome of the GP is fully governed by the choice of the kernel. In general, a kernel is a positive definite function \(k: \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}\), which serves as a similarity measure between pairs of inputs \(\varvec{x}\) and \(\varvec{x'}\). Specifically, the kernel computes the inner product of the corresponding feature vectors \(\phi (\varvec{x})\) and \(\phi (\varvec{x'})\)

$$\begin{aligned} k(\varvec{x},\varvec{x'}) = \langle \phi (\varvec{x}),\phi (\varvec{x'}) \rangle _\mathcal {F}, \end{aligned}$$
(5)

in a potentially high-dimensional feature space \(\mathcal {F}\), where the feature map \(\phi (\varvec{x})\) is a non-linear map from the input space \(\mathcal {X}\) to the feature space \(\mathcal {F}\).

2.1 Quantum kernels

Kernels can be constructed by embedding data into the Hilbert space of a quantum system (Schuld and Killoran 2019; Havlíček et al. 2019) (see Fig. 1(a)). The resulting quantum state is

$$\begin{aligned} |\phi (\varvec{x};\varvec{\theta })\rangle = U(\varvec{x};\varvec{\theta })|{0}\rangle \,. \end{aligned}$$
(6)

The unitary operator \(U(\varvec{x};\varvec{\theta })\) implements the quantum feature map \(\phi \). It encodes the classical data point \(\varvec{x}\) into a quantum state. In principle, it can depend on additional parameters \(\varvec{\theta }\) that can be trained variationally (Hubregtsen et al. 2021). Using the feature map in Eq. 6, a quantum kernel can be defined in terms of the Hilbert-Schmidt inner product

$$\begin{aligned} k(\varvec{x},\varvec{x'}) = \text {Tr} \left[ \rho (\varvec{x}) \rho (\varvec{x'}) \right] , \end{aligned}$$
(7)

with the density matrix \(\rho (\varvec{x})=U(\varvec{x})|0\rangle \!\langle 0| U^\dag (\varvec{x})\). It can be shown that this definition results in a positive definite kernel (Schuld 2021). For pure states, Eq. 7 reduces to the overlap between the states encoding the data points such that in practice the kernel elements can be calculated by applying the feature map and its inverse to \(\varvec{x}\) and \(\varvec{x'}\) and measuring the occupation of the ground state

$$\begin{aligned} k(\varvec{x},\varvec{x'}) = |\langle \phi (\varvec{x'})|{\phi (\varvec{x})}\rangle |^2 = \big |\langle {0}| U (\varvec{x'})^\dagger U(\varvec{x})|0 \rangle \big |^2. \end{aligned}$$
(8)

Other methods, like the SWAP test, can also calculate the overlap between two states. However, the method above is more efficient in terms of the number of qubits (Hubregtsen et al. 2021).
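To make this concrete, the sketch below evaluates a single kernel element as in Eq. 8 by composing the feature map circuit for \(\varvec{x}\) with the inverse of the circuit for \(\varvec{x'}\) and reading out the probability of the all-zero state. The circuit is an illustrative hardware-efficient ansatz (not the exact feature map of Fig. 2), and the evaluation uses Qiskit's statevector simulation instead of hardware measurements.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def feature_map(x, thetas, q=4):
    """Illustrative hardware-efficient feature map: non-linear single-qubit
    data encoding, trainable rotations and a CNOT entangling layer."""
    qc = QuantumCircuit(q)
    for i in range(q):
        qc.ry(np.arccos(np.clip(x, -1.0, 1.0)), i)  # data encoding on [-1, 1]
        qc.rz(thetas[i], i)                          # trainable parameter theta_i
    for i in range(q - 1):
        qc.cx(i, i + 1)
    return qc

def quantum_kernel(x, x_prime, thetas):
    """Kernel element of Eq. 8: apply U(x) followed by U^dagger(x') and read
    out the probability of the all-zero state (noiseless statevector run)."""
    qc = feature_map(x, thetas).compose(feature_map(x_prime, thetas).inverse())
    return Statevector(qc).probabilities()[0]
```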

From Eq. 8 it becomes clear that the defining quantity for a quantum kernel k is the quantum feature map \(\phi \). The choice of an optimal embedding strategy is an open research question, and feature maps are therefore often chosen heuristically. Finally, to obtain a quantum GP, we substitute the quantum kernel of Eq. (7) into the GP posterior equations, in particular the variance in Eq. (4). This is illustrated in Fig. 1(a).

The variational parameters in Eq. 6 can be trained using various methods. Popular approaches for quantum kernel machines such as quantum support vector machines or quantum kernel ridge regression often optimize the kernel directly using, e.g., kernel alignment techniques (Hubregtsen et al. 2021; Kübler et al. 2021; Glick et al. 2022). In this work, we make use of the Bayesian framework of GPs and train the QGP model end-to-end by maximizing the marginal log-likelihood. Due to the Gaussian form of the posterior (cf. Eq. 2), the marginal log-likelihood can be given in closed form, up to an additive constant (Rasmussen and Williams 2005)

$$\begin{aligned} \log p(y\vert X) =&-\frac{1}{2}y^T (k_{XX}(\varvec{\theta })+\sigma ^2 I)^{-1} y \\&- \frac{1}{2} \log \det (k_{XX}(\varvec{\theta })+\sigma ^2 I)\,. \end{aligned}$$
(9)

Here, \(k_{XX}(\varvec{\theta })\) indicates the dependence of the kernel on the parameters \(\varvec{\theta }\) through the parameterized feature map. The optimization workflow is sketched in Fig. 1(a). Optimizing parameterized quantum circuits is an active area of research with open questions such as how to avoid barren plateaus during training (McClean et al. 2018).
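A minimal sketch of the training objective is shown below: it evaluates Eq. 9 (with the constant term omitted, as above) from a precomputed training Gram matrix, using a Cholesky factor for numerical stability. The function name and interface are illustrative.

```python
import numpy as np

def log_marginal_likelihood(K_xx, y, sigma2):
    """Marginal log-likelihood of Eq. 9 for a Gram matrix k_XX(theta),
    labels y and noise variance sigma^2 (constant term omitted)."""
    n = len(y)
    L = np.linalg.cholesky(K_xx + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log det(K + sigma^2 I) = 2 * sum(log(diag(L)))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
```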

In practice, the kernel elements in Eq. 8 can only be computed approximately because any observable has to be determined using a finite number of measurements. The resulting statistical error scales as \(\mathcal {O}(1/\sqrt{N})\), where N is the number of measurements. In addition, available NISQ devices suffer from a multitude of noise sources such as short coherence times, gate errors and cross-talk. As a result, the estimated kernel \(\tilde{k}\) deviates from the true kernel k. To ensure that \(\tilde{k}\) is positive definite, we need to apply regularization techniques. Taking the noise variance into account for noisy objective functions, as done in Eqs. 3–4, already serves as an inherent regularization. Nevertheless, for noiseless objective functions or for noisy estimates \(\tilde{k}\), this might not be sufficient to ensure positive definiteness. Therefore, we employ an eigenvalue-cutoff strategy, where the spectrum of the full Gram matrix is truncated at zero (Graepel et al. 1998). This requires a full eigenvalue decomposition of the Gram matrix followed by a reconstruction using the truncated spectrum and the original eigenvectors (Hubregtsen et al. 2021). This technique has already been shown to provide good results (Wang et al. 2021). Additionally, compared to other methods such as shifting the spectrum by the lowest eigenvalue, the truncation does not introduce a constant offset to the variance of the GP model, which is desirable for applications where the quantification of uncertainty is required. In general, the regularization of Gram matrices used for GP regression is problem-specific and non-trivial, even for classical kernels (Mohammadi et al. 2016).
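A minimal sketch of this eigenvalue-cutoff regularization, assuming a (nearly) symmetric noisy Gram matrix estimate:

```python
import numpy as np

def truncate_negative_eigenvalues(gram):
    """Eigenvalue cutoff: symmetrize the noisy Gram matrix estimate, clip
    negative eigenvalues to zero and rebuild the matrix from the original
    eigenvectors, restoring positive semi-definiteness."""
    gram_sym = 0.5 * (gram + gram.T)
    eigvals, eigvecs = np.linalg.eigh(gram_sym)
    return eigvecs @ np.diag(np.clip(eigvals, 0.0, None)) @ eigvecs.T
```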

In this work, we are interested in using QGP models as surrogate models in Bayesian optimization. This is explained in the next section and illustrated in Fig. 1(b).

2.2 Quantum Bayesian optimization

Fig. 2

Example of the hardware-efficient feature map with \(q=4\) qubits and \(l=1\) layers, inspired by a Chebyshev quantum feature map design (Kyriienko et al. 2021). The trainable parameters are denoted by \(\theta _i\) and the data points by x. For the results in this work, various values of q and l are used

Bayesian optimization (Garnett 2023) is a global optimization method that solves problems of the form

$$\begin{aligned} \varvec{x}^* = \text {arg} \min _{\varvec{x}} g(\varvec{x})\,. \end{aligned}$$
(10)

The optimization is performed iteratively, where the next sample is chosen using information obtained from previous iterations. Through this informed guidance, BO usually requires a modest number of samples, which makes it attractive for problems where the evaluation of g is expensive. BO treats g as a black box, so no further restrictions are placed on its functional form.

The algorithm is initialized by drawing a random sample and fitting a surrogate model as a proxy for g. The next sample is then chosen by considering an exploitation-exploration trade-off which is quantified by an acquisition function. This procedure is then repeated such that the surrogate model approximates the true function increasingly well. Due to their posterior variance output, GP models are popular choices as surrogates. A common choice for an acquisition function is the expected improvement (EI) (Archetti and Candelieri 2019), which measures the expectation of the improvement on the objective \(g(\varvec{x})\) with respect to the predictive distribution of the surrogate model. The EI function is given by

$$\begin{aligned} \textrm{EI}(\varvec{x}) = [g(\varvec{x}^+) - \mu (\varvec{x}) - \lambda ] \varvec{\Phi }(Z) + \Sigma (\varvec{x}) \varphi (Z)\,, \end{aligned}$$
(11)

and \(\textrm{EI}=0\) for \(\Sigma (\varvec{x}) = 0\). Here, \(\mu (\varvec{x})\) and \(\Sigma (\varvec{x})\) are the posterior mean prediction and the prediction uncertainty of the surrogate model at position \(\varvec{x}\), and \(\varphi (Z)\) and \(\varvec{\Phi }(Z)\) are the probability density function and the cumulative distribution function of the standard normal distribution, respectively. The location of the best sample, i.e., the current observed minimum of the surrogate model, is indicated by \(\varvec{x}^+\). The standardized prediction error Z is given by \(Z = [g(\varvec{x}^+) - \mu (\varvec{x}) - \lambda ]/\Sigma (\varvec{x})\) if \(\Sigma (\varvec{x}) > 0\) and \(Z=0\) if \(\Sigma (\varvec{x}) = 0\). The parameter \(\lambda \) in Eq. 11 is a hyperparameter that controls the exploitation-exploration trade-off, where a high value of \(\lambda \) favors exploration.
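For reference, a minimal sketch of Eq. 11 for a single candidate point; mu and sigma denote the surrogate's posterior mean and standard deviation, and g_best the best observed objective value.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, g_best, lam=0.1):
    """Expected improvement (Eq. 11) for minimization; returns 0 when the
    predicted uncertainty vanishes."""
    if sigma <= 0.0:
        return 0.0
    z = (g_best - mu - lam) / sigma
    return (g_best - mu - lam) * norm.cdf(z) + sigma * norm.pdf(z)
```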

We obtain a quantum Bayesian optimization (QBO) algorithm by using a QGP model as the surrogate model. This has the potential to enhance BO in scenarios where quantum kernels have an advantage over classical kernels. A possible drawback is that the exploitation-exploration trade-off, which depends on the model variance, is now influenced by quantum computing noise sources. To demonstrate the QBO's capabilities, we apply it to several test cases, which are presented in the next section.
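A compact sketch of the resulting QBO loop is given below. It combines the hypothetical helpers sketched above (quantum_kernel, truncate_negative_eigenvalues, gp_posterior, expected_improvement) and assumes, for simplicity, a finite set of scalar candidates inside the feature map's input domain; the actual implementation used in this work differs in its details.

```python
import numpy as np

def qbo_minimize(objective, candidates, thetas, sigma2=0.25,
                 n_init=3, n_iter=20, lam=0.1, seed=0):
    """Minimal QBO loop (Fig. 1(b)): refit a QGP surrogate on all evaluated
    points and pick the next candidate by maximizing expected improvement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(candidates), size=n_init, replace=False)
    X = [candidates[i] for i in idx]
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        # QGP surrogate: quantum Gram matrices, regularized by eigenvalue cutoff
        K_xx = truncate_negative_eigenvalues(
            np.array([[quantum_kernel(a, b, thetas) for b in X] for a in X]))
        K_xs = np.array([[quantum_kernel(a, c, thetas) for c in candidates] for a in X])
        K_ss = np.diag([quantum_kernel(c, c, thetas) for c in candidates])
        mu, cov = gp_posterior(K_xx, K_xs, K_ss, np.array(y), sigma2)
        std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # only the diagonal is needed
        ei = [expected_improvement(m, s, min(y), lam) for m, s in zip(mu, std)]
        x_next = candidates[int(np.argmax(ei))]
        X.append(x_next)
        y.append(objective(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]
```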

3 Results

We illustrate the capabilities of QGP models on a one-dimensional regression problem. We then demonstrate the feasibility of using QBO with a QGP surrogate model on two multidimensional optimization tasks. The quantum circuits for the QGP models are implemented using Qiskit (Qiskit Community 2017). The linear systems for the GPs are solved using a Cholesky decomposition of the Gram matrices. We validate the algorithm using numerical simulators provided by Qiskit. Results from real quantum computers are obtained from ibmq_montreal (Ibm 2021).

Fig. 3

QGP regression on a dataset created using Eq. 12 (black line). The results are obtained using the feature map in Fig. 2 with \(q=4\) qubits and \(l=2\) layers for the encoding, \(n_{\text {training}} = 23\) training points, shown as the blue crosses. The test points are marked by the red dots. The posterior mean of the QGP is shown as the red line and the standard deviation as the shaded area. (a) shows the result of the statevector simulation with optimized parameters, obtaining an \(R^2\) score of 0.996 and an \( \text {MSE} = 0.022 \). (b) shows the result of the sample-based simulation. We use the optimal parameters obtained in the previous ideal run, resulting in an \(R^2\) score of 0.996 and an \( \text {MSE} = 0.024 \). (c) shows the result of the real hardware run, using the ibmq_montreal backend, leading to an \(R^2\) score of 0.978 and an \( \text {MSE} = 0.114 \). All runs use the same parameters

3.1 Quantum Gaussian process regression

We apply QGP regression on a one-dimensional dataset where the data generating function (cf. Eq. 1) is

$$\begin{aligned} f(x) = x \sin (x)\,. \end{aligned}$$
(12)

We assume that only noisy labels y can be observed, with zero-mean Gaussian noise with a variance \(\sigma ^2= (0.1)^2\) (cf. Eq. 1). The \(n_{\text {training}} = 23\) training points are sampled from a uniform distribution in the interval \([0,2\pi ]\), and we sample \(n_{\text {test}} = 50\) equidistantly spaced test points.
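A minimal sketch of the data generation (the random seed is an arbitrary choice for reproducibility and not taken from this work):

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = rng.uniform(0.0, 2 * np.pi, size=23)                       # n_training = 23
y_train = x_train * np.sin(x_train) + rng.normal(0.0, 0.1, size=23)  # Eq. 12 plus noise
x_test = np.linspace(0.0, 2 * np.pi, num=50)                         # n_test = 50
```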

The quantum kernel is calculated using a hardware-efficient feature map with variational parameters \(\varvec{\theta }\) as depicted in Fig. 2 (Kreplin and Roth 2023). We encode the data using \(q=4\) qubits and \(l=2\) layers. To account for the limited domain of the non-linearity in the feature map, the labels y are scaled to the interval \([-1, 1]\).

To gauge the performance of the model under ideal conditions, we perform statevector simulations from which we obtain completely noiseless quantum kernels. The regression result can be seen in Fig. 3(a), where the mean prediction of the model is shown as a solid line and the standard deviation is depicted as a shaded area. Overall, the method achieves a good fit. The standard deviation obtained from the QGP variance behaves reasonably: it is low in areas with a high density of training points and high in areas where training points are lacking.

Although good results can already be achieved using a general feature map, e.g., by choosing the parameters \(\varvec{\theta }\) randomly (Haug et al. 2023; Jerbi et al. 2023), we adapt the kernel to the dataset by maximum likelihood optimization (cf. Eq. 9 and surrounding discussion), using an Adam optimizer with a learning rate of 0.1. The marginal log-likelihood as a function of optimization iterations can be seen in Fig. 4. In this example, the optimization reduces the mean squared error (MSE) by about an order of magnitude (from 0.3 (\(R^2=0.939\)) to 0.02 (\(R^2=0.996\))). We observe convergence of the marginal log-likelihood after \(\sim 80\) iterations. The specific optimization behavior depends on the chosen feature map design, such as the number of qubits, layers and variational parameters. We use the optimal parameters obtained from this ideal simulation for subsequent noisy simulations and calculations on real quantum computers.
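The sketch below illustrates this end-to-end training step: Adam (learning rate 0.1) ascends the marginal log-likelihood of Eq. 9 over the feature-map angles, here with a central finite-difference gradient as a simple stand-in for the gradient computation. It reuses the hypothetical quantum_kernel and log_marginal_likelihood sketches from Section 2.

```python
import numpy as np

def adam_maximize_mll(thetas0, X, y, sigma2=0.01, lr=0.1, n_steps=100, eps=1e-3):
    """Maximize the marginal log-likelihood (Eq. 9) over the feature-map
    angles theta with Adam, using finite differences for the gradient."""
    def mll(thetas):
        K = np.array([[quantum_kernel(a, b, thetas) for b in X] for a in X])
        return log_marginal_likelihood(K, np.asarray(y), sigma2)

    thetas = np.array(thetas0, dtype=float)
    m, v = np.zeros_like(thetas), np.zeros_like(thetas)
    beta1, beta2, delta = 0.9, 0.999, 1e-8
    for t in range(1, n_steps + 1):
        grad = np.zeros_like(thetas)
        for i in range(len(thetas)):
            shift = np.zeros_like(thetas)
            shift[i] = eps
            grad[i] = (mll(thetas + shift) - mll(thetas - shift)) / (2 * eps)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # bias-corrected Adam step, ascending the log-likelihood
        thetas += lr * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + delta)
    return thetas
```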

Any real quantum computation is ultimately affected by statistical errors. Figure 3(b) shows results of the same simulation as in Fig. 3(a) with sample-based estimation of the wavefunctions with a modest number of \(N=10,000\) measurements per evaluation point. These kinds of simulations are a good indicator of the future performance of the model in a regime with low hardware noise. Due to the statistical error in this simulation, the kernel is now only a noisy estimate \(\tilde{k}\) of the true kernel k. As can be seen in the figure, the performance of the model is only slightly worse compared to the ideal simulation (\(\text {MSE}=0.024\)). In particular, due to careful regularization of the Gram matrix, the variance information can be retained reasonably well. As discussed in Section 2.1, regularization is required to ensure the positive semi-definiteness of \(\tilde{k}\). A straightforward strategy would be to treat \(\sigma \) in the implicit regularization of the GP (Eqs. (3)–(4)) as a hyperparameter, a technique that is commonly used in kernel ridge regression. Increasing \(\sigma \) results in stronger regularization; however, this can lead to a loss of variance information, as can be seen in Appendix C. This can be especially hindering for tasks that rely on this variance information, such as BO, which requires an adequate variance to negotiate the exploration-exploitation trade-off, as discussed in the next section.

Fig. 4

Convergence plot of the log-likelihood loss function (cf. Eq. 9). The loss is evaluated entirely on the training data. The variable parameters of the optimization are the angles \(\varvec{\theta }\) in the feature map

We conclude this example by running the QGP regression on real quantum hardware using the ibmq_montreal device. The results are shown in Fig. 3(c). We use readout error mitigation (Nation et al. 2021) and dynamical decoupling (Ezzell et al. 2022) to mitigate the hardware errors. Compared to the simulations, the performance of the model decreases slightly, with the method obtaining an error of \(\text {MSE} = 0.114\) on the test data. Nevertheless, the mean prediction only marginally deviates from the true function. As expected, the regularization of the quantum kernel matrices has to be increased such that the overall standard deviation increases. Even so, the spatial variation of the predictive standard deviation can still be retained on the real quantum computer, such that one can clearly distinguish between areas of high and low uncertainty. This is a substantial improvement compared to previous results (Otten et al. 2020).

The quality of the solution and the posterior variance are dependent on the chosen quantum feature map. Appendix B shows results for the same dataset using a different feature map and a different quantum computer.

3.2 Quantum Bayesian optimization

We assess the QBO routine introduced in Section 2.2 by minimizing the two-dimensional Branin-Hoo function

$$\begin{aligned} f_{\text {bh}}(\varvec{x}) = a(x_2 -bx_1^2 + cx_1 - r)^2 + s(1-t) \cos (x_1) + s\,, \end{aligned}$$
(13)

where \(a\), \(b\), \(c\), \(r\), \(s\), and \(t\) are real parameters and \(x_1 \!\in \! [-5,10]\text {, } x_2 \in [0,15]\). We fix the parameters such that the function has three global minima (cf. caption of Fig. 5). We substitute Eq. (13) into Eq. (1) to generate data with zero-mean Gaussian noise with a variance of \(\sigma ^2 = (0.5)^2\).
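A sketch of the noisy objective, with the parameter values listed in the caption of Fig. 5:

```python
import numpy as np

def branin_hoo(x1, x2, rng=None, noise_std=0.5):
    """Branin-Hoo function of Eq. 13 with optional zero-mean Gaussian
    observation noise as in Eq. 1."""
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    value = a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s
    if rng is not None:
        value += rng.normal(0.0, noise_std)
    return value
```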

The hardware-efficient feature map illustrated in Fig. 2 is utilized for the QGP model which is used as a surrogate model for the QBO. We encode the two-dimensional input vector with \(q=4\) qubits which increases the model’s expressibility compared to a single encoding (Schuld et al. 2021). Every parameter \(\theta \) in the feature map is sampled uniformly from the interval \([0,2\pi ]\) and kept fixed for the duration of the optimization.

Figure 5(a) shows the results for statevector (red line) and sample-based simulations (blue line), where the optimization has been averaged over 25 runs. The resulting standard deviation of the respective simulations is depicted as shaded areas. It can be seen that the BO runs using kernels obtained from both the noiseless and the noisy simulations converge to the true minimum of the function. Especially for the sample-based simulation, this requires thoughtful regularization of the quantum Gram matrices. We compare the performance of the QBO routines to a classical BO with a GP using an RBF kernel. The RBF kernel is optimized in each iteration using maximum likelihood estimation. Despite this optimization, which is not used by the QBO, the classical and the quantum models perform comparably well.

Fig. 5

BO results averaged over independent runs with the mean shown as solid lines and the variance as shaded areas. The expected improvement (Eq. (11)) is used as acquisition function with an exploration-exploitation parameter of \(\lambda = 0.1\). The classical BO uses a GP surrogate model with an optimized RBF kernel (black line). The QBO results are obtained with the feature map in Fig. 2 using statevector (red line) and sample-based simulations (blue line). The initial samples for each individual run are the same for the quantum and classical BO for better comparison. At each iteration, only the best current result is shown. (a) shows the result for the minimization of Eq. (13) where the parameters are fixed at \(a=1\text {, }b=5.1/(4\pi ^2)\text {, }c = 5/\pi \text {, }r=6\text {, }s=10 \text { and } t=1/(8\pi )\). The feature map for the QBO uses \(q=4\) qubits and \(l=2\) layers. The results are averaged over 25 runs. (b) shows the result of the hyperparameter optimization of a gradient boosting model on an industrial dataset. The average result of ten iterations of random search runs is shown (green, solid). The kernel is calculated using \(q=10\) qubits and \(l=2\) layers

To demonstrate the applicability of QBO to a real-world scenario, we use the algorithm to optimize the hyperparameters \(\xi \) of a gradient boosting model \(h(\varvec{x},\xi )\) (Chen and Guestrin 2016) that is applied to a regression task, as illustrated in Fig. 1(c). The gradient boosting model is used to predict the price of industrial machinery with respect to different machine types, specifications, and the number of working hours. The dataset has been analyzed in detail in Stühler et al. (2023). In total, the dataset contains 2910 data points, and the one-hot encoding of the categorical features leads to 65 features in total. Further details are given in Appendix A. For the optimization, we fix the categorical hyperparameters of the gradient boosting model and only optimize the five continuous hyperparameters (cf. Table 1). The objective function for the QBO is the cross-validated mean absolute error (MAE) of the gradient boosting model on the training dataset for a given set of hyperparameters.

Table 1 Hyperparameter space of the gradient boosting model
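A sketch of such an objective function is given below, using XGBoost's scikit-learn interface and five-fold cross-validation. The specific hyperparameter names are illustrative placeholders; the actual search space is defined in Table 1.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def qbo_objective(xi, X_train, y_train):
    """Cross-validated MAE of a gradient boosting regressor for one set of
    continuous hyperparameters xi (illustrative choice of parameters)."""
    learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda = xi
    model = XGBRegressor(
        learning_rate=learning_rate,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        reg_alpha=reg_alpha,
        reg_lambda=reg_lambda,
    )
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_mean_absolute_error", cv=5)
    return -scores.mean()   # QBO minimizes the MAE
```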

We encode the five-dimensional hyperparameter vector with the feature map in Fig. 2 using \(q=10\) qubits and \(l=2\) layers. Figure 5(b) depicts the results for the different BO runs. Additionally, a random search is shown for comparison. As in the previous example, the QBO results are compared to a BO with a classical GP with an optimized RBF kernel. It can be seen that the results of the QBO are on par with the results of the classical BO. This holds for both the statevector and the sample-based simulations. As expected, all BO approaches outperform the random search on average.

4 Discussion

In this study, we apply QGP models to one-dimensional and multidimensional regression problems and show that they can be used as a surrogate model for BO to create a QBO. We demonstrate that QBO can be used to solve real-world hyperparameter optimization problems. Our encoding strategy allows for effectively using the variational parameters of the data embedding circuit as hyperparameters for the quantum kernels. In our simulations, we observe that the posterior variance of the QGP remains intact under the influence of sampling noise and even for the calculation on NISQ devices, although the various error sources of the latter affect the result. Nevertheless, since the results from the sample-based simulations can be seen as an upper bound for future hardware capabilities, the outlook is optimistic.

Although we demonstrate the feasibility of using QBO to optimize hyperparameters of a machine learning model, the potential benefits of employing quantum kernels over classical machine learning methods in tasks using classical data remain uncertain (Chia et al. 2020). However, it is reasonable to expect that QBO may provide advantages in problems where quantum data can be leveraged to achieve a quantum advantage (Huang et al. 2022). Notably, QBO is potentially well-suited for active learning tasks in expensive molecular simulations, where the evaluation of the potential energy surface is based on quantum mechanics and is computationally expensive (Denzel and Kästner 2018a, b).

Several aspects of the performance of the QGP model remain unexplored in this work. For example, the choice of feature map is a crucial aspect, and it has been shown that choosing a problem-specific feature map with an inductive bias that is tailored to the dataset has various advantages such as improved performance and trainability (Kübler et al. 2021; Cerezo et al. 2022). It is also known that using parameterized feature maps requires special care when scaling the number of qubits, which can otherwise lead to exponential concentration (Thanasilp et al. 2022).

Moreover, in this work, we use fidelity-based kernels for the QGP. These have an unfavorable quadratic scaling with the size of the dataset, as the pair-wise inner products of the data points have to be calculated. An alternative approach would be to use projected quantum kernels as proposed in Huang et al. (2021), which not only have a linear scaling but are also thought to have beneficial properties when the dimension of the feature space increases significantly. These alternative kernels could easily be integrated into the QGP and analyzed in future studies.

While the QGP models presented in this work feature a quantum calculation of the kernel, the majority of their operations are performed classically. However, there is potential for further improvements by creating a fully quantum QGP with a quantum kernel and employing HHL-based inversion of the covariance matrix (Harrow et al. 2009; Zhao et al. 2019). Such an approach could leverage the benefits of both quantum kernels and quantum linear algebra subroutines, which would help overcome the unfavorable scaling of today's GP models with the size of the dataset.