1 Introduction

Quantum computers are expected to have a profound impact on numerous areas in science and industry. The ongoing progress of quantum computing hardware (Madsen et al. 2022; Arute et al. 2019; Bruzewicz et al. 2019) is accompanied by intense algorithmic research activities which explore avenues towards achieving a quantum advantage beyond proof-of-principle demonstrations (Bravyi et al. 2020; Biamonte et al. 2017). Quantum machine learning combines quantum computing and machine learning and is often considered one of the fields that could benefit from quantum computing early on (Liu et al. 2021). While some quantum machine learning methods rely on running quantum versions of linear algebra subroutines for a speed-up (Harrow et al. 2009; Rebentrost et al. 2014; Zhao et al. 2019), these methods usually require deep quantum circuits that are beyond the capabilities of currently accessible noisy intermediate-scale quantum (NISQ) hardware (Preskill 2018).

Fig. 1

Conceptual layout of the workflow used in this work. (a) The QGP model is constructed by calculating a quantum kernel and substituting the corresponding Gram matrix as covariance matrix into a classical GP. If the feature map used for the quantum kernel contains variational parameters, they can be optimized using maximum likelihood estimation (Eq. (9)). (b) Using a QGP model as the surrogate model for Bayesian optimization yields a quantum Bayesian optimization (QBO) algorithm. (c) In Section 3.2, the QBO algorithm is used to optimize the hyperparameters \(\xi \) of a gradient boosting model \(h(\varvec{x},\xi )\) which performs regression on a dataset for remaining value estimation of industrial machines

Recently, quantum kernel methods have received much attention. These methods are appealing because they can be studied using the well-established toolbox of classical kernel theory (Schölkopf et al. 2001; Schuld 2021). Furthermore, using a suitable feature map, they can be implemented on available NISQ devices (Havlíček et al. 2019). The general idea is to project the data into the Hilbert space of a quantum computer using a quantum feature map. By calculating pair-wise inner products of data points, a kernel matrix can be calculated which can then be used in classical methods such as support vector machines or kernel ridge regression (Schölkopf et al. 2002; Vovk 2013). The expectation is that by encoding the data into a quantum Hilbert space, the feature map can be enriched with non-classical resources that provide an advantage compared to classical feature maps. This has already been demonstrated for tailored datasets (Liu et al. 2021; Huang et al. 2022).

While quantum versions of kernel machines like the support vector machine (Rebentrost et al. 2014; Havlíček et al. 2019) have been the focus of recent studies, quantum variants of probabilistic kernel methods have not received as much attention in the NISQ context. In this work, we use quantum kernels to create quantum Gaussian processes (QGP). Gaussian process (GP) models are popular machine learning methods based on Bayesian inference. GPs are specified by a covariance matrix which can be obtained by calculating the Gram matrix of a kernel function for a given dataset. Given their probabilistic nature, GPs have the desirable property of providing a variance for their predictions which allows uncertainty quantification.

Earlier investigations of QGPs include HHL-based algorithms which require fault-tolerant quantum computers (Chen et al. 2022; Zhao et al. 2019) and methods that rely on classically evaluated quantum kernels (Smith et al. 2023). Additionally, quantum approximations of classical kernels have been studied and have raised the question whether the variance information can be retained on noisy near-term devices (Otten et al. 2020). In this work, we present a NISQ-ready QGP model which uses a hardware-efficient, parameterized feature map. We show that this model can be used in practically relevant scenarios. To this end, we demonstrate that careful regularization of the Gram matrix helps preserve the variance and show how the overall performance can be improved with an end-to-end optimization of the marginal log-likelihood. We show the capabilities of the QGP model by using it as a surrogate model for Bayesian optimization (BO) (Archetti and Candelieri 2019), a task that critically relies on the variance information of the surrogate model. We benchmark the resulting quantum Bayesian optimization (QBO) against optimizations using a surrogate model based on conventional GPs and show that QBO can match their performance on the task of optimizing the multidimensional hyperparameters of a classical machine learning model. The hyperparameter optimization is performed on a regression task for a real-world dataset which evaluates the remaining value of used industrial machinery. Figure 1 gives an overview of the various components used in this work.

The manuscript is structured as follows. In Section 2, we provide an introduction to the fundamentals of QGPs by briefly discussing GP theory and exploring quantum kernels. Subsequently, we illustrate the concept of quantum BO using a QGP surrogate model. In Section 3, we demonstrate the versatility and effectiveness of QGP models through our analysis of a one-dimensional dataset, followed by their successful application in QBO for the purpose of minimizing a multidimensional function and identifying the optimal hyperparameters of a machine learning model. We present the results of our simulations, including those obtained from noiseless and sample-based experiments, as well as the outcomes from a real quantum computing backend.

2 Quantum Gaussian process regression

Gaussian process regression is a non-parametric Bayesian machine learning method (Rasmussen and Williams 2005). It can be used to solve a regression problem of the form

$$\begin{aligned} y = f(\varvec{x}) + \epsilon \,, \end{aligned}$$
(1)

where \(f(\varvec{x})\) is a data-generating function, with labels \(y \in \mathbb {R}\), observed data \(\varvec{x} \in \mathcal {X} \subset \mathbb {R}^d\) and independent zero-mean Gaussian noise \(\epsilon \sim \mathcal {N}(0,\sigma ^2)\). If f is a random function with a Gaussian prior distribution, then the function values can be taken as random variables that form a GP. We denote the GP as \(\mathcal{G}\mathcal{P}(m,k)\) with a mean function m and a covariance function k. Note that k is mathematically equivalent to a kernel function; we will therefore refer to it as a kernel in the following. A GP is a collection of random variables such that any finite subset is Gaussian distributed (Rasmussen and Williams 2005). Concretely, for a collection of data points \(X:= (\varvec{x}_1,\dots ,\varvec{x}_n),\varvec{x}_i \in \mathcal {X}\), the variables \(f(\varvec{x}_i)_{i=1}^n\) are jointly distributed according to a multivariate Gaussian distribution such that

$$\begin{aligned} f(\varvec{x}) \sim \mathcal {N}(m(\varvec{x}),k(\varvec{x}, \varvec{x'}))\,. \end{aligned}$$

GPs are thus distributions over functions specified by the covariance k (Dudley 2002).

To predict the values \(f_*\) of new data points \(X_*\) (test points), we can calculate the posterior distribution given X and \(X_*\)

$$\begin{aligned} p(f_* \vert X_*, X, f) = \mathcal {N}(f_* ; \mu _*, \Sigma _*)\,. \end{aligned}$$
(2)

GP regression thus not only yields a prediction for the mean \(\mu _*\) but also for the covariance \(\Sigma _*\). They are given by

$$\begin{aligned} \mu _*&= k_{X X_*}^T (k_{XX}+ \sigma ^2 \varvec{I})^{-1} f, \end{aligned}$$
(3)
$$\begin{aligned} \Sigma _*&= k_{X_* X_*} - k_{X X_*}^T (k_{XX}+ \sigma ^2 \varvec{I})^{-1} k_{XX_*}. \end{aligned}$$
(4)

The elements of the Gram matrices \(k_{XX}\), \(k_{XX_*}\) and \(k_{X_* X_*}\) are the pair-wise inner products of the training points, the training and test points, and the test points, respectively. Note that here we have assumed that we only have access to noisy labels as in Eq. 1. The variance of this noise can be explicitly taken into account in the calculation of the mean and the variance. This serves as an implicit regularization which often results in a better conditioned posterior covariance matrix.
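The posterior in Eqs. 3–4 can be evaluated directly from the three Gram matrices. The following is a minimal sketch, assuming precomputed NumPy Gram matrices and using a Cholesky factorization of the regularized training matrix; it illustrates the computation rather than the exact implementation used for the results below.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(K_xx, K_xs, K_ss, y, sigma2):
    """Posterior mean and covariance of Eqs. 3-4, computed from the Gram
    matrices k_XX, k_XX*, k_X*X* via a Cholesky factorization of
    K_xx + sigma^2 I."""
    n = K_xx.shape[0]
    chol = cho_factor(K_xx + sigma2 * np.eye(n), lower=True)
    alpha = cho_solve(chol, y)    # (k_XX + sigma^2 I)^{-1} f
    v = cho_solve(chol, K_xs)     # (k_XX + sigma^2 I)^{-1} k_XX*
    mu = K_xs.T @ alpha           # Eq. 3
    cov = K_ss - K_xs.T @ v       # Eq. 4
    return mu, cov
```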

Equations 2–4 show that the outcome of the GP is fully governed by the choice of the kernel. In general, a kernel is a positive definite function \(k: \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}\), which serves as a similarity measure between pairs of inputs \(\varvec{x}\) and \(\varvec{x'}\). Specifically, the kernel computes the inner product of the corresponding feature vectors \(\phi (\varvec{x})\) and \(\phi (\varvec{x'})\)

$$\begin{aligned} k(\varvec{x},\varvec{x'}) = \langle \phi (\varvec{x}),\phi (\varvec{x'}) \rangle _\mathcal {F}, \end{aligned}$$
(5)

in a potentially high-dimensional feature space \(\mathcal {F}\), where the feature map \(\phi (\varvec{x})\) is a non-linear map from the input space \(\mathcal {X}\) to the feature space \(\mathcal {F}\).

2.1 Quantum kernels

Kernels can be constructed by embedding data into the Hilbert space of a quantum system (Schuld and Killoran 2019; Havlíček et al. 2019) (see Fig. 1(a)). The resulting quantum state is

$$\begin{aligned} |\phi (\varvec{x};\varvec{\theta })\rangle = U(\varvec{x};\varvec{\theta })|{0}\rangle \,. \end{aligned}$$
(6)

The unitary operator \(U(\varvec{x};\varvec{\theta })\) implements the quantum feature map \(\phi \). It encodes the classical data point \(\varvec{x}\) into a quantum state. In principle, it can depend on additional parameters \(\varvec{\theta }\) that can be trained variationally (Hubregtsen et al. 2021). Using the feature map in Eq. 6, a quantum kernel can be defined in terms of the Hilbert-Schmidt inner product

$$\begin{aligned} k(\varvec{x},\varvec{x'}) = \text {Tr} \left[ \rho (\varvec{x}) \rho (\varvec{x'}) \right] , \end{aligned}$$
(7)

with the density matrix \(\rho (\varvec{x})=U(\varvec{x})|0\rangle \!\langle 0| U^\dag (\varvec{x})\). It can be shown that this definition results in a positive definite kernel (Schuld 2021). For pure states, Eq. 7 reduces to the overlap between the states encoding the data points such that in practice the kernel elements can be calculated by applying the feature map and its inverse to \(\varvec{x}\) and \(\varvec{x'}\) and measuring the occupation of the ground state

$$\begin{aligned} k(\varvec{x},\varvec{x'}) = |\langle \phi (\varvec{x'})|{\phi (\varvec{x})}\rangle |^2 = \big |\langle {0}| U (\varvec{x'})^\dagger U(\varvec{x})|0 \rangle \big |^2. \end{aligned}$$
(8)

Other methods, like the SWAP test, can also calculate the overlap between two states. However, the method above is more efficient in terms of the number of qubits (Hubregtsen et al. 2021).
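To make this concrete, the sketch below evaluates a single kernel element as in Eq. 8 by composing the feature map circuit for \(\varvec{x}\) with the inverse of the circuit for \(\varvec{x'}\) and reading out the probability of the all-zero state. The circuit is an illustrative hardware-efficient ansatz (not the exact feature map of Fig. 2), and the evaluation uses Qiskit's statevector simulation instead of hardware measurements.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def feature_map(x, thetas, q=4):
    """Illustrative hardware-efficient feature map: non-linear single-qubit
    data encoding, trainable rotations and a CNOT entangling layer."""
    qc = QuantumCircuit(q)
    for i in range(q):
        qc.ry(np.arccos(np.clip(x, -1.0, 1.0)), i)  # data encoding on [-1, 1]
        qc.rz(thetas[i], i)                          # trainable parameter theta_i
    for i in range(q - 1):
        qc.cx(i, i + 1)
    return qc

def quantum_kernel(x, x_prime, thetas):
    """Kernel element of Eq. 8: apply U(x) followed by U^dagger(x') and read
    out the probability of the all-zero state (noiseless statevector run)."""
    qc = feature_map(x, thetas).compose(feature_map(x_prime, thetas).inverse())
    return Statevector(qc).probabilities()[0]
```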

From Eq. 8 it becomes clear that the defining quantity for a quantum kernel k is the quantum feature map \(\phi \). The choice of an optimal embedding strategy is an open research question, and feature maps are therefore often chosen heuristically. Finally, to obtain a quantum GP, we substitute the quantum kernel of Eq. (7) into the GP posterior equations, in particular the variance in Eq. (4). This is illustrated in Fig. 1(a).

The variational parameters in Eq. 6 can be trained using various methods. Popular approaches for quantum kernel machines such as quantum support vector machines or quantum kernel ridge regression often optimize the kernel directly using, e.g., kernel alignment techniques (Hubregtsen et al. 2021; Kübler et al. 2021; Glick et al. 2022). In this work, we make use of the Bayesian framework of GPs and train the QGP model end-to-end by maximizing the marginal log-likelihood. Due to the Gaussian form of the posterior (cf. Eq. 2), the marginal log-likelihood can be given in closed form, up to an additive constant (Rasmussen and Williams 2005)

$$\begin{aligned} \log p(y\vert X) =&-\frac{1}{2}y^T (k_{XX}(\varvec{\theta })+\sigma ^2 I)^{-1} y \\&- \frac{1}{2} \log \det (k_{XX}(\varvec{\theta })+\sigma ^2 I)\,. \end{aligned}$$
(9)

Here, \(k_{XX}(\varvec{\theta })\) indicates the dependence of the kernel on the parameters \(\varvec{\theta }\) through the parameterized feature map. The optimization workflow is sketched in Fig. 1(a). Optimizing parameterized quantum circuits is an active area of research with open questions such as how to avoid barren plateaus during training (McClean et al. 2018).
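A minimal sketch of the training objective is shown below: it evaluates Eq. 9 (with the constant term omitted, as above) from a precomputed training Gram matrix, using a Cholesky factor for numerical stability. The function name and interface are illustrative.

```python
import numpy as np

def log_marginal_likelihood(K_xx, y, sigma2):
    """Marginal log-likelihood of Eq. 9 for a Gram matrix k_XX(theta),
    labels y and noise variance sigma^2 (constant term omitted)."""
    n = len(y)
    L = np.linalg.cholesky(K_xx + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log det(K + sigma^2 I) = 2 * sum(log(diag(L)))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
```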

In practice, the kernel elements in Eq. 8 can only be computed approximately because any observable has to be determined using a finite number of measurements. The resulting statistical error scales as \(\mathcal {O}(1/\sqrt{N})\), where N is the number of measurements. In addition, available NISQ devices suffer from a multitude of noise sources such as short coherence times, gate errors and cross-talk. As a result, the estimated kernel \(\tilde{k}\) deviates from the true kernel k. To ensure that \(\tilde{k}\) is positive definite, we need to apply regularization techniques. Taking the noise variance into account for noisy objective functions, as done in Eqs. 3–4, already serves as an inherent regularization. Nevertheless, for noiseless objective functions or for noisy estimates \(\tilde{k}\), this might not be sufficient to ensure positive definiteness. Therefore, we employ an eigenvalue-cutoff strategy, where the spectrum of the full Gram matrix is truncated at zero (Graepel et al. 1998). This requires a full eigenvalue decomposition of the Gram matrix followed by a reconstruction using the truncated spectrum and the original eigenvectors (Hubregtsen et al. 2021). This technique has already been shown to provide good results (Wang et al. 2021). Additionally, compared to other methods such as shifting the spectrum by the lowest eigenvalue, the truncation does not introduce a constant offset to the variance of the GP model, which is desirable for applications where the quantification of uncertainty is required. In general, the regularization of Gram matrices used for GP regression is problem-specific and non-trivial, even for classical kernels (Mohammadi et al. 2016).
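A minimal sketch of this eigenvalue-cutoff regularization, assuming a (nearly) symmetric noisy Gram matrix estimate:

```python
import numpy as np

def truncate_negative_eigenvalues(gram):
    """Eigenvalue cutoff: symmetrize the noisy Gram matrix estimate, clip
    negative eigenvalues to zero and rebuild the matrix from the original
    eigenvectors, restoring positive semi-definiteness."""
    gram_sym = 0.5 * (gram + gram.T)
    eigvals, eigvecs = np.linalg.eigh(gram_sym)
    return eigvecs @ np.diag(np.clip(eigvals, 0.0, None)) @ eigvecs.T
```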

In this work, we are interested in using QGP models as surrogate models in Bayesian optimization. This is explained in the next section and illustrated in Fig. 1(b).

2.2 Quantum Bayesian optimization

Fig. 2

Example of the hardware-efficient feature map with \(q=4\) qubits and \(l=1\) layers, inspired by a Chebyshev quantum feature map design (Kyriienko et al. 2021). The trainable parameters are denoted by \(\theta _i\) and the data points by x. For the results in this work, various values of q and l are used

Bayesian optimization (Garnett 2023) is a global optimization method that solves problems of the form

$$\begin{aligned} \varvec{x}^* = \text {arg} \min _{\varvec{x}} g(\varvec{x})\,. \end{aligned}$$
(10)

The optimization is performed iteratively, where the next sample is chosen using information obtained from previous iterations. Through this informed guidance, BO usually requires a modest number of samples, which makes it attractive for problems where the evaluation of g is expensive. BO treats g as a black box, so no further restrictions are placed on its functional form.

The algorithm is initialized by drawing a random sample and fitting a surrogate model as a proxy for g. The next sample is then chosen by considering an exploitation-exploration trade-off which is quantified by an acquisition function. This procedure is then repeated such that the surrogate model approximates the true function increasingly well. Due to their posterior variance output, GP models are popular choices as surrogates. A common choice for an acquisition function is the expected improvement (EI) (Archetti and Candelieri 2019), which measures the expectation of the improvement on the objective \(g(\varvec{x})\) with respect to the predictive distribution of the surrogate model. The EI function is given by

$$\begin{aligned} \textrm{EI}(\varvec{x}) = [g(\varvec{x}^+) - \mu (\varvec{x}) - \lambda ] \varvec{\Phi }(Z) + \Sigma (\varvec{x}) \varphi (Z)\,, \end{aligned}$$
(11)

and \(\textrm{EI}=0\) for \(\Sigma (\varvec{x}) = 0\). Here, \(\mu (\varvec{x})\) and \(\Sigma (\varvec{x})\) are the posterior mean prediction and the prediction uncertainty of the surrogate model at position \(\varvec{x}\), and \(\varphi (Z)\) and \(\varvec{\Phi }(Z)\) are the probability density function and the cumulative distribution function of the standard normal distribution, respectively. The location of the best sample, i.e., the current observed minimum of the surrogate model, is indicated by \(\varvec{x}^+\). The standardized prediction error Z is given by \(Z = [g(\varvec{x}^+) - \mu (\varvec{x}) - \lambda ]/\Sigma (\varvec{x})\) if \(\Sigma (\varvec{x}) > 0\) and \(Z=0\) if \(\Sigma (\varvec{x}) = 0\). The parameter \(\lambda \) in Eq. 11 is a hyperparameter that controls the exploitation-exploration trade-off, where a high value of \(\lambda \) favors exploration.
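For reference, a minimal sketch of Eq. 11 for a single candidate point; mu and sigma denote the surrogate's posterior mean and standard deviation, and g_best the best observed objective value.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, g_best, lam=0.1):
    """Expected improvement (Eq. 11) for minimization; returns 0 when the
    predicted uncertainty vanishes."""
    if sigma <= 0.0:
        return 0.0
    z = (g_best - mu - lam) / sigma
    return (g_best - mu - lam) * norm.cdf(z) + sigma * norm.pdf(z)
```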

We obtain a quantum Bayesian optimization (QBO) algorithm by using a QGP model as the surrogate model. This has the potential to enhance BO in scenarios where quantum kernels have an advantage over classical kernels. A possible drawback is that the exploitation-exploration trade-off, which depends on the model variance, is now influenced by quantum computing noise sources. To demonstrate the QBO's capabilities, we apply it to several test cases, which are presented in the next section.
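A compact sketch of the resulting QBO loop is given below. It combines the hypothetical helpers sketched above (quantum_kernel, truncate_negative_eigenvalues, gp_posterior, expected_improvement) and assumes, for simplicity, a finite set of scalar candidates inside the feature map's input domain; the actual implementation used in this work differs in its details.

```python
import numpy as np

def qbo_minimize(objective, candidates, thetas, sigma2=0.25,
                 n_init=3, n_iter=20, lam=0.1, seed=0):
    """Minimal QBO loop (Fig. 1(b)): refit a QGP surrogate on all evaluated
    points and pick the next candidate by maximizing expected improvement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(candidates), size=n_init, replace=False)
    X = [candidates[i] for i in idx]
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        # QGP surrogate: quantum Gram matrices, regularized by eigenvalue cutoff
        K_xx = truncate_negative_eigenvalues(
            np.array([[quantum_kernel(a, b, thetas) for b in X] for a in X]))
        K_xs = np.array([[quantum_kernel(a, c, thetas) for c in candidates] for a in X])
        K_ss = np.diag([quantum_kernel(c, c, thetas) for c in candidates])
        mu, cov = gp_posterior(K_xx, K_xs, K_ss, np.array(y), sigma2)
        std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # only the diagonal is needed
        ei = [expected_improvement(m, s, min(y), lam) for m, s in zip(mu, std)]
        x_next = candidates[int(np.argmax(ei))]
        X.append(x_next)
        y.append(objective(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]
```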

3 Results

We illustrate the capabilities of QGP models on a one-dimensional regression problem. We then demonstrate the feasibility of using QBO with a QGP surrogate model on two multidimensional optimization tasks. The quantum circuits for the QGP models are implemented using Qiskit (Qiskit Community 2017). The linear systems for the GPs are solved using a Cholesky decomposition of the Gram matrices. We validate the algorithm using numerical simulators provided by Qiskit. Results from real quantum computers are obtained from ibmq_montreal (Ibm 2021).

Fig. 3

QGP regression on a dataset created using Eq. 12 (black line). The results are obtained using the feature map in Fig. 2 with \(q=4\) qubits and \(l=2\) layers for the encoding, \(n_{\text {training}} = 23\) training points, shown as the blue crosses. The test points are marked by the red dots. The posterior mean of the QGP is shown as the red line and the standard deviation as the shaded area. (a) shows the result of the statevector simulation with optimized parameters, obtaining an \(R^2\) score of 0.996 and an \( \text {MSE} = 0.022 \). (b) shows the result of the sample-based simulation. We use the optimal parameters obtained in the previous ideal run, resulting in an \(R^2\) score of 0.996 and an \( \text {MSE} = 0.024 \). (c) shows the result of the real hardware run, using the ibmq_montreal backend, leading to an \(R^2\) score of 0.978 and an \( \text {MSE} = 0.114 \). All runs use the same parameters

3.1 Quantum Gaussian process regression

We apply QGP regression on a one-dimensional dataset where the data generating function (cf. Eq. 1) is

$$\begin{aligned} f(x) = x \sin (x)\,. \end{aligned}$$
(12)

We assume that only noisy labels y can be observed, with zero-mean Gaussian noise with a variance \(\sigma ^2= (0.1)^2\) (cf. Eq. 1). The \(n_{\text {training}} = 23\) training points are sampled from a uniform distribution in the interval \([0,2\pi ]\), and we sample \(n_{\text {test}} = 50\) equidistantly spaced test points.
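A minimal sketch of the data generation (the random seed is an arbitrary choice for reproducibility and not taken from this work):

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = rng.uniform(0.0, 2 * np.pi, size=23)                       # n_training = 23
y_train = x_train * np.sin(x_train) + rng.normal(0.0, 0.1, size=23)  # Eq. 12 plus noise
x_test = np.linspace(0.0, 2 * np.pi, num=50)                         # n_test = 50
```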

The quantum kernel is calculated using a hardware-efficient feature map with variational parameters \(\varvec{\theta }\) as depicted in Fig. 2 (Kreplin and Roth 2023). We encode the data using \(q=4\) qubits and \(l=2\) layers. To account for the limited domain of the non-linearity in the feature map, the labels y are scaled to the interval \([-1, 1]\).

To gauge the performance of the model under ideal conditions, we perform statevector simulations from which we obtain completely noiseless quantum kernels. The regression result can be seen in Fig. 3(a), where the mean prediction of the model is shown as a solid line and the standard deviation is depicted as a shaded area. Overall, the method achieves a good fit. The standard deviation obtained from the QGP variance behaves reasonably: it is low in areas with a high density of training points and high in areas where training points are lacking.

Although good results can already be achieved using a general feature map, e.g., by choosing the parameters \(\varvec{\theta }\) randomly (Haug et al. 2023; Jerbi et al. 2023), we adapt the kernel to the dataset by maximum likelihood optimization (cf. Eq. 9 and surrounding discussion), using an Adam optimizer with a learning rate of 0.1. The marginal log-likelihood as a function of optimization iterations can be seen in Fig. 4. In this example, the optimization reduces the mean squared error (MSE) by about an order of magnitude (from 0.3 (\(R^2=0.939\)) to 0.02 (\(R^2=0.996\))). We observe convergence of the marginal log-likelihood after \(\sim 80\) iterations. The specific optimization behavior depends on the chosen feature map design, such as the number of qubits, layers and variational parameters. We use the optimal parameters obtained from this ideal simulation for subsequent noisy simulations and calculations on real quantum computers.
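The sketch below illustrates this end-to-end training step: Adam (learning rate 0.1) ascends the marginal log-likelihood of Eq. 9 over the feature-map angles, here with a central finite-difference gradient as a simple stand-in for the gradient computation. It reuses the hypothetical quantum_kernel and log_marginal_likelihood sketches from Section 2.

```python
import numpy as np

def adam_maximize_mll(thetas0, X, y, sigma2=0.01, lr=0.1, n_steps=100, eps=1e-3):
    """Maximize the marginal log-likelihood (Eq. 9) over the feature-map
    angles theta with Adam, using finite differences for the gradient."""
    def mll(thetas):
        K = np.array([[quantum_kernel(a, b, thetas) for b in X] for a in X])
        return log_marginal_likelihood(K, np.asarray(y), sigma2)

    thetas = np.array(thetas0, dtype=float)
    m, v = np.zeros_like(thetas), np.zeros_like(thetas)
    beta1, beta2, delta = 0.9, 0.999, 1e-8
    for t in range(1, n_steps + 1):
        grad = np.zeros_like(thetas)
        for i in range(len(thetas)):
            shift = np.zeros_like(thetas)
            shift[i] = eps
            grad[i] = (mll(thetas + shift) - mll(thetas - shift)) / (2 * eps)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # bias-corrected Adam step, ascending the log-likelihood
        thetas += lr * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + delta)
    return thetas
```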

Any real quantum computation is ultimately affected by statistical errors. Figure 3(b) shows results of the same simulation as in Fig. 3(a) with sample-based estimation of the wavefunctions with a modest number of \(N=10,000\) measurements per evaluation point. These kinds of simulations are a good indicator of the future performance of the model in a regime with low hardware noise. Due to the statistical error in this simulation, the kernel is now only a noisy estimate \(\tilde{k}\) of the true kernel k. As can be seen in the figure, the performance of the model is only slightly worse compared to the ideal simulation (\(\text {MSE}=0.024\)). In particular, due to careful regularization of the Gram matrix, the variance information can be retained reasonably well. As discussed in Section 2.1, regularization is required to ensure the positive semi-definiteness of \(\tilde{k}\). A straightforward strategy would be to treat \(\sigma \) in the implicit regularization of the GP (Eqs. (3)–(4)) as a hyperparameter, a technique that is commonly used in kernel ridge regression. Increasing \(\sigma \) results in stronger regularization; however, this can lead to a loss of variance information, as can be seen in Appendix C. This can be especially hindering for tasks that rely on this variance information, such as BO, which requires an adequate variance to negotiate the exploration-exploitation trade-off, as discussed in the next section.

Fig. 4

Convergence plot of the log-likelihood loss function (cf. Eq. 9). The loss is evaluated entirely on the training data. The variable parameters of the optimization are the angles \(\varvec{\theta }\) in the feature map

We conclude this example by running the QGP regression on real quantum hardware using the ibmq_montreal device. The results are shown in Fig. 3(c). We use readout error mitigation (Nation et al. 2021) and dynamical decoupling (Ezzell et al. 2022) to mitigate the hardware errors. Compared to the simulations, the performance of the model decreases slightly, with the method obtaining an error of \(\text {MSE} = 0.114\) on the test data. Nevertheless, the mean prediction only marginally deviates from the true function. As expected, the regularization of the quantum kernel matrices has to be increased such that the overall standard deviation increases. Even so, the spatial variation of the predictive standard deviation can still be retained on the real quantum computer, such that one can clearly distinguish between areas of high and low uncertainty. This is a substantial improvement compared to previous results (Otten et al. 2020).

The quality of the solution and the posterior variance are dependent on the chosen quantum feature map. Appendix B shows results for the same dataset using a different feature map and a different quantum computer.

3.2 Quantum Bayesian optimization

We assess the QBO routine introduced in Section 2.2 by minimizing the two-dimensional Branin-Hoo function

$$\begin{aligned} f_{\text {bh}}(\varvec{x}) = a(x_2 -bx_1^2 + cx_1 - r)^2 + s(1-t) \cos (x_1) + s\,, \end{aligned}$$
(13)

where \(a\), \(b\), \(c\), \(r\), \(s\), and \(t\) are real parameters and \(x_1 \!\in \! [-5,10]\text {, } x_2 \in [0,15]\). We fix the parameters such that the function has three global minima (cf. caption of Fig. 5). We substitute Eq. (13) into Eq. (1) to generate data with zero-mean Gaussian noise with a variance of \(\sigma ^2 = (0.5)^2\).
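A sketch of the noisy objective, with the parameter values listed in the caption of Fig. 5:

```python
import numpy as np

def branin_hoo(x1, x2, rng=None, noise_std=0.5):
    """Branin-Hoo function of Eq. 13 with optional zero-mean Gaussian
    observation noise as in Eq. 1."""
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    value = a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s
    if rng is not None:
        value += rng.normal(0.0, noise_std)
    return value
```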

The hardware-efficient feature map illustrated in Fig. 2 is utilized for the QGP model which is used as a surrogate model for the QBO. We encode the two-dimensional input vector with \(q=4\) qubits which increases the model’s expressibility compared to a single encoding (Schuld et al. 2021). Every parameter \(\theta \) in the feature map is sampled uniformly from the interval \([0,2\pi ]\) and kept fixed for the duration of the optimization.

Figure 5(a) shows the results for statevector (red line) and sample-based simulations (blue line), where the optimization has been averaged over 25 runs. The resulting standard deviation of the respective simulations is depicted as shaded areas. It can be seen that the BO runs using kernels obtained from both the noiseless and the noisy simulations converge to the true minimum of the function. Especially for the sample-based simulation, this requires thoughtful regularization of the quantum Gram matrices. We compare the performance of the QBO routines to a classical BO with a GP using an RBF kernel. The RBF kernel is optimized in each iteration using maximum likelihood estimation. Despite this optimization, which is not used by the QBO, the classical and the quantum models perform comparably well.

Fig. 5

BO results averaged over independent runs with the mean shown as solid lines and the variance as shaded areas. The expected improvement (Eq. (11)) is used as acquisition function with an exploration-exploitation parameter of \(\lambda = 0.1\). The classical BO uses a GP surrogate model with an optimized RBF kernel (black line). The QBO results are obtained with the feature map in Fig. 2 using statevector (red line) and sample-based simulations (blue line). The initial samples for each individual run are the same for the quantum and classical BO for better comparison. At each iteration, only the best current result is shown. (a) shows the result for the minimization of Eq. (13) where the parameters are fixed at \(a=1\text {, }b=5.1/(4\pi ^2)\text {, }c = 5/\pi \text {, }r=6\text {, }s=10 \text { and } t=1/(8\pi )\). The feature map for the QBO uses \(q=4\) qubits and \(l=2\) layers. The results are averaged over 25 runs. (b) shows the result of the hyperparameter optimization of a gradient boosting model on an industrial dataset. The average result of ten iterations of random search runs is shown (green, solid). The kernel is calculated using \(q=10\) qubits and \(l=2\) layers

To demonstrate the applicability of QBO to a real-world scenario, we use the algorithm to optimize the hyperparameters \(\xi \) of a gradient boosting model \(h(\varvec{x},\xi )\) (Chen and Guestrin 2016) that is applied to a regression task, as illustrated in Fig. 1(c). The gradient boosting model is used to predict the price of industrial machinery with respect to different machine types, specifications, and the number of working hours. The dataset has been analyzed in detail in Stühler et al. (2023). In total, the dataset contains 2910 data points, and the one-hot encoding of the categorical features leads to 65 features in total. Further details are given in Appendix A. For the optimization, we fix the categorical hyperparameters of the gradient boosting model and only optimize the five continuous hyperparameters (cf. Table 1). The objective function for the QBO is the cross-validated mean absolute error (MAE) of the gradient boosting model on the training dataset for a given set of hyperparameters.

Table 1 Hyperparameter space of the gradient boosting model
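A sketch of such an objective function is given below, using XGBoost's scikit-learn interface and five-fold cross-validation. The specific hyperparameter names are illustrative placeholders; the actual search space is defined in Table 1.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def qbo_objective(xi, X_train, y_train):
    """Cross-validated MAE of a gradient boosting regressor for one set of
    continuous hyperparameters xi (illustrative choice of parameters)."""
    learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda = xi
    model = XGBRegressor(
        learning_rate=learning_rate,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        reg_alpha=reg_alpha,
        reg_lambda=reg_lambda,
    )
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_mean_absolute_error", cv=5)
    return -scores.mean()   # QBO minimizes the MAE
```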

We encode the five-dimensional hyperparameter vector with the feature map in Fig. 2 using \(q=10\) qubits and \(l=2\) layers. Figure 5(b) depicts the results for the different BO runs. Additionally, a random search is shown for comparison. As in the previous example, the QBO results are compared to a BO with a classical GP with an optimized RBF kernel. It can be seen that the results of the QBO are on par with the results of the classical BO. This holds for both the statevector and the sample-based simulations. As expected, all BO approaches outperform the random search on average.

4 Discussion

In this study, we apply QGP models to one-dimensional and multidimensional regression problems and show that they can be used as a surrogate model for BO to create a QBO. We demonstrate that QBO can be used to solve real-world hyperparameter optimization problems. Our encoding strategy allows for effectively using the variational parameters of the data embedding circuit as hyperparameters for the quantum kernels. In our simulations, we observe that the posterior variance of the QGP remains intact under the influence of sampling noise and even for the calculation on NISQ devices, although the various error sources of the latter affect the result. Nevertheless, since the results from the sample-based simulations can be seen as an upper bound for future hardware capabilities, the outlook is optimistic.

Although we demonstrate the feasibility of using QBO to optimize hyperparameters of a machine learning model, the potential benefits of employing quantum kernels over classical machine learning methods in tasks using classical data remain uncertain (Chia et al. 2020). However, it is reasonable to expect that QBO may provide advantages in problems where quantum data can be leveraged to achieve a quantum advantage (Huang et al. 2022). Notably, QBO is potentially well-suited for active learning tasks in expensive molecular simulations, where the evaluation of the potential energy surface is based on quantum mechanics and is computationally expensive (Denzel and Kästner 2018a, b).

Several aspects of the performance of the QGP model remain unexplored in this work. For example, the choice of feature map is a crucial aspect, and it has been shown that choosing a problem-specific feature map with an inductive bias that is tailored to the dataset has various advantages such as improved performance and trainability (Kübler et al. 2021; Cerezo et al. 2022). It is also known that using parameterized feature maps requires special care when scaling the number of qubits, which can otherwise lead to exponential concentration (Thanasilp et al. 2022).

Moreover, in this work, we use fidelity-based kernels for the QGP. These have an unfavorable quadratic scaling with the size of the dataset, as the pair-wise inner products of the data points have to be calculated. An alternative approach would be to use projected quantum kernels as proposed in Huang et al. (2021), which not only have a linear scaling but are also thought to have beneficial properties when the dimension of the feature space increases significantly. These alternative kernels could easily be integrated into the QGP and analyzed in future studies.

While the QGP models presented in this work feature a quantum calculation of the kernel, the majority of their operations are performed classically. However, there is potential for further improvements by creating a fully quantum QGP with a quantum kernel and employing HHL-based inversion of the covariance matrix (Harrow et al. 2009; Zhao et al. 2019). Such an approach could leverage the benefits of both quantum kernels and quantum linear algebra subroutines, which would help overcome the unfavorable scaling of today's GP models with the size of the dataset.