1 Introduction

Quantum computers have the potential to efficiently solve computational problems believed to be intractable for classical computers, such as factoring (Shor, 1999) or simulating quantum systems (Georgescu et al., 2014). In the current noisy intermediate-scale quantum era (Preskill, 2018), fault-tolerant quantum algorithms face many limitations (e.g., the limited number of qubits and decoherence). Consequently, hybrid quantum-classical algorithms were introduced to target practical applications under these constraints, such as chemistry (Moll, 2018), combinatorial optimization (Farhi et al., 2014), and machine learning (Benedetti et al., 2019). In exceptional cases, quantum models can exhibit exponential advantages over classical ones (Jerbi et al., 2021; Liu et al., 2021; Sajjan et al., 2022; Sweke et al., 2021). More theoretical works also study these models from a generalization perspective (Caro et al., 2021). Quantum neural networks are quantum circuits with adjustable parameters that have been used to tackle regression (Mitarai et al., 2018), classification (Havlíček et al., 2019), generative adversarial learning (Zoufal et al., 2019), and reinforcement learning tasks (Jerbi et al., 2021; Skolik et al., 2022).

However, the value of such quantum models has yet to be investigated in a larger-scale, methodical fashion and on real-world datasets (Haug et al., 2023; Peters et al., 2021). Presently, standard practices from machine learning, such as large-scale benchmarking and hyperparameter importance analysis, remain challenging to apply in the quantum community (Schuld & Killoran, 2022). The numerous ways to design quantum circuits for machine learning tasks call for hyperparameter optimization and other methods from the field of automated machine learning (Hutter et al., 2019). However, efficient hyperparameter optimization requires more intuition on which quantum machine learning hyperparameters are important to optimize and which matter less (Brazdil et al., 2022; Feurer et al., 2020; Mohr & van Rijn, 2022).

To address this question, we employ functional ANOVA (Hutter et al., 2014; Sobol, 1993), a statistical framework for assessing hyperparameter importance. To obtain more general results, we follow and extend the methodologies of van Rijn and Hutter (2018) and Sharma et al. (2019), who employed functional ANOVA across datasets. Our work distinguishes itself from previous work by applying this methodology to the challenging case of simulated quantum neural networks, which come at a significantly increased computational cost compared to conventional models.

We selected a subset of low-dimensional datasets from the OpenML-CC18 benchmark (Bischl et al., 2021) that matches the current scale of quantum hardware simulations. We defined a configuration space consisting of ten hyperparameters identified through a literature study. We then applied the functional ANOVA framework across datasets and extended it with a critical verification step of the internal surrogate models. This results in a ranking of the importance of hyperparameters across datasets. We also performed an extensive experiment to verify whether the importance ranking of hyperparameters holds in practice. Our main findings align with existing knowledge and reveal new insights. For instance, setting the learning rate well is found to be the most critical hyperparameter, whereas the particular choice of entangling gates is identified as the least important on all except one dataset.

Finally, we demonstrate the usefulness of our insights on hyperparameter importance within a hyperparameter optimization procedure. Following the methodology of van Rijn and Hutter (2018), we learned data-driven priors for each hyperparameter based on values that achieve good performance on previously seen datasets. We utilize these priors in the hyperparameter optimization method hyperband and compare it to the original version of hyperband, which samples hyperparameter configurations uniformly across the input space. We show that the data-driven priors improve performance by \(0.53\%\) on average, and by up to \(6.11\%\). Extending the work of van Rijn and Hutter (2018), we also demonstrate that such improvements hold on average regardless of the configuration of hyperband. Indeed, across the various instantiations of hyperband considered in our study, we improve over the original version of hyperband in \(77.2\%\) of cases when sampling according to the data-driven priors.

This paper is an invited extended version of the original work (Moussa et al., 2022), including additional methodology and experiments on the data-driven priors for quantum neural networks. We make all of our results and experimental scripts publicly available.

2 Background

This section introduces the necessary background on functional ANOVA, quantum computing, and quantum circuits with adjustable parameters for supervised learning.

2.1 Functional ANOVA

When applying a new machine learning algorithm to solve a specific task, it is not known a priori which hyperparameters to optimize, which ranges to sample them from, and which values within these user-defined ranges yield high performance. Several techniques exist that assess hyperparameter importance, such as forward selection (Hutter et al., 2013), ablation analysis (Biedenkapp et al., 2017), local parameter importance (Biedenkapp et al., 2019), and functional ANOVA (Hutter et al., 2014; Saltelli & Sobol, 1995). These techniques are typically used as a post-hoc procedure, stating which hyperparameters were most influential after executing the hyperparameter optimization process. The work of van Rijn and Hutter (2018) showed that these findings can generalize across datasets.

We first introduce the relevant notation, based on the work by Hutter et al. (2014).

Let A be a machine learning algorithm that has n hyperparameters with domains \(\Theta _1, \ldots , \Theta _n\) and configuration space \(\varvec{\Theta }= \Theta _1 \times \ldots \times \Theta _n\). An instantiation of A is a vector \(\varvec{\theta }= \{ \theta _1, \ldots , \theta _n \}\) with \(\theta _i \in \Theta _i\), for all \(i \in [n] = \{1, \dots , n\}\) (this is also called a configuration of A). A partial instantiation of A is a vector \(\varvec{\theta }_U = \{ \theta _{i_1}, \ldots , \theta _{i_k} \}\) with a subset \(U=\{i_1, \ldots , i_k\} \subseteq N = [n]\) of the hyperparameters fixed, and the values for other hyperparameters unspecified. Note that \(\varvec{\theta }_N = \varvec{\theta }\).

Functional ANOVA is based on the concept of a marginal of a hyperparameter, i.e., how a given value for a hyperparameter performs, averaging over all possible combinations of the other hyperparameters’ values. The marginal performance \(\hat{a}_U(\varvec{\theta }_U)\) is described as the average performance of all complete instantiations \(\varvec{\theta }\) that have the same values for hyperparameters that are in \(\varvec{\theta }_U\). As an illustration, Fig. 1 shows marginals for two hyperparameters of a quantum neural network and their union (see also Fig. 9 in Appendix A). As the number of terms to consider for the marginal can be very large, the authors of Hutter et al. (2014) used tree-based surrogate regression models to calculate the average performance efficiently. Surrogate models operate on the meta-level: they take as input the hyperparameter values of a certain configuration, and map this to a given performance score of this configuration (Eggensperger et al., 2015). As such, it can be used to assess the scores of configurations that we have not observed directly, serving as a surrogate for running the actual model and determining its performance. Such a model yields predictions \(\hat{y}\) for the performance measure p of arbitrary hyperparameter settings.

Functional ANOVA determines how much each hyperparameter (and each combination of hyperparameters) contributes to the variance of \(\hat{y}\) across the algorithm’s hyperparameter space \(\varvec{\Theta }\), denoted \(\mathbb {V}\). Intuitively, it assumes that a hyperparameter is highly important to the performance measure if its marginal has a high variance, and vice versa. Because it marginalizes over all possible hyperparameter values and combinations, it gives a global overview of the important hyperparameters (as opposed to, for example, ablation analysis, which gives a more local view) (Biedenkapp et al., 2018). Functional ANOVA has been used for studying the importance of hyperparameters of standard machine learning models such as support vector machines, random forests, Adaboost, and residual neural networks (van Rijn & Hutter, 2018; Sharma et al., 2019). We refer to Hutter et al. (2014) for a complete description.
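To make the notions of marginal and variance contribution concrete, the following minimal Python sketch estimates them with a random-forest surrogate on synthetic data. It is an illustration only: the hyperparameter names, ranges, and toy performance function are assumptions made for the example, and the actual functional ANOVA implementation decomposes the variance exactly over the partitions of the trees rather than by Monte-Carlo sampling.

```python
# Illustrative sketch: marginal of a hyperparameter and its variance
# contribution, estimated with a random-forest surrogate (toy data only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
low, high = np.array([-5.0, 1.0]), np.array([0.0, 10.0])  # (log10 lr, depth)

# Toy observations: configurations and their (synthetic) validation accuracy.
X = rng.uniform(low, high, size=(1000, 2))
y = 0.9 - 0.1 * (X[:, 0] + 3.0) ** 2 + 0.01 * X[:, 1] + rng.normal(0, 0.02, 1000)

surrogate = RandomForestRegressor(n_estimators=128, random_state=0).fit(X, y)

def marginal(dim, value, n_samples=2000):
    """Average prediction with hyperparameter `dim` fixed to `value` and the
    other hyperparameters sampled uniformly over their ranges."""
    samples = rng.uniform(low, high, size=(n_samples, 2))
    samples[:, dim] = value
    return surrogate.predict(samples).mean()

# Variance of the marginal over a grid of values, as a crude importance score.
grid = np.linspace(low[0], high[0], 20)
marginals = np.array([marginal(0, v) for v in grid])
total_variance = surrogate.predict(rng.uniform(low, high, (5000, 2))).var()
print("learning-rate variance fraction ~", marginals.var() / total_variance)
```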

Fig. 1

Marginals for a quantum neural network with validation accuracy as performance on the banknote-authentication dataset. Marginal plots of other datasets are shown in Fig. 9 in Appendix A. The hyperparameters correspond to the number of layers, also known as depth (a), the learning rate used during training (b), and their combination (c). The hyperparameter values for the learning rate are on a log scale. When considered individually, we see, for instance, that depth and learning rate should be set to reasonable values for better performance. However, when grouped together, the learning rate appears most influential

2.2 Supervised learning with parameterized quantum circuits

2.2.1 Basics of quantum computing

In quantum computing, computations are performed by manipulating qubits, similar to classical computing with bits. A system of n qubits is represented by a \(2^n\)-dimensional complex vector in the Hilbert space \(\mathcal {H}=(\mathbb {C}^2)^{\otimes n}\). This complex vector describes the state of the system \({\left| {\psi }\right\rangle } \in \mathcal {H}\) and is of unit norm \(\langle {\psi }|{\psi }\rangle =1\). The bra-ket notation is used to describe vectors \({\left| {\psi }\right\rangle }\), their conjugate transpose \({\left\langle {\psi }\right| }\) and inner-products \(\langle {\psi }|{\psi '}\rangle\) in \(\mathcal {H}\). Single-qubit computational basis states are given by \({\left| {0}\right\rangle }=(1,0)^T, {\left| {1}\right\rangle }=(0,1)^T\), and their tensor products describe general computational basis states, e.g., \({\left| {10}\right\rangle } = {\left| {1}\right\rangle }\otimes {\left| {0}\right\rangle } = (0,0,1,0)^T\).
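As a small illustration of this notation (not part of our experimental code), computational basis states and their tensor products can be written as plain complex vectors:

```python
# Illustrative sketch: basis states, tensor products, and the unit-norm check.
import numpy as np

ket0 = np.array([1, 0], dtype=complex)   # |0>
ket1 = np.array([0, 1], dtype=complex)   # |1>

ket10 = np.kron(ket1, ket0)              # |10> = |1> tensor |0> = (0, 0, 1, 0)^T
print(ket10)

psi = (ket0 + ket1) / np.sqrt(2)         # a single-qubit superposition state
print(np.vdot(psi, psi).real)            # <psi|psi> = 1.0, i.e., unit norm
```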

The quantum state is modified with unitary operations or gates U acting on \(\mathcal {H}\). This computation can be represented by a quantum circuit (see Fig. 2). When a gate U acts non-trivially only on a subset \(S \subseteq [n]\) of qubits, we denote such an operation by \(U\otimes \mathbbm {1}_{[n]\backslash S}\). In this work, we use the Hadamard gate H, the single-qubit Pauli gates X, Y, Z and their associated rotation gates \(R_X, R_Y, R_Z\):

$$\begin{aligned} \begin{gathered} H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad R_Z(w) = \exp \left( -i \frac{w}{2} Z\right) ,\\ Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad R_Y(w) = \exp \left( -i \frac{w}{2} Y\right) , \quad X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad R_X(w) = \exp \left( -i \frac{w}{2} X\right) . \end{gathered} \end{aligned}$$
(1)

The rotation angles are denoted by \(w \in \mathbb {R}\) (typically restricted to the range \([-\pi , \pi ]\)). The matrix forms of the 2-qubit controlled-Z gate and the \(\sqrt{\text {iSWAP}}\) gate (also denoted sqiswap) are given by

$$\begin{aligned} \text {CZ} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}, \qquad \sqrt{\text {iSWAP}} = \frac{1}{\sqrt{2}} \begin{pmatrix} \sqrt{2} & 0 & 0 & 0 \\ 0 & 1 & i & 0 \\ 0 & i & 1 & 0 \\ 0 & 0 & 0 & \sqrt{2} \end{pmatrix}. \end{aligned}$$
(2)

Measurements are carried out at the end of a quantum circuit to obtain bitstrings. Such a measurement operation is described by a Hermitian operator O called an observable. Its spectral decomposition \(O=\sum _m \lambda _m P_m\) in terms of eigenvalues \(\lambda _m\) and orthogonal projections \(P_m\) defines the outcomes of this measurement, according to the Born rule: a measured state \({\left| {\psi }\right\rangle }\) gives the outcome \(\lambda _m\) and gets projected onto the state \(P_m {\left| {\psi }\right\rangle } / \sqrt{p(m)}\) with probability \(p(m) = {\left\langle {\psi }\right| } P_m {\left| {\psi }\right\rangle } = \langle {P_m}\rangle _\psi\). The expectation value of the observable O with respect to \({\left| {\psi }\right\rangle }\) is \(\mathbb {E}_\psi [O] = \sum _m p(m) \lambda _m = \langle {O}\rangle _{\psi }\). We refer to Nielsen and Chuang (2011) for more basic concepts of quantum computing and follow with parameterized quantum circuits.
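The following NumPy sketch illustrates the Born rule and the expectation value for the single-qubit observable Z measured on the state \(H{\left| {0}\right\rangle }\); it is purely illustrative and uses only the definitions above.

```python
# Illustrative sketch: Born-rule probabilities and the expectation value <Z>.
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

psi = H @ np.array([1, 0], dtype=complex)    # |psi> = H|0>

# Spectral decomposition of Z: eigenvalues +1/-1 with projectors |0><0|, |1><1|.
P = {+1: np.diag([1, 0]).astype(complex), -1: np.diag([0, 1]).astype(complex)}

probs = {lam: np.vdot(psi, Pm @ psi).real for lam, Pm in P.items()}
expectation = sum(lam * p for lam, p in probs.items())

print(probs)                                    # {1: 0.5, -1: 0.5}
print(expectation, np.vdot(psi, Z @ psi).real)  # both equal 0.0
```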

2.2.2 Parameterized quantum circuits

A parameterized quantum circuit (also called ansatz) can be represented by a quantum circuit with adjustable real-valued parameters \({\textbf {w}}\). A unitary \(U({\textbf {w}})\) then defines the quantum circuit by acting on a fixed n-qubit state (e.g., \({\left| {0^{\otimes n}}\right\rangle }\)). The ansatz may be constructed by exploiting the nature of the problem (typically the case in chemistry (Moll, 2018) or optimization (Farhi et al., 2014)) or with a problem-independent generic construction. The latter is often designated as a hardware-efficient ansatz.

For a machine learning task, the unitary \(U({\textbf {w}})\) encodes an input data instance \(x \in \mathbb {R}^d\) and is parameterized by a trainable vector \({\textbf {w}}\). Many designs exist, but a hardware-efficient parameterized quantum circuit (Kandala et al., 2017) with an alternating layered architecture is often considered in quantum machine learning when no information on the structure of the data or the problem is known. This architecture is depicted in the example presented in Fig. 2 and essentially consists of an alternation of encoding unitaries \(U_\text {enc}\) and variational unitaries \(U_\text {var}\). In the example, \(U_\text {enc}\) is composed of single-qubit rotations \(R_X\), and \(U_\text {var}\) of single-qubit rotations \(R_Y, R_Z\) and entangling Ctrl-Z gates, represented as in Fig. 2, forming the entangling part of the circuit. Such an entangling part, denoted \(U_\text {ent}\), is defined by the connectivity between qubits.

These parameterized quantum circuits are similar to neural networks in that the circuit architecture is fixed and the gate parameters are optimized by a classical optimizer such as gradient descent. Hence, they have also been termed quantum neural networks. The parameterized layer can be repeated multiple times, which increases the expressive power of the model, as for neural networks (Sim et al., 2019). The data encoding strategy (such as reusing the encoding layer multiple times in the circuit, a strategy called data reuploading) also influences this expressive power (Pérez-Salinas et al., 2020; Schuld et al., 2021).
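For concreteness, the sketch below, assuming the Cirq library, builds an alternating layered ansatz in the style of Fig. 2 (an \(R_X\) encoding layer, Ctrl-Z entanglers on a ring, and trainable \(R_Y\)/\(R_Z\) rotations), with the encoding layer reuploaded in each repetition; the symbol names and the number of layers are illustrative choices, not the exact circuits used in our experiments.

```python
# Hedged sketch of a 4-qubit hardware-efficient ansatz with data reuploading.
import cirq
import sympy

n_qubits, depth = 4, 2
qubits = cirq.LineQubit.range(n_qubits)
x = sympy.symbols(f"x0:{n_qubits}")               # input-feature placeholders
w = sympy.symbols(f"w0:{2 * n_qubits * depth}")   # trainable rotation angles

circuit = cirq.Circuit()
for layer in range(depth):
    # Encoding layer U_enc: one R_X rotation per feature (reuploaded each layer).
    circuit += [cirq.rx(x[i]).on(qubits[i]) for i in range(n_qubits)]
    # Entangling part U_ent: Ctrl-Z gates on a ring connectivity.
    circuit += [cirq.CZ(qubits[i], qubits[(i + 1) % n_qubits]) for i in range(n_qubits)]
    # Variational layer U_var: trainable R_Y and R_Z rotations.
    base = 2 * n_qubits * layer
    circuit += [cirq.ry(w[base + i]).on(qubits[i]) for i in range(n_qubits)]
    circuit += [cirq.rz(w[base + n_qubits + i]).on(qubits[i]) for i in range(n_qubits)]

print(circuit)
```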

The user can define the observable(s) and the post-processing method to convert the circuit outputs into a prediction in the case of supervised learning. In practice, observables based on the single-qubit Z operator are used. When applied on \(m \le n\) qubits, the observable is represented by a \(2^m \times 2^m\) square diagonal matrix with \(\{-1,1\}\) values and is denoted \(\mathcal {O} = Z \otimes Z \otimes \cdots \otimes Z\).

Having introduced parameterized quantum circuits, we present the hyperparameters of the models, the configuration space, and the experimental setup for our functional ANOVA-based hyperparameter importance study.

Fig. 2

Parameterized quantum circuit architecture example with 4 qubits and ring connectivity (qubit 1 is connected to 2, 2 to 3, 3 to 4, and 4 to 1, forming a ring). The first layer of \(R_X\) is the encoding layer \(U_\text {enc}\), taking a data instance \(x \in \mathbb {R}^4\) as input. It is followed by the entangling part with Ctrl-Z gates. Finally, a variational layer \(U_\text {var}\) is applied. At the end, measurements are performed and converted into predictions for the supervised task. The dashed part can be repeated many times to increase the expressive power of the model

2.2.3 Related works

To apply functional ANOVA for assessing hyperparameter importance, we performed an extensive literature review on parameterized quantum circuits for machine learning (Benedetti et al., 2019; Havlíček et al., 2019; Heimann et al., 2022; Jerbi et al., 2023, 2021; Liu & Wang, 2018; Marshall et al., 2022; Mensa et al., 2022; Mitarai et al., 2018; Peters et al., 2021; Schetakis et al., 2021; Skolik et al., 2022; Wang et al., 2022a, b; Wossnig, 2021; Zoufal et al., 2019) as well as quantum machine learning software (ANIS et al., 2021; Bergholm et al., 2018; Broughton, 2020). This resulted in the set of hyperparameters and configuration space presented in Section 3.1. Several works also study the performance of quantum neural networks on binary classification tasks, but mostly for benchmarking purposes (Mathur et al., 2021; Schetakis et al., 2021). Concerning hyperparameter optimization, several directions are taken, from quantum architecture search (Du et al., 2022; Zhang et al., 2022b), to developing hyperparameter optimization techniques (Sagingalieva et al., 2022), to applying concepts from automated machine learning to quantum models (Gómez et al., 2022). In our case, we use insights from the hyperparameter importance study to steer a hyperparameter optimization procedure, which can be considered an automated machine learning concept.

3 Methods

In this section, we describe the model with its hyperparameters and define our methodology.

3.1 Hyperparameters, configuration space and simulations

Many parameterized quantum circuit designs have been introduced, motivated by specific research questions and contributions or by the problem being addressed. We first carried out an extensive literature review on parameterized quantum circuits for machine learning (Benedetti et al., 2019; Havlíček et al., 2019; Heimann et al., 2022; Jerbi et al., 2023, 2021; Liu & Wang, 2018; Marshall et al., 2022; Mensa et al., 2022; Mitarai et al., 2018; Peters et al., 2021; Schetakis et al., 2021; Skolik et al., 2022; Wang et al., 2022a, b; Wossnig, 2021; Zoufal et al., 2019) as well as quantum machine learning software (ANIS et al., 2021; Bergholm et al., 2018; Broughton, 2020). We aggregated and translated the design choices into a set of hyperparameters and a configuration space, resulting in a list of 10 hyperparameters, presented in Table 1. We chose them to balance well-known hyperparameters that are expected to be essential with ones that receive less consideration in the literature. For instance, many works use Adam (Kingma & Ba, 2015) as the optimization algorithm, whose learning rate usually needs to be set well. In contrast, the choice of the entangling gate is generally fixed.

From the literature, we expect the data encoding strategy/circuit to be essential. We set two main forms for \(U_\text {enc}\). The first one is the hardware-efficient \(\bigotimes _{i=1}^{n} R_X(x_i)\). The second form, from (Bergholm et al., 2018; Jerbi et al., 2023; Havlíček et al., 2019), translates to an instantaneous quantum polynomial (IQP) circuit and is formulated as:

$$\begin{aligned} U_\text {enc}(\varvec{x}) = U_z(\varvec{x})\, H^{\otimes n}, \end{aligned}$$
(3)
$$\begin{aligned} U_z(\varvec{x}) = \exp \left( -\text {i}\pi \left[ \sum _{i=1}^{n} x_i Z_i + \sum _{1 \le i < j \le n} x_i x_j Z_i Z_j\right] \right) . \end{aligned}$$
(4)
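As an illustration, the following Cirq sketch constructs the encoding circuit of Eqs. (3)-(4) for a concrete feature vector. The decomposition of the two-qubit exponentials into CNOT and \(R_Z\) gates and the chosen feature values are assumptions made for the example, not necessarily the compilation used in our experiments.

```python
# Hedged sketch of the IQP-style encoding U_enc(x) = U_z(x) H^{tensor n}.
# exp(-i*pi*x_i Z_i) is realized as R_Z(2*pi*x_i); exp(-i*pi*x_i*x_j Z_i Z_j)
# is realized as CNOT . R_Z(2*pi*x_i*x_j) on the target . CNOT.
import numpy as np
import cirq

x = np.array([0.3, -0.1, 0.7, 0.2])        # illustrative (normalized) features
n = len(x)
qubits = cirq.LineQubit.range(n)

ops = [cirq.H(q) for q in qubits]           # Hadamard on every qubit
ops += [cirq.rz(2 * np.pi * x[i]).on(qubits[i]) for i in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        ops += [
            cirq.CNOT(qubits[i], qubits[j]),
            cirq.rz(2 * np.pi * x[i] * x[j]).on(qubits[j]),
            cirq.CNOT(qubits[i], qubits[j]),
        ]

u_enc = cirq.Circuit(ops)
print(u_enc)
```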

We can also make a model more expressive using data reuploading (Pérez-Salinas et al., 2020; Schuld et al., 2021; Jerbi et al., 2021; Skolik et al., 2022) or by pre-processing the input (Schuld et al., 2021) (sometimes done in encoding strategies where input features are fed into Pauli rotations). In this work, as a possible pre-processing technique, we choose the common activation function \(\tanh\). Its range is \([-1,1]\), similar to the range of the data features during training after the normalization step.

The list of hyperparameters we take into account is non-exhaustive. Therefore, it can be extended at will, at the cost of more software engineering and a budget for running experiments. All the quantum circuits were implemented using the Cirq library (Google: Cirq, 2018) and numerically simulated with the TensorFlow Quantum library (Broughton, 2020) using an exact statevector simulator (without quantum noise). The estimated total compute time for all the simulations performed was \(31\,000\) CPU-hours.

Table 1 List of hyperparameters considered for hyperparameter importance for a quantum neural network, as we named them in our TensorFlow Quantum code

3.2 Assessing hyperparameter importance

Having set the previous list of hyperparameters and the configuration space, we perform the hyperparameter importance analysis using functional ANOVA. First, we sample various random configurations per dataset and apply the models to measure their performance according to a metric we are interested in, in this case predictive accuracy. The sampled configurations and performances are used to train a tree-based surrogate model (i.e., a random forest (Breiman, 2001)) that can map any configuration to the performance measure. Each hyperparameter value is given as input, and the output is predictive accuracy. On top of the hyperparameters, we can also add the number of epochs to make the surrogate aware of specific budgets.

Next, we verify the performance of the surrogate models. We utilize regression metrics commonly used in surrogate benchmarks (Eggensperger et al., 2015) to discard datasets where surrogates perform poorly, as they can deteriorate the quality of the results. Per dataset, the R2 score is calculated over the actual performance per configuration versus the predicted performance per configuration. If this score is too low (below 0.75), the surrogate model is not accurate enough, and the dataset is excluded from further analysis. Finally, we obtain the marginal contribution of each hyperparameter across all datasets, which can be used to infer a ranking of their general importance. A verification step similar to that of van Rijn and Hutter (2018) is carried out to confirm the inferred ranking. We explain this procedure in the following section.
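The sketch below illustrates this verification step with scikit-learn; the configuration encoding and the synthetic accuracies are placeholders rather than our experimental data.

```python
# Illustrative sketch: fit the random-forest surrogate on (configuration,
# accuracy) pairs of one dataset and exclude the dataset if the 10-fold
# cross-validated R2 score falls below 0.75.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
configs = rng.uniform(size=(1000, 10))          # 1000 configurations, 10 hyperparameters
accuracy = (0.6 + 0.3 * configs[:, 0] - 0.1 * configs[:, 1] ** 2
            + rng.normal(0, 0.01, 1000))        # synthetic 10-fold CV accuracies

surrogate = RandomForestRegressor(n_estimators=128, random_state=0)
r2_scores = cross_val_score(surrogate, configs, accuracy, cv=10, scoring="r2")

if r2_scores.mean() < 0.75:
    print("surrogate too inaccurate -> exclude this dataset from the analysis")
else:
    surrogate.fit(configs, accuracy)            # keep the surrogate for functional ANOVA
```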

3.3 Verifying hyperparameter importance

Fig. 3

Performances of \(1\,000\) quantum machine learning models defined by different configurations of hyperparameters over each dataset. The metric of interest we use is the 10-fold cross-validation accuracy. We take the best-achieved metric per model trained over 100 epochs

In order to verify the hyperparameter importance ranking, the authors of van Rijn and Hutter (2018) proposed a random search procedure in which one hyperparameter is fixed to a given value, and all other hyperparameters are optimized. The assumption is that when an important hyperparameter is fixed to a given value, the result of the optimization procedure is worse than when an unimportant hyperparameter is fixed to a given value. When fixing a hyperparameter to a given value, the procedure is repeated several times with different values to avoid bias, and as such, the optimization procedure is carried out several times. Formally, for each hyperparameter \(\theta _j\), we measure \(y^*_{j,f}\) as the result of a random search for maximizing the metric, fixing \(\theta _j\) to a given value \(f \in F_j, F_j \subseteq \Theta _j\). For categorical \(\theta _j\) with domain \(\Theta _j\), \(F_j=\Theta _j\) is used. For numeric \(\theta _j\), following (van Rijn & Hutter, 2018), we use a set of 10 values spread uniformly over \(\theta _j\)’s range. We then compute \(y^*_j = \frac{1}{|F_j|}\sum _{f \in F_j}y^*_{j,f}\), representing the score when not optimizing hyperparameter \(\theta _j\), averaged over fixing \(\theta _j\) to various values it can take. Hyperparameters with lower values for \(y^*_j\) are assumed to be more important since the performance should deteriorate more when set sub-optimally.

In our study, we extend the verification step to the scale of quantum machine learning models. As quantum simulations can be costly, we use the predictions of the surrogate instead of fitting new quantum models during the verification experiment. The surrogates yield predictions \(\hat{y}\) for the performance of arbitrary hyperparameter settings sampled during a random search, which serve to compute \(y^*_{j,f}\). This is also the reason why we performed a second evaluation phase of the fitted surrogates: surrogates that perform poorly can reduce the quality of the computed marginals, resulting in inferred conclusions with potential bias.
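A minimal sketch of this surrogate-based verification procedure is given below; the unit-cube configuration encoding and the toy surrogate stand in for the encodings and surrogates fitted on our experimental data.

```python
# Illustrative sketch: random search with hyperparameter j fixed to value f,
# evaluated through a surrogate; averaging over fixed values gives y*_j.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def random_search_fixed(surrogate, n_hyper, j, f, rng, iterations=4096):
    candidates = rng.uniform(size=(iterations, n_hyper))  # other hyperparameters free
    candidates[:, j] = f                                   # theta_j fixed to f
    return surrogate.predict(candidates).max()             # y*_{j,f}

def y_star(surrogate, n_hyper, j, fixed_values, rng):
    # Lower values indicate a more important hyperparameter: performance
    # degrades more when it cannot be optimized.
    return np.mean([random_search_fixed(surrogate, n_hyper, j, f, rng)
                    for f in fixed_values])

# Toy surrogate in which hyperparameter 0 matters and hyperparameter 5 does not.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 10))
y = 0.6 + 0.3 * X[:, 0] + rng.normal(0, 0.01, 1000)
surrogate = RandomForestRegressor(n_estimators=64, random_state=0).fit(X, y)

print(y_star(surrogate, 10, j=0, fixed_values=np.linspace(0, 1, 10), rng=rng))
print(y_star(surrogate, 10, j=5, fixed_values=np.linspace(0, 1, 10), rng=rng))
```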

3.4 Deriving data-driven priors for hyperparameter optimization

From the functional ANOVA framework, we obtain insights, at the level of individual hyperparameters, into which ones are important to optimize and which are not. Additionally, we will explore ways to utilize well-performing hyperparameter values across datasets. The authors of van Rijn and Hutter (2018) demonstrated a procedure for this that learns data-driven priors across datasets from good hyperparameter values. The priors are then used within a hyperparameter optimization procedure such as hyperband (Li et al., 2017).

Fig. 4

Procedure outlining how we learn data-driven priors, following (van Rijn & Hutter, 2018). The top pane shows three individual datasets (utilized for training). Each dot here represents a configuration that was run on these datasets, with the red dots being the best-performing configurations. The bottom pane shows how a kernel density estimator can be utilized to learn data-driven priors inferred from the best configurations. This figure shows how this is done for a single hyperparameter, although it can be generalized to multi-dimensional configuration spaces

Figure 4 outlines this procedure. For each dataset, we utilize past experiments and determine which configurations yielded the best performance. For each dataset not used for evaluation, we identify the best-N performing configurations (N is to be determined by the user; we found that 10 and 20 work well) in terms of the metric that we are interested in (e.g., predictive accuracy). Per hyperparameter, we then gather the values that ended up in the best configurations. We fit a 1-dimensional distribution per hyperparameter (Larrañaga & Lozano, 2001). For numeric hyperparameters, we use a kernel density estimator; for categorical hyperparameters, we sample from a multinomial distribution (or Bernoulli in the case of 2 possible values) whose parameters are set according to the frequencies of the values. This distribution now represents a learned prior of good hyperparameter values and can be used to sample configurations in a hyperparameter optimization process.
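The following sketch illustrates how such one-dimensional priors can be fitted and sampled; the best-performing values shown are toy numbers, not our experimental results.

```python
# Illustrative sketch: a Gaussian KDE prior for a numeric hyperparameter and a
# frequency-based (multinomial) prior for a categorical hyperparameter.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Toy best log10(learning rate) values gathered from the other datasets.
best_log_lr = np.array([-2.1, -2.4, -1.8, -2.0, -2.6, -2.2, -1.9, -2.3])
lr_prior = gaussian_kde(best_log_lr)
sampled_log_lr = lr_prior.resample(5).ravel()      # draws for hyperband to use

# Toy best entangler-gate choices gathered from the other datasets.
best_entangler = ["cz", "cz", "sqiswap", "cz", "cz", "sqiswap", "cz", "cz"]
values, counts = np.unique(best_entangler, return_counts=True)
sampled_entangler = rng.choice(values, size=5, p=counts / counts.sum())

print(sampled_log_lr, sampled_entangler)
```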

4 Dataset and inclusion criteria

We consider classical datasets to apply the machinery of quantum neural networks and investigate the importance of the hyperparameters that were introduced before. Similarly to van Rijn and Hutter (2018), we use datasets from the OpenML-CC18 benchmark suite (Bischl et al., 2021). We adhere to a commonly used practice in the quantum community; i.e., we only consider datasets whose number of features matches the number of qubits that can be simulated on a server at the current scale of quantum simulations. We limit this study to the case where the number of features is at most 20 after pre-processing, as simulating quantum machine learning algorithms is computationally expensive. Hence, our first step was identifying which datasets fit this criterion. We include all datasets from the OpenML-CC18 with 20 or fewer features after categorical features have been one-hot-encoded and constant features are removed. Afterwards, the input variables are scaled to unit variance as a normalization step. Finally, the scaling constants are calculated on the training data and applied to the test data.
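A minimal sketch of this pre-processing, assuming pandas and scikit-learn and applied to hypothetical training and test data frames, is given below.

```python
# Illustrative sketch: one-hot encode categoricals, drop constant features, and
# scale to unit variance with statistics computed on the training split only.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def preprocess(train_df: pd.DataFrame, test_df: pd.DataFrame):
    # One-hot encode categorical columns consistently across both splits.
    train_enc = pd.get_dummies(train_df)
    test_enc = pd.get_dummies(test_df).reindex(columns=train_enc.columns, fill_value=0)

    pipeline = Pipeline([
        ("drop_constant", VarianceThreshold(threshold=0.0)),  # remove constant features
        ("scale", StandardScaler()),                          # unit-variance scaling
    ])
    # Scaling constants are fitted on the training data and reused on the test data.
    return pipeline.fit_transform(train_enc), pipeline.transform(test_enc)
```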

The final list of datasets is given in Table 2. In total, 7 datasets met the criterion considered in this study. For all of them, we picked the OpenML Task ID corresponding to the 10-fold cross-validation task. A quantum model is then applied following this cross-validation procedure with the aforementioned pre-processing steps.

Table 2 List of datasets used in this study. The number of features is obtained after a usual pre-processing used in machine learning methods, such as one-hot encoding

5 Results of hyperparameter importance

In this section, we present the results of the hyperparameter importance study.

5.1 Performance distributions per dataset

We independently sampled \(1\,000\) hyperparameter configurations for each dataset and simulated the corresponding quantum models with a budget of 100 epochs. We recorded the best validation accuracy obtained over the 100 epochs as the performance measure. Figure 3 shows the distribution of the 10-fold cross-validation accuracy obtained per dataset. The impact of hyperparameter optimization is visible in the difference between the worst- and best-performing model configurations. For instance, on the wilt dataset, the best-performing model gets an accuracy close to 1, and the least-performing model obtains a performance below 0.25. We can also see that some datasets present a smaller spread of performances; this is the case, for example, for ilpd and blood-transfusion-service-center. There, hyperparameter optimization does not seem to have a real effect because most hyperparameter configurations give the same result, and the surrogates could not differentiate between various configurations. For most datasets, however, hyperparameter optimization is vital for getting high performances; this analysis also detects the datasets on which the importance study can be meaningfully applied.

5.2 Surrogate verification

Functional ANOVA relies on an internal surrogate model to determine the marginal contribution per hyperparameter. If this surrogate model is not accurate, this can severely limit the conclusions drawn from functional ANOVA. Surrogate models are trained to map a configuration to a particular performance score: each hyperparameter value is given as an input, and the output is the performance measure of interest, in this case predictive accuracy. As such, in this experiment, we measure the performance of the surrogates without information on the number of epochs and verify whether the hyperparameters alone can explain the performances of the models. Table 3 shows the performance of the internal surrogate models. We notice low regression scores for two datasets (R2 scores below 0.75); hence, we remove them from the analysis.

Table 3 Results of the step in which we validate the surrogate models. This table shows performances of the surrogate models built within functional ANOVA over a 10-fold cross-validation procedure. We present the average coefficient of determination (R2), root mean squared error (RMSE), and Spearman’s rank correlation coefficient (CC). These are standard regression metrics for benchmarking surrogate models on hyperparameters (Eggensperger et al., 2015). The surrogates over ilpd and blood-transfusion-service-center obtain low scores (less than .75 R2). Hence we remove them from the study

5.3 Marginal contributions

For functional ANOVA, we used 128 trees for the surrogate model. Figure 5a, b shows the marginal contribution of each hyperparameter over the remaining 5 datasets. We visually distinguish 3 primary levels of importance. According to these results, the learning rate, depth, data encoding circuit, and reuploading strategy are critical. These results are in line with our expectations. According to functional ANOVA, the entangler gate, connectivity, and whether we use \(R_X\) gates in the variational layer are the least important. Hence, our results reveal new insights into hyperparameters that are generally given little consideration.

Fig. 5

The marginal contributions per dataset are presented as (a) the variance contribution and (b) the difference between the minimal and maximal value of the marginal of each hyperparameter. The hyperparameters are sorted from the least to the most important using the median. We distinguish 3 primary levels of importance from the plot

5.4 Verification of important hyperparameters

Fig. 6

Verification experiment of the importance of the hyperparameters. A random search procedure is used for up to 4096 iterations, excluding one parameter at a time. A lower curve means the hyperparameter is found to be less important

In line with the work of van Rijn and Hutter (2018), we perform an additional experiment that verifies whether the functional ANOVA outcomes align with our expectations. However, the verification procedure involves an expensive, post-hoc analysis: a random search procedure fixing one hyperparameter at a time. Therefore, as our quantum simulations are costly, we used the surrogate models fitted on the dataset under consideration over the \(1\,000\) initially obtained configurations to predict the performances one would obtain for a new configuration. Figure 6 shows the average rank of each run of random search, labeled with the hyperparameter whose value was fixed to a default value. A high rank implies poor performance compared to the other configurations, meaning that tuning this hyperparameter would have been important. We again witness the 3 levels of importance, with almost the same order obtained. However, the input_activation_function is found to be more important, while the batch size is found to be less important. More simulations with more datasets may be required to validate the importance ranking. Nevertheless, we empirically recover the importance of well-known hyperparameters while also assessing less-studied ones. Hence, functional ANOVA becomes an interesting tool for quantum machine learning in practice. Next, we demonstrate this by using the obtained results to improve a hyperparameter optimization algorithm.

5.5 Efficiency of data-driven priors for hyperparameter optimization

We utilize the performance data to build data-driven priors over good hyperparameter values across datasets. These data-driven priors can then be used in hyperparameter optimization methods, such as hyperband (Li et al., 2017), to replace uniform sampling strategies. We use the experimental data to obtain a set of data-driven priors (kernel density estimators and multinomial distributions) over hyperparameter values, similar to van Rijn and Hutter (2018). We demonstrate that employing the data-driven priors improves the results obtained with hyperband with respect to the default uniform sampling.

We utilize hyperband to verify whether the data-driven priors are more effective than uniform sampling. Generally, in hyperband, a number of hyperparameter configurations are sampled uniformly (where each hyperparameter value is sampled independently) to train machine learning models given a value of a user-defined notion of budget (e.g., the number of epochs). A proportion of the best-performing configurations is kept for the next iteration, in which the budget is increased, until one configuration remains. In our case, uniform sampling can be replaced by the data-driven priors, which can be sampled to obtain hyperparameter values. In order to get a balanced assessment of the conditions under which these data-driven priors work well, we also vary the hyperparameters of hyperband itself. While hyperband is arguably robust against ill-specified hyperparameters, we want to see whether the data-driven priors work well across the various options. Table 4 shows the various hyperparameters of hyperband that we considered. We ran an experiment on each combination of these hyperparameters. Note that, to keep the computational cost manageable, we do not run the actual algorithms that hyperband is exploring but use surrogate models instead. The latter are built similarly to the surrogate models previously considered in functional ANOVA (see Sec. 5.2); however, we also incorporate the number of epochs as a feature for this experiment. We do so to use the number of epochs as the notion of budget for hyperband. Configurations are sampled randomly with an increased number of epochs for a given hyperband iteration, and their performances are determined with the surrogate models.
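For illustration, the sketch below implements one successive-halving bracket of this scheme with a swappable sampling function, so uniform sampling can be replaced by the learned priors; the toy surrogate and sampler are stand-ins, and this is not the implementation used in our experiments.

```python
# Illustrative sketch: one successive-halving bracket with a pluggable sampler.
import numpy as np

def successive_halving(sample_config, predict, n_configs=27, min_epochs=1,
                       max_epochs=100, eta=3):
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_epochs
    while len(configs) > 1 and budget <= max_epochs:
        scores = [predict(c, budget) for c in configs]   # surrogate evaluation
        keep = max(1, len(configs) // eta)                # keep the best 1/eta fraction
        best = np.argsort(scores)[::-1][:keep]
        configs = [configs[i] for i in best]
        budget = min(max_epochs, budget * eta)            # increase the budget
    return configs[0]

# Toy usage: `uniform_sampler` could be replaced by sampling from the priors.
rng = np.random.default_rng(0)
uniform_sampler = lambda: rng.uniform(size=10)
toy_predict = lambda config, epochs: 0.6 + 0.3 * config[0] + 0.001 * epochs
print(successive_halving(uniform_sampler, toy_predict))
```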

Table 4 List of hyperparameters considered for hyperband experiments

Figure 10 in Appendix B illustrates example kernel density estimators for the two most important hyperparameters of a quantum neural network. We see from the figure that (i) the default value of Adam’s learning rate commonly used by deep learning practitioners is a reasonable choice, and (ii) higher depths are favoured in parameterized quantum circuits, as depth makes them more expressive (a desirable property of any machine learning model). We note that, in order to evaluate the data-driven priors on a given dataset, we build these priors on all other datasets except this one (leave-one-dataset-out). As such, we have a slightly different set of data-driven priors for each experiment.

For our experiments, we run hyperband with different values for its hyperparameters with 15 random seeds per dataset, using the previously-mentioned surrogate models to obtain performance scores per configuration (to economize on the otherwise extensive runtime). Figures 7 and 8 illustrate our results. For each dataset, the difference in accuracy between the two sampling strategies is depicted: positive values indicate that sampling based on data-driven priors performs better. Each data point represents a given configuration of hyperband with data-driven priors compared to a configuration of hyperband with uniform sampling. The violin plots aggregate over various datasets (5), random seeds (15) and various hyperparameter settings of hyperband (see Table 4).

Fig. 7

Relative difference in performance improvement between two instances of hyperband, one sampling based on the learned priors and one using uniform sampling. Positive values indicate superior performance by using data-driven priors and vice versa. We show the distribution of performances over all combinations of hyperband parameters from Table 4, random seeds and datasets (a), and for the data-driven prior hyperband run achieving the largest average performance gain over the uniform priors (averaged over all random seeds and datasets) (b) and for the data-driven prior hyperband run achieving the lowest average performance gain over the uniform priors (averaged over all random seeds and datasets) (c). We note that the data-driven priors are superior to the uniform sampling across all tried configurations of hyperband

Fig. 8

Relative difference in performance improvement between two instances of hyperband, one sampling based on the data-driven priors and one using uniform sampling. We show the distribution of performances over all combinations of hyperband parameters from Table 4, random seeds, and datasets. The figure from left (a) to right (d) shows a slight increase in average improvement as the number of hyperparameters considered for building data-driven priors grows. Here, lr, depth, activation, and reuploading denote Adam optimizer’s learning rate, quantum circuit depth, activation function, and data reuploading

In general, the results suggest that using data-driven priors built from good hyperparameter values can aid in finding better configurations of quantum neural networks. Indeed, as shown in Fig. 7a, the average improvement is \(0.53\%\), and the maximum was \(6.11\%\) over all hyperband hyperparameters tried. From Fig. 7b, the best run in terms of average performance achieved an average improvement of \(1.41\%\), and the maximum was \(4.04\%\). Moreover, from Fig. 7c, the worst run in terms of average performance achieved an average improvement of \(0.14\%\), and the maximum was \(1.21\%\). Furthermore, an improvement was obtained across all hyperband runs in \(77.2 \%\) of cases. Finally, Fig. 8 shows that the average improvement generally increases as we add hyperparameters according to their importance ranking when learning the data-driven priors. This demonstrates the benefit of balancing exploration and exploitation using hyperparameter importance when performing hyperparameter optimization.

More simulations with more datasets can be performed as methods enabling the usage of quantum neural networks on datasets with a number of features greater than the number of qubits are being developed (Haug et al., 2023; Peters et al., 2021). However, we verified empirically that the importance information obtained from functional ANOVA could be helpful for hyperparameter optimization. More sophisticated hyperparameter optimization algorithms can also benefit from such studies.

6 Limitations

We discuss three limitations to better place the work in a context that could guide further research. Firstly, the work heavily leans on the hyperparameter importance definition of functional ANOVA. While this is a well-established measure built on a solid foundation of prior work (Sobol, 1993; Hutter et al., 2014; van Rijn & Hutter, 2018), there are also downsides. The use of the marginal has a very desirable property in that it is not conditioned on the value of other hyperparameters, as it averages over all possible values for all other hyperparameters. Following this definition, it considers both very well-performing hyperparameter configurations and all mediocre-performing configurations. Thereby, it gives good information on what hyperparameters are generally important, giving a global picture. However, in some cases, we might be more interested in the hyperparameters that do not seem important on a global level but are responsible for obtaining the final bits of performance. These are usually hard to detect by functional ANOVA. The work of Hooker (2007) proposes a way to filter out bad-performing configurations, compromising the global overview of functional ANOVA in favour of the aforementioned important details. However, it is essential to note that every definition of hyperparameter importance always comes with a particular definition bias that should be clearly communicated.

Secondly, due to the high amount of required CPU resources, we have decided to perform both the verification experiment and the evaluation of the data-driven priors on learned surrogate models. The surrogate models are learned based on the experimental data we have gathered and evaluated extensively (see Section 5.2). However, like any model, the surrogates are not perfect, and by introducing a surrogate in the experimentation, a bias is induced in exchange for economizing on CPU time. Since the experiments of van Rijn and Hutter (2018) were run on actual models (rather than surrogate models) and already confirmed the high correlation between the functional ANOVA results and the verification experiment, we believe that using the surrogate models for this paper is a valid approach.

Finally, in this study, we have considered one family of quantum neural network architectures, whereas the literature has proposed many more architectures. While it is likely that some hyperparameter importance results generalize across different architectures (e.g., the learning rate and the depth will likely always be important), it should be noted that only limited conclusions can be transferred across different architectures. Ideally, this study can be repeated over a broad range of quantum neural network architectures, identifying even more general patterns. For instance, various hyperparameters of the models can influence the trainability and optimization of the model. It is known that quantum neural networks can suffer from the phenomenon of barren plateaus (McClean et al., 2018), similar to the vanishing gradients problem inherent in neural networks, resulting in trainability issues. Several hyperparameters, such as the circuit architecture (Napp, 2022), the input state (Larocca et al., 2022), and the depth (Cerezo et al., 2021), are related to barren plateaus. It would be interesting to extend this study by adding to the hyperparameters quantum circuit specifications designed to mitigate barren plateaus (Pesah et al., 2021; Grant et al., 2019; Zhang et al., 2022a; Sack et al., 2022) and tailored optimization procedures for quantum machine learning (Moussa et al., 2022; Kulshrestha & Safro, 2022).

7 Conclusion

In this work, we assess the importance of several hyperparameters related to quantum neural networks for classification using the functional ANOVA framework. Our experiments rely on OpenML datasets matching the current scale of quantum hardware simulations (i.e., datasets with at most 20 features after applying pre-processing operators, hence using at most 20 qubits). We selected and presented the hyperparameters based on an investigation of quantum computing literature and software. Firstly, hyperparameter optimization highlighted datasets where we observed high variance in the performance across configurations (see Fig. 3). In particular, for the ‘wilt’ dataset, the performances were spread from 25 to 100%. This further underlines the importance of hyperparameter optimization for these datasets. There were also datasets where this variance was negligible.

Following (van Rijn & Hutter, 2018), we utilized functional ANOVA to attribute the variance in performance to the various (combinations of) hyperparameters. Hyperparameters that contribute to a high variance are considered important, whereas hyperparameters that contribute only a small variance are considered unimportant. From our results, we distinguished 3 primary levels of importance. On the one hand, Adam’s learning rate, the depth, and the data encoding strategy are found to be very important, as we expected. On the other hand, less considered hyperparameters, such as the particular choice of the entangling gate and using 3 rotation types in the variational layer, are in the least important group. Hence, our experiment confirmed expected patterns and revealed new insights for the selection of the quantum model. We confirmed these results by cross-checking them against an extensive experiment, where we ran several hyperparameter optimization processes, each time optimizing all but one hyperparameter. When such a hyperparameter optimization run was still successful even though a given hyperparameter was not optimized, this indicates that the hyperparameter was not important, and vice versa. There was a high correlation between the hyperparameters found to be important by both analyses. For example, both rank the learning rate, depth, and data reuploading among the most important hyperparameters.

Finally, following (van Rijn & Hutter, 2018), we utilize good configurations across datasets to create data-driven priors that can be sampled from in hyperparameter optimization processes (as opposed to the commonly used uniform or log-uniform prior). We demonstrated that using prior information on good hyperparameter values benefits the hyperparameter optimization process by comparing a hyperband optimization process with the data-driven priors against a hyperband optimization process with uniform priors. This experiment was repeated with many different settings for hyperband, covering many different hyperparameter configurations of the hyperparameter optimization method itself. The results indicate that the data-driven priors outperform the uniform priors in \(77.2 \%\) of the cases.

For future work, we plan to further investigate methods from the field of automated machine learning to be applied to quantum neural networks (Brazdil et al., 2022; Feurer et al., 2020; Mohr & van Rijn, 2022). Indeed, our experiments have shown the importance of hyperparameter optimization, which should become standard practice and part of the protocols applied within the community. We further envision functional ANOVA to be employed in future works related to quantum machine learning and to understanding how to apply quantum models in practice. For instance, it would be interesting to consider quantum data, for which quantum machine learning models may have an advantage. In addition, extending the hyperparameter importance study to techniques for scaling to a number of features larger than the number of qubits, such as dimensionality reduction or divide-and-conquer techniques, is left for future work. Finally, this type of study can also be extended to different noisy hardware and towards algorithm/model selection and design. For example, choosing which hardware works best for machine learning tasks becomes possible if we have access to a cluster of different quantum computers. One could also extend our work with meta-learning (Brazdil et al., 2022), where a model configuration is selected based on meta-features created from dataset features. Such types of studies already exist for parameterized quantum circuits applied to combinatorial optimization (Moussa et al., 2020, 2022; Sauvage et al., 2021).