1 Introduction

Data pre-processing techniques, along with exploratory analysis and feature engineering, are standard steps within pipelines for predictive model training and deployment [1]. In practice, before feeding data to a learning algorithm, it is common to cleanse data sets in order to improve the predictive performance and generalization capability of models.

Feature selection (FS) is a pre-processing technique in which a subset of features is selected from a data set; because of collinear, redundant, or constant features in the full data set, such a subset may carry more relevant information than the total set of features. Benefits of applying FS as a pre-processing step include reduced data dimensionality, better scaling for learning algorithms in practice, and reduced costs for model training and fitting. Feature selection differs from techniques such as feature engineering in that it does not project linear combinations of features into a new subspace, or perform a change of basis using an eigendecomposition of a covariance matrix, as in Principal Components Analysis [2]; rather, it filters out features that carry little information about a target variable. This is an important property, since the sparse representation and preservation of the original feature space allow for higher visibility and explainability when accounting for predictive model output [3]. Feature selection techniques may leverage the output of an estimator or model in the selection process, known as wrapper methods [4], or may use heuristics to select an optimal subset, known as filter methods. Feature selection methods have been used in a variety of applications, including areas such as character recognition and healthcare [5,6,7,8]. In this work, we examine quantum-assisted methods to find feature subsets of size k that may improve predictive performance while reducing the input feature space of our learning algorithms; finding such an optimal subset is a known NP-hard combinatorial optimization problem [9].

Our contribution is as follows: framing the combinatorial search as a binary quadratic model, we leverage a quantum device to search for optimal feature subsets of size k. We investigate various distance and correlation metrics, which we compute classically, for the formulation of the binary quadratic optimization problem. We apply the heuristic of Maximal Relevancy Minimal Redundancy (MRMR) [10] to the feature selection problem, with the idea that we want a subset of features that has a strong predictive signal with respect to the target variable (maximal relevancy) but low pairwise correlation between independent variables (minimal redundancy). We train and compare two types of regression models using quantum-assisted feature selection along with benchmark selection methods, including all features, greedy selection, and a wrapper method, over two data sets with continuous target values. We examine model predictive performance with this methodology and apply it to a real-world problem of data pre-processing for predictive models of vehicle prices. We show that by using quantum-assisted routines, we find combinations of features that increase the predictive quality of models on validation data and improve upon our benchmarks of all features, greedy feature selection, and recursive feature elimination.

1.1 Related work

The feature selection problem is well studied in the literature [11, 12]. Recent approaches apply mutual-information-based metrics to feature selection for supervised classification [13]. In recent years, new correlation metrics have been proposed which may have more expressive power in measuring relationships between variables. Examples are the Maximal Information Coefficient (MIC) [14] and the Generalized Mean Information Coefficient (GMIC) [15], which introduce the concept of equitability in variable relationships and may be more robust to non-linear relationships than correlation statistics such as the Pearson correlation coefficient, which assumes linear relationships.

In the current era of quantum computing, devices built around quantum processing units (QPUs) have come online, and applications leveraging them have been developed for solving real-world problems across various industries. For example, in [16] the authors used a quantum annealing system to optimize traffic flows around the city of Beijing, and the authors in [17] showed how to price options using quantum algorithms run on a gate-model quantum chip.

The feature subset selection problem was formulated by the authors in [18] as a quadratic program, where mutual information and the Pearson correlation coefficient were used to calculate the matrix \(\textbf{Q}\), a symmetric positive semi-definite matrix representing the quadratic terms of the multi-variate objective function to be minimized. There have been research efforts to apply quantum annealing to searching the feature space for optimal subsets using mutual-information-based formulations of the interaction and linear terms of Ising spin-glass models and quadratic unconstrained binary optimization [19, 20]. Some of these efforts have included complexity considerations for the scaling of the quantum-assisted feature subset selection problem, claiming performance gains of \(O(1/m^2)\) versus \(O(mn^2)\) for classical computation [20]. That work, in particular, claims a bound for the quantum-assisted routine given by the size of the minimum gap \(g(t) = E_1 - E_0\) between the energy eigenvalues during the anneal, with a resulting time complexity of \(T = O(1/g_{\min }^2)\), where T is the upper time limit.

Price prediction has been well studied in the machine learning literature [21,22,23]. In [21], the authors built predictive models for used-car prices in Mauritius. The models included Naive Bayes and decision tree algorithms, which contained a classification step with reported accuracy rates in the range of 60–70%, and achieved mean errors of 51000 and 27000 for the regression component using linear regression. Monburinon et al. tested various regression models for price prediction on German used cars in [22], with gradient-boosted decision trees outperforming random forest and multiple regression (mean squared error = 0.28). In [23] the authors applied feature selection techniques to multiple regression models for price prediction and reported a model-fitness score of \(R^2 = 0.9861\).

2 Methods

Consider a data set:

$$\begin{aligned} \mathbb {D} = \{\textbf{X}\in \mathbb {R}^{N\times M}, \textbf{y}\in \mathbb {R}^N\} \end{aligned}$$

where \(\textbf{X}\) is a data matrix with N rows and M columns, and \(\textbf{y}\) is a column vector with N rows and one column. We wish to find an approximate functional mapping, or hypothesis, \(h(\textbf{X}) \simeq \textbf{y}\) using various learning algorithms. This setting is known as supervised learning: the goal is to minimize the generalization error of a given loss function between predictions and the target variable, so that we can make predictions on new data, i.e., to find the best hypothesis from the hypothesis space that the given learning algorithm encompasses.

In the feature subset selection problem we wish to find a subset of features \(\mathbf {X_k}\) in which each row vector \(\mathbf {x_i}\) of \(\textbf{X}\) (where \(i = \{1, 2, \ldots , N\}\)) is reduced to k columns, i.e., each row vector is filtered down to \(\{x_{i1}, x_{i2}, \ldots , x_{ik}\}\) with \(k < M\). We may investigate various values of k such that the loss of \(h(\mathbf {X_k}) \simeq \textbf{y}\) is less than or equal to that of \(h(\textbf{X}) \simeq \textbf{y}\), cross-validated on a held-out test set. Essentially, the feature subset selection problem is one of choosing the optimal subset of k columns from \(\textbf{X}\). We assume that an optimal subset of size k exists, which may not be the case for all data sets; for some data sets the best achievable subset satisfies \(h(\mathbf {X_k}) = h(\textbf{X})\) only with \(k = M\).
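To make this concrete, the short sketch below evaluates a candidate subset of k columns against the full feature set on a held-out split; the candidate indices shown are hypothetical and only for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def holdout_mae(X, y, columns=None, seed=0):
    """Fit a linear model on the given columns and return the hold-out MAE."""
    if columns is not None:
        X = X[:, columns]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))

# Compare the full feature set against a hypothetical candidate subset of k = 5 columns:
# holdout_mae(X, y)                      # all M features
# holdout_mae(X, y, [0, 3, 7, 12, 21])   # candidate subset X_k
```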

2.1 Relevancy, correlation and distance metrics

In order to compare features with a target variable, and to investigate pairwise relationships within the feature set \(\textbf{X}\), we first define the distance functions that we compute classically and use to formulate the binary optimization problem for feature subset selection, whose output in turn feeds our predictive models. In our experiments, we swap out and compare each distance function when modelling the linear and quadratic terms of the binary quadratic model. Note that we compute these distances classically before using them as the terms of a binary quadratic model, which we sample using a quantum annealer to obtain subsets of features.

We examined Maximal Information Coefficient (MIC), Generalized Mean Information Coefficient (GMIC), Mutual Information (MI), and Pearson Correlation Coefficient (PCC) for this study. Let us define each in the following:

Mutual Information (MI) is defined as:

$$\begin{aligned} MI(\textbf{x, y}) = \sum _{\textbf{x, y}} P(\textbf{x, y}) \ln {{P(\textbf{x, y})}\over {P(\textbf{x}) P(\textbf{y})}} \end{aligned}$$
(1)

In our case, we deal with mixes of purely continuous as well as continuous and discrete variables, and there are various strategies for binning continuous values to estimate the probabilities in the MI calculation, including kernel density estimation, discretizing continuous variables into bins, and clustering algorithms. For our method we use the k-nearest-neighbor (k-nn) binning strategy from [24] to estimate the MI:

$$\begin{aligned} I(\textbf{x, y}) = \psi {(N)}- \langle \psi {(\textbf{x})}\rangle +\psi {(\kappa )} - \langle \psi {(\textbf{m})}\rangle \end{aligned}$$
(2)

where \(\psi\) is the digamma function, \(\kappa\) is the chosen number of nearest neighbors, and \(\textbf{m}\) is the count of neighboring points. For more details please also see [25].
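As a hedged illustration, scikit-learn's mutual_info_regression implements a k-NN estimator of this same family, so relevancy scores can be estimated along the following lines (toy data and illustrative parameter choices):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5))                      # toy feature matrix
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + X[:, 3] + rng.normal(size=100)

# k-NN based MI estimate between each feature column and the continuous target.
mi_scores = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
print(mi_scores)  # higher values indicate more relevant features
```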

Various algorithms have been proposed to handle mixes of discrete and continuous, as well as purely continuous, variables by mapping mutual information onto a grid. MIC and GMIC belong to this category.

MIC is defined as follows [15]. First, we define a characteristic matrix C:

$$\begin{aligned} C(\textbf{x, y})_{i,j} = \frac{I^*(\textbf{x,y})_{i,j}}{\log _2 \min \{i, j\}} \end{aligned}$$
(3)

where \(I^*(\textbf{x,y})_{i,j}\) represents the binned values of \(\textbf{x,y}\) in a grid \(G_{ij}\). We then obtain the MIC using this characteristic matrix:

$$\begin{aligned} MIC(\textbf{x,y}) = \max _{ij<B(n)}\{C(\textbf{x, y})_{i,j}\} \end{aligned}$$
(4)

where \(B(n) = n^{0.6}\) is the maximal grid size recommended in the original text [14]. Note that ij in this equation denotes a product describing the grid size.

We can define GMIC by taking the characteristic matrix and extending it to find the maximal characteristic matrix:

$$\begin{aligned} C^*(\textbf{x, y})_{i,j} = \max _{kl<ij}\{C(\textbf{x, y})_{kl}\} \end{aligned}$$
(5)

where we again use the terms kl and ij to denote grid sizes. We may then use this to define the GMIC measure, as given in [15]:

$$\begin{aligned} GMIC(\textbf{x, y}) = \left( \frac{1}{Z} \sum _{ij<B(n)} C^*(\textbf{x, y})_{i,j}^p \right) ^{1/p} \end{aligned}$$
(6)

where \(p \in [-\infty , \infty ]\). For our work, we set \(p = -1\), and Z is the normalizing count of grid sizes ij with \(ij \le B(n)\), as outlined in the original work [15].

There is a lively debate in the literature about the statistical power of these methodologies [26]. We note that these methods incur additional computational overhead from allocating grids, as well as from tuning parameters such as B(n) in the case of MIC and p in GMIC.

In our study, we reviewed the performance of the Pearson Correlation Coefficient (PCC). This statistical measure is ubiquitous in science and engineering for measuring linear relationships between variables.

Pearson correlation coefficient is defined as [27]:

$$\begin{aligned} PCC(\textbf{x, y}) = \frac{ \sum _{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum _{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum _{i=1}^{n}(y_i-\bar{y})^2}} \end{aligned}$$
(7)

where \(\bar{x}\) and \(\bar{y}\) are the means of \(\textbf{x}\) and \(\textbf{y}\).
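To illustrate how these statistics are obtained classically before the QUBO is assembled, the sketch below computes MIC with the minepy library and PCC with scipy on toy data; the parameter choices (e.g., \(\alpha = 0.6\), corresponding to \(B(n) = n^{0.6}\)) are illustrative.

```python
import numpy as np
from minepy import MINE
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = x ** 2 + rng.normal(scale=0.1, size=200)   # nonlinear relationship

# Maximal Information Coefficient; alpha = 0.6 corresponds to B(n) = n**0.6.
mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)
print("MIC:", mine.mic())

# Pearson correlation coefficient captures only the linear part of the association.
print("PCC:", pearsonr(x, y)[0])
```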

2.2 QUBO formulation

In order to formulate our problem as a quadratic unconstrained binary optimization (QUBO) model, we first need to calculate a distance measure between each column vector i of \(\textbf{X}\) and \(\textbf{y}\). We take the absolute value of each distance measure and negate it, since we wish to find a minimum of our optimization problem. These values then become the linear terms along the diagonal of the \(\textbf{Q}\) matrix.

With these linear terms, we encode the \(maximal \ relevancy\) portion of our heuristic; we negate the values because the optimization takes the form of finding the minimum over the domain. By taking the absolute value of the distance function, we treat positive and negative correlation equally, giving greater weight to features with higher relevancy or correlation with the target variable.

Then, we calculate the distance between each pair of column vectors indexed i and j in \(\textbf{X}\). This gives the interactions for the quadratic terms of the binary quadratic model, which become the entries in the upper triangle of the \(\textbf{Q}\) matrix. This encodes the \(minimum \ redundancy\) characteristic of our heuristic: we want combinations of features that are uncorrelated, or more distant from each other, while retaining relevancy to the target variable.

$$\begin{aligned} \textbf{Q}_{ij} = {\left\{ \begin{array}{ll} - \mid {distance(X_i, y)} \mid , &{} \text {if}\ i=j \\ \mid {distance(X_i, X_j)} \mid , &{} \text {if}\ i < j \\ 0, \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

We combine these to obtain the QUBO formulation of our optimization problem. For clarity, we use the \(\mathbf {\omega }\) term here to represent a vector of qubit values. We include a scaling parameter \(\alpha\) that we use to scale the domain of the optimization landscape.

$$\begin{aligned} E({\omega }) = \alpha \sum _{i\le j}{{\omega }}_i\textbf{Q}_{ij}{\omega }_j \qquad {\omega }_i \in \{0, 1\} \end{aligned}$$
(9)

Finally, we impose a penalty term, such that the resulting sample from our objective function has only k qubits turned "on", i.e., with a value of 1. We introduce a scaling parameter for the penalty term, \(\lambda\), to enforce this constraint.

$$\begin{aligned} E({\omega }) = \alpha \sum _{i\le j}{\omega }_i\textbf{Q}_{ij}\mathbf {\omega }_j + \lambda (\sum _i{\omega }_i - k)^2 \end{aligned}$$
(10)

We then use this formulation to follow an annealing schedule and sample solutions that minimize the energy E. In the resulting vector \(\mathbf {\omega }\), which has length M, the constraint ensures that k qubits have value 1 and the rest 0. We then use this vector to filter the data set \(\textbf{X}\) and obtain \(\mathbf {X_k}\), keeping the k columns at the index locations m where \(\omega _m = 1\), for \(m \in \{1, 2, \ldots , M\}\).
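As a minimal sketch of this construction (not the exact implementation used in our experiments), the function below fills the \(\textbf{Q}\) matrix of Eq. (8) from precomputed distance values, expands the penalty of Eq. (10) into linear and quadratic terms using \(\omega _i^2 = \omega _i\) for binary variables, and wraps the result in a dimod binary quadratic model; the array names and default parameter values are placeholders.

```python
import dimod

def build_bqm(relevance, redundancy, k, alpha=1000.0, lam=10.0):
    """relevance: length-M array of |distance(X_i, y)|;
    redundancy: M x M array of |distance(X_i, X_j)| (upper triangle used)."""
    M = len(relevance)
    Q = {}
    for i in range(M):
        # Linear term: negated relevancy (maximal relevancy), plus the
        # lambda * (1 - 2k) contribution from expanding (sum_i w_i - k)^2
        # with w_i^2 = w_i for binary variables.
        Q[(i, i)] = alpha * (-abs(relevance[i])) + lam * (1 - 2 * k)
        for j in range(i + 1, M):
            # Quadratic term: pairwise redundancy, plus the 2 * lambda penalty term.
            Q[(i, j)] = alpha * abs(redundancy[i][j]) + 2 * lam
    # The constant offset lambda * k**2 completes the squared penalty.
    return dimod.BinaryQuadraticModel.from_qubo(Q, offset=lam * k ** 2)
```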

2.3 Data sets

For our experiments, we tested our feature selection algorithms on two data sets, one synthetic, and one drawn from real-world samples.

The first data set that we used was the Friedman 1 data set [28]. Here, the data are generated by the function:

$$\begin{aligned} \textbf{y} = 10 \sin (\pi x_1x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \epsilon \end{aligned}$$
(11)

for each row \(\mathbf {x_i} \in \textbf{X}\), where \(\epsilon\) is random, normally distributed noise, and the remaining features are independent and drawn from a uniform distribution on the interval [0, 1]. We generated 100 instances of this data, with a feature size of 50, for our 70/30 train/test split. We specifically tested this data set in order to study differences in performance between the various mutual-information-based metrics within the QUBO formulation, since the generating function contains a non-linear term and one of the reported strengths of these distance metrics is in measuring non-linear relationships. Another advantage of experimenting with this data set is that the optimal subset of features is known in advance, so we can determine how accurately the feature selection algorithms recover the optimum.
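An equivalent data set can be generated with scikit-learn's make_friedman1 helper; the sketch below mirrors the setup described above, with the noise level chosen for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

# 100 samples, 50 features; only the first five features enter the generating function.
X, y = make_friedman1(n_samples=100, n_features=50, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```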

The next data set that we tested was for vehicle price prediction, using the open-source Auto data set from the UCI machine learning repository [29]. In this data set, we have prices for 205 automobiles, along with other features such as fuel type, engine type, and engine size. We encoded all categorical and nominal features using ordinal encoding, which preserved the attribute count of 26. This data set also includes the target variable price, in the range of \([5118, \ldots , 45400]\).
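A minimal sketch of the encoding step is given below; the file name and column handling are illustrative assumptions rather than the exact pre-processing pipeline used.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical local copy of the UCI Automobile data file.
df = pd.read_csv("imports-85.data", header=None, na_values="?").dropna()

# Ordinally encode the non-numeric columns, preserving the attribute count.
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])
```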

Fig. 1 Correlation plot for the UCI Auto data, using the Pearson correlation coefficient as a distance metric between features. Higher values of positive correlation are indicated in dark blue and negative correlation in red. In the feature subset selection problem, we wish to select a subset of features such that the relevancy with regard to a target variable, in this case price, is maximized, and the feature-to-feature correlation amongst independent variables is minimized

In performing an exploratory analysis of the Auto data, we examined the correlation between the features and the target variable. Reviewing Fig. 1 for the UCI Auto data, we notice several features strongly correlated with the target variable price. In terms of positive correlation, curb weight (PCC = 0.80) and engine size (PCC = 0.84) show a strong linear relationship, while city and highway miles per gallon (PCC = \(-0.66\), PCC = \(-0.69\)) show a strong negative correlation.

2.4 Estimators

During the training/testing regime, we used a 70/30 percentage split between training and test sets. We measured the performance of two different types of predictive models for the underlying regression problem. These models included the following:

Multiple linear regression (LR): We used a multiple linear regression model to estimate predicted values of \(\mathbf {\hat{y}}\) [30]. Multiple Linear Regression is defined as:

$$\begin{aligned} \textbf{y} = \textbf{X} \textbf{w} + \mathbf {\epsilon } \end{aligned}$$
(12)

where \(\textbf{w}\) is the vector of parameters that we wish to optimize such that \(\epsilon = \textbf{y} - \textbf{X}\textbf{w}\) is minimized. We may then apply the trained parameters to new test data \(\textbf{X}\) to obtain \(\mathbf {\hat{y}}\). We use the term multiple linear regression to clarify that we have two or more independent variables for which we would like to find a mapping to a target variable.
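For illustration, the least-squares parameters can be computed directly, for example with numpy's lstsq on toy data (an intercept column is appended explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so that the intercept is part of w.
X1 = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares estimate of w
y_hat = X1 @ w                               # predictions from the fitted parameters
```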

Gradient boosted regression trees (GBR): We evaluated a second regression model, gradient-boosted ensembles of regression trees, to benchmark against LR. The gradient-boosted regression trees algorithm sequentially builds trees from the features of the training data, and the combined trees form an ensemble that outputs a prediction given new data. For further reading, insightful discussion, and definitions of tree-based learning methods, please see [31].

The performance criterion for our trained models was the mean absolute error (MAE). Model validation was performed using cross-validation with 3 randomized held-out folds.
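A sketch of this evaluation loop, using scikit-learn's negated-MAE scoring convention and 3 folds, is shown below; the estimator settings are illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mae(estimator, X, y, folds=3):
    """Average MAE over the cross-validation folds (scores are negated by scikit-learn)."""
    scores = cross_val_score(estimator, X, y, cv=folds,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

# Example usage on a selected feature subset X_k and target y:
# cv_mae(LinearRegression(), X_k, y)
# cv_mae(GradientBoostingRegressor(random_state=0), X_k, y)
```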

In examining the Friedman 1 data set, we designed a performance metric to test whether or not the FS algorithm selected the first 5 features in the set, which we know to be the optimal subset. We call this metric \(\textit{Subset accuracy}\) and build it from a hit score: the cardinality of the intersection of the set of selected feature indices \(\mathbf {k_{selected}}\) and the set of optimal feature indices \(\mathbf {k_{opt}}\), divided by the cardinality of the optimal feature set.

$$\begin{aligned} \text{hit score} = \frac{card(\mathbf {k_{selected}} \cap \mathbf {k_{opt}})}{card(\mathbf {k_{opt}})} \end{aligned}$$
(13)

We then calculate a length score by taking the absolute difference between the cardinality of \(\mathbf {k_{selected}}\) and the cardinality of \(\mathbf {k_{opt}}\), normalizing by the cardinality of \(\mathbf {k_{opt}}\), and subtracting this value from 1.

$$\begin{aligned} \text{length score} = 1 - \frac{\mid {card}(\mathbf {k_{selected}}) - {card}(\mathbf {k_{opt}})\mid }{card(\mathbf {k_{opt}})} \end{aligned}$$
(14)

We then simply sum the two and divide by two to obtain Subset accuracy:

$$\begin{aligned} \text{Subset accuracy} = \frac{\text{hit score} + \text{length score}}{2} \end{aligned}$$
(15)
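A direct translation of Eqs. (13)–(15) into code is sketched below, assuming the feature indices are given as sets.

```python
def subset_accuracy(k_selected, k_opt):
    """Combine the hit score and length score of Eqs. (13)-(15)."""
    k_selected, k_opt = set(k_selected), set(k_opt)
    hit = len(k_selected & k_opt) / len(k_opt)
    length = 1 - abs(len(k_selected) - len(k_opt)) / len(k_opt)
    return (hit + length) / 2

# Example: the Friedman 1 generating subset is the first five features.
print(subset_accuracy({0, 1, 2, 4, 7}, {0, 1, 2, 3, 4}))  # 0.9
```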

2.5 Baselines, filter and wrapper methods

For each quantum-assisted feature selection method, we bootstrapped each run with 10 result sets, and for each result we queried the quantum processing unit (QPU) for 10000 shots. We evaluated each result set, took the best overall result, and averaged it over the folds of cross-validation. For each data set, we compared the quantum-assisted feature selection methods against the following methods, each cross-validated using 3 folds with replacement:

  • All features (All): We initially fit the estimator over all features in the training set of \(\textbf{X}\) and \(\textbf{y}\) in order to understand the test error and establish a baseline for performance criteria.

  • Greedy Ranked Method (GR): We devised a simple ranking algorithm: for each feature column vector \(\mathbf {x_k} \in \mathbf {X_k}\), we calculated \(MIC(\mathbf {x_k, y})\), sorted the features by this relevancy criterion, and selected the top k, in this case the top fifty percent of ranked features. We used this heuristic since we did not have an a priori intuition as to the best k.

  • Recursive Feature Elimination (RFE): We also tested a wrapper-style feature selection algorithm, Recursive Feature Elimination (RFE), as shown in [32], in order to benchmark our quantum-assisted method against a wrapper method. For further implementation details please see [32]. A brief usage sketch of the GR and RFE baselines follows this list.
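The sketch below illustrates both baselines under stated assumptions: MIC scores for the greedy ranking are computed with minepy, and RFE wraps a linear model via scikit-learn; neither is necessarily the exact configuration used in our experiments.

```python
import numpy as np
from minepy import MINE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def greedy_ranked(X, y, top_fraction=0.5):
    """Rank features by MIC against the target and keep the top fraction."""
    mine = MINE(alpha=0.6, c=15)
    scores = []
    for j in range(X.shape[1]):
        mine.compute_score(X[:, j], y)
        scores.append(mine.mic())
    k = max(1, int(top_fraction * X.shape[1]))
    return np.argsort(scores)[::-1][:k]      # indices of the selected features

def rfe_selected(X, y, k):
    """Wrapper-style baseline: recursive feature elimination around a linear model."""
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X, y)
    return np.flatnonzero(selector.support_)
```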

2.6 Parameters for experimentation

For each data set and each estimator, we manually tuned the parameters \(\alpha\) and \(\lambda\) within the QUBO formulation to values of 1000 and 10, respectively. With this, we were able to obtain reasonable results over our baseline methods. This can be explained by the parameter \(\alpha =1000\) producing a scaling effect, creating a more rugged objective landscape to optimize over, while the parameter \(\lambda =10\) had the effect of constraining the result sets to smaller or larger ranges for the size of k. An interesting detail emerged when testing various settings for \(\lambda\): scaling to different values sometimes led to values of k that improved model fit and predictive performance compared with restricting the output to a pre-defined k. While optimizing these hyper-parameters was not in the scope of this work, we believe that future work could investigate this point in further detail.

2.7 Implementation details

Python code for the experiments was written using the libraries scikit-learn [33], pandas [34], and scipy [35]. Access to the D-Wave quantum annealer was obtained through the Leap software and API and the dimod library. One important point to note is that D-Wave provides a set of software tools to embed the binary quadratic model on the QPU. As the QPUs do not have full connectivity, a minor embedding must be created using chains of physical qubits: groups of physical qubits represent one logical qubit with full connectivity and are coupled together with a chain strength. Chain strength is an additional parameter that can be optimized during the quantum subroutine. For our experiments, we used the automated minor-embedding tools provided by D-Wave for the Advantage 1.1 sampler, which automatically create minor embeddings on the QPU. The Advantage 1.1 sampler, which we used for our experiments, contains 5760 physical qubits, each of which can be connected directly to up to 15 others. We believe that future work could also examine optimization over chain strength parameters and embeddings, which was out of scope for this work, and we refer the interested reader to the D-Wave Ocean documentation for more details [36]. As we are currently in the NISQ era, there are limits on the number of features we can encode in one optimization run. We also note that while D-Wave provides hybrid tools for problems too large to fit on a single QPU, our problem sizes were small enough to fit on current-generation QPUs. Some statistical measures were calculated using the minepy library [37].
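A minimal sketch of submitting a binary quadratic model through Ocean's automated minor-embedding tools is shown below; a toy QUBO stands in for the feature-selection model of Sect. 2.2, and a configured Leap API token is assumed.

```python
import dimod
from dwave.system import DWaveSampler, EmbeddingComposite

# Toy three-variable QUBO standing in for the feature-selection model of Sect. 2.2.
bqm = dimod.BinaryQuadraticModel.from_qubo(
    {(0, 0): -1.0, (1, 1): -1.0, (2, 2): -1.0, (0, 1): 2.0})

sampler = EmbeddingComposite(DWaveSampler())   # automated minor embedding onto the QPU
sampleset = sampler.sample(bqm, num_reads=100)

best = sampleset.first.sample                  # lowest-energy sample found
selected = [v for v, bit in best.items() if bit == 1]
print(selected)
```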

3 Results and discussion

For our experiments, we looked at performance metrics in terms of prediction error (MAE) and feature selection subset size k over cross-validated, held-out test data for each of the two data sets. For the quantum-assisted routines, we took the average over the cross-validation folds of the best sample from a bootstrapped sample set of 10 samples.

Results for the Friedman 1 data set (Table 1) and UCI Auto data set (Table 2) show that the best results were obtained using PCC as the correlation statistic within the QUBO formulation of the quantum-assisted routine. This is reasonable: both data sets map to continuous target variables and, while there may be some nonlinear variables in the data, the results imply that the mapping of inputs to the target variable is mostly linear and efficiently captured under the linear assumptions of PCC.

Table 1 Table of results for the Friedman 1 data set. We include the Subset Accuracy metric (SA) here to understand the ratio of how many optimal features were selected out of the optimal subset \(\mathbf {k_{opt}}\). For this experiment, we see all of the quantum-assisted routines achieving approximately similar performance, with slight gains for QPCC-LR (Quantum-Assisted Pearson Correlation Coefficient with Multiple Linear Regression), which obtains the lowest averaged \(\mathbf {MAE = 2.27}\) and an optimal size of \(k= 5\), with a Subset Accuracy score of 0.9

For the Friedman 1 data set, the quantum-assisted routine using the Pearson correlation coefficient within the QUBO formulation and multiple linear regression as the learning algorithm (QPCC-LR) achieved the best performance, with the lowest averaged Mean Absolute Error, \(MAE = 2.27\), an optimal size of \(k=5\), and a Subset Accuracy score of 0.9, as shown in Table 1. While this predictive algorithm achieved the lowest error, note that the gradient-boosted decision trees using PCC (QPCC-GBR) performed comparably well. The Subset Accuracy scores (\(SA=0.9\)) for both indicate that the feature selection algorithm performed well in selecting the optimal features for prediction, although it did not achieve a perfect score of 1. This could be explained by the feature selection algorithm excluding features that are members of the generating subset but have low correlation with the target variable due to non-linearity or noise.

For the UCI Auto data set, the best score was obtained by the quantum-assisted routine using the Pearson correlation coefficient within the QUBO formulation together with gradient-boosted regression trees as the learning algorithm (QPCC-GBR), achieving the lowest \(MAE = 1471\) with a subset size of \(k=5\), as shown in Table 2. Although the other distance metrics did not outperform PCC, some were comparable, as in the case of MIC applied to LR on the Friedman data set, or MI applied to GBR on the Auto data set. This hints that these metrics may be as robust as PCC for certain applications and data sets, and there may be other data sets on which these statistics outperform PCC, emphasize the minimum redundancy component of the MRMR heuristic, and capture more information from nonlinear relationships. Overall, the quantum-assisted feature selection methods outperformed our baselines of all features, greedy selection, and recursive feature elimination on both data sets.

Table 2 Table of results for the UCI Auto data set. For this experiment, we see the best score for the quantum-assisted routine using the Pearson correlation coefficient with a gradient-boosted decision tree learning algorithm, obtaining the lowest \(MAE = 1471\) with a subset size of \(k=5\)

4 Conclusion

In conclusion, we show that by leveraging quantum-assisted routines within machine learning training and testing regimes, we achieve solutions that beat our defined benchmarks with regard to model performance on cross-validated data. We also show that the quantum-assisted feature selection routines are able to filter subsets of data with high subset accuracy. In addition, we found that quantum-assisted routines may offer the benefit of discovering sizes of k, given some slight tuning of \(\alpha\) and \(\lambda\), that yield performance improvements. Finally, we show that we are able to increase model performance while reducing the input dimensionality of the data given to the predictive models.

Future work could investigate optimizing these hyper-parameters using Bayesian optimization or other methodology in order to discover optimal subset sizes of k automatically. This investigation could also include optimizing the chain strengths of the minor embeddings from the binary quadratic model to the QPU on the quantum annealer.

While we limit the scope of this work to focus on using the quantum annealer as the device for our quantum-assisted routine, this problem could be formulated to run as an input problem Hamiltonian for a variational quantum algorithm on a gate model chip. We also note that in this work we compute distance metrics classically, and there could be some value in further investigation of encoding data onto a gate model QPU and computing distance on the QPU as a step before running a quantum approximate optimization algorithm, to get solution samples of subsets of features. Future work could also explore this point in more detail, as well as investigate the limits of feature selection problem sizes that can currently fit on NISQ-era chips.