1 Introduction

Data pre-processing techniques, along with exploratory analysis and feature engineering, are standard steps within pipelines for predictive model training and deployment [1]. In practice, before feeding data to a learning algorithm, it is common to cleanse data sets in order to improve the predictive performance and generalization capability of models.

Feature selection (FS) is a pre-processing technique in which a subset of features is selected from a data set; because of collinear, redundant, or constant features in the full data set, such a subset may carry more relevant information than the total set of features. Benefits of applying FS as a pre-processing step include reduced data dimensionality, better scaling for learning algorithms in practice, and reduced costs for model training and fitting. Feature selection differs from techniques such as feature engineering in that it does not project linear combinations of features into a new subspace, or perform a change of basis using an eigendecomposition of a covariance matrix, as in Principal Components Analysis [2]; rather, it filters out features that carry little information about a target variable. This is an important property, since the sparse representation and preservation of the original feature space allow for higher visibility and explainability when accounting for predictive model output [3]. Feature selection techniques may leverage the output of an estimator or model in the selection process, known as wrapper methods [4], or may use heuristics to select an optimal subset, known as filter methods. Feature selection methods have been used in a variety of applications, including areas such as character recognition and healthcare [5,6,7,8]. In this work, we examine quantum-assisted methods to find feature subsets of size k that may improve predictive performance while reducing the input feature space of our learning algorithms; finding such an optimal subset is a known NP-hard combinatorial optimization problem [9].

Our contribution is as follows: framing the combinatorial search as a binary quadratic model, we leverage a quantum device to search for optimal feature subsets of size k. We investigate various distance and correlation metrics, which we compute classically, for the formulation of the binary quadratic optimization problem. We apply the heuristic of Maximal Relevancy Minimal Redundancy (MRMR) [10] to the feature selection problem, with the idea that we want a subset of features that has a strong predictive signal with respect to the target variable (maximal relevancy) but low pairwise correlation between independent variables (minimal redundancy). We train and compare two types of regression models using quantum-assisted feature selection along with benchmark selection methods, including all features, greedy selection, and a wrapper method, over two data sets with continuous target values. We examine model predictive performance with this methodology and apply it to a real-world problem of data pre-processing for predictive models of vehicle prices. We show that by using quantum-assisted routines, we find combinations of features that increase the predictive quality of models on validation data and improve upon our benchmarks of all features, greedy feature selection, and recursive feature elimination.

1.1 Related work

The feature selection problem is well studied in the literature [11, 12]. Recent approaches apply mutual-information-based metrics to feature selection for supervised classification [13]. In recent years, new correlation metrics have been proposed which may have more expressive power in measuring relationships between variables. Examples are the Maximal Information Coefficient (MIC) [14] and the Generalized Mean Information Coefficient (GMIC) [15], which introduce the concept of equitability in variable relationships and may be more robust to non-linear relationships than correlation statistics such as the Pearson correlation coefficient, which assumes linear relationships.

In the current era of quantum computing, devices built around quantum processing units (QPUs) have come online, and applications leveraging them have been developed for solving real-world problems across various industries. For example, in [16] the authors used a quantum annealing system to optimize traffic flows around the city of Beijing, and the authors in [17] showed how to price options using quantum algorithms run on a gate-model quantum chip.

The feature subset selection problem was formulated by the authors in [18] as a quadratic program, where mutual information and the Pearson correlation coefficient were used to calculate the matrix \(\textbf{Q}\), a symmetric positive semi-definite matrix representing the quadratic terms of the multi-variate objective function to be minimized. There have been research efforts to apply quantum annealing to searching the feature space for optimal subsets using mutual-information-based formulations of the interaction and linear terms of Ising spin-glass models and quadratic unconstrained binary optimization [19, 20]. Some of these efforts have included complexity considerations for the scaling of the quantum-assisted feature subset selection problem, claiming performance gains of \(O(1/m^2)\) versus \(O(mn^2)\) for classical computation [20]. That work, in particular, claims a bound for the quantum-assisted routine given by the size of the minimum gap \(g(t) = E_1 - E_0\) between the energy eigenvalues during the anneal, with a resulting time complexity of \(T = O(1/g_{\min }^2)\), where T is the upper time limit.

Price prediction has been well studied in the machine learning literature [21,22,23]. In [21], the authors built predictive models for used-car prices in Mauritius. The models included Naive Bayes and decision tree algorithms, which contained a classification step with reported accuracy rates in the range of 60–70%, and achieved mean errors of 51000 and 27000 for the regression component using linear regression. Monburinon et al. tested various regression models for price prediction on German used cars in [22], with gradient-boosted decision trees outperforming random forest and multiple regression (mean squared error = 0.28). In [23] the authors applied feature selection techniques to multiple regression models for price prediction and reported a model-fitness score of \(R^2 = 0.9861\).

2 Methods

Consider a data set:

$$\begin{aligned} \mathbb {D} = \{\textbf{X}\in \mathbb {R}^{N\times M}, \textbf{y}\in \mathbb {R}^N\} \end{aligned}$$

where \(\textbf{X}\) is a data matrix with N rows and M columns, and \(\textbf{y}\) is a column vector with N rows and one column. We wish to find an approximate functional mapping, or hypothesis, \(h(\textbf{X}) \simeq \textbf{y}\) using various learning algorithms. This setting is known as supervised learning: the goal is to minimize the generalization error of a given loss function between predictions and the target variable, so that we can make predictions on new data, i.e., to find the best hypothesis from the hypothesis space that the given learning algorithm encompasses.

In the feature subset selection problem we wish to find a subset of features \(\mathbf {X_k}\) in which each row vector \(\mathbf {x_i}\) of \(\textbf{X}\) (where \(i = \{1, 2, \ldots , N\}\)) is reduced to k columns, i.e., each row vector is filtered down to \(\{x_{i1}, x_{i2}, \ldots , x_{ik}\}\) with \(k < M\). We may investigate various values of k such that the loss of \(h(\mathbf {X_k}) \simeq \textbf{y}\) is less than or equal to that of \(h(\textbf{X}) \simeq \textbf{y}\), cross-validated on a held-out test set. Essentially, the feature subset selection problem is one of choosing the optimal subset of k columns from \(\textbf{X}\). We assume that an optimal subset of size k exists, which may not be the case for all data sets; for some data sets the best achievable subset satisfies \(h(\mathbf {X_k}) = h(\textbf{X})\) only with \(k = M\).
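To make this concrete, the short sketch below evaluates a candidate subset of k columns against the full feature set on a held-out split; the candidate indices shown are hypothetical and only for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def holdout_mae(X, y, columns=None, seed=0):
    """Fit a linear model on the given columns and return the hold-out MAE."""
    if columns is not None:
        X = X[:, columns]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_absolute_error(y_te, model.predict(X_te))

# Compare the full feature set against a hypothetical candidate subset of k = 5 columns:
# holdout_mae(X, y)                      # all M features
# holdout_mae(X, y, [0, 3, 7, 12, 21])   # candidate subset X_k
```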

2.1 Relevancy, correlation and distance metrics

In order to compare features with a target variable, and to investigate pairwise relationships within the feature set \(\textbf{X}\), we first define the distance functions that we compute classically and use to formulate the binary optimization problem for feature subset selection, whose output in turn feeds our predictive models. In our experiments, we swap out and compare each distance function when modelling the linear and quadratic terms of the binary quadratic model. Note that we compute these distances classically before using them as the terms of a binary quadratic model, which we sample using a quantum annealer to obtain subsets of features.

We examined Maximal Information Coefficient (MIC), Generalized Mean Information Coefficient (GMIC), Mutual Information (MI), and Pearson Correlation Coefficient (PCC) for this study. Let us define each in the following:

Mutual Information (MI) is defined as:

$$\begin{aligned} MI(\textbf{x, y}) = \sum _{\textbf{x, y}} P(\textbf{x, y}) \ln {{P(\textbf{x, y})}\over {P(\textbf{x}) P(\textbf{y})}} \end{aligned}$$
(1)

In our case, we deal with mixes of purely continuous as well as continuous and discrete variables, and there are various strategies for binning continuous values to estimate the probabilities in the MI calculation, including kernel density estimation, discretizing continuous variables into bins, and clustering algorithms. For our method we use the k-nearest-neighbor (k-nn) binning strategy from [24] to estimate the MI:

$$\begin{aligned} I(\textbf{x, y}) = \psi {(N)}- \langle \psi {(\textbf{x})}\rangle +\psi {(\kappa )} - \langle \psi {(\textbf{m})}\rangle \end{aligned}$$
(2)

where \(\psi\) is the digamma function, \(\kappa\) is the chosen number of nearest neighbors, and \(\textbf{m}\) is the count of neighboring points. For more details please also see [25].
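As a hedged illustration, scikit-learn's mutual_info_regression implements a k-NN estimator of this same family, so relevancy scores can be estimated along the following lines (toy data and illustrative parameter choices):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5))                      # toy feature matrix
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + X[:, 3] + rng.normal(size=100)

# k-NN based MI estimate between each feature column and the continuous target.
mi_scores = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
print(mi_scores)  # higher values indicate more relevant features
```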

Various algorithms have been proposed to handle mixes of discrete and continuous, as well as purely continuous, variables by mapping mutual information onto a grid. MIC and GMIC belong to this category.

MIC is defined as follows [15]. First, we define a characteristic matrix C:

$$\begin{aligned} C(\textbf{x, y})_{i,j} = \frac{I^*(\textbf{x,y})_{i,j}}{\log _2 \min \{i, j\}} \end{aligned}$$
(3)

where \(I^*(\textbf{x,y})_{i,j}\) represents the binned values of \(\textbf{x,y}\) in a grid \(G_{ij}\). We then obtain the MIC using this characteristic matrix:

$$\begin{aligned} MIC(\textbf{x,y}) = \max _{ij<B(n)}\{C(\textbf{x, y})_{i,j}\} \end{aligned}$$
(4)

where \(B(n) = n^{0.6}\) is the maximal grid size recommended in the original text [14]. Note that ij in this equation denotes a product describing the grid size.

We can define GMIC by taking the characteristic matrix and extending it to find the maximal characteristic matrix:

$$\begin{aligned} C^*(\textbf{x, y})_{i,j} = \max _{kl<ij}\{C(\textbf{x, y})_{kl}\} \end{aligned}$$
(5)

where we again use the terms kl and ij to denote grid sizes. We may then use this to define the GMIC measure, as given in [15]:

$$\begin{aligned} GMIC(\textbf{x, y}) = \left( \frac{1}{Z} \sum _{ij<B(n)} C^*(\textbf{x, y})_{i,j}^p \right) ^{1/p} \end{aligned}$$
(6)

where \(p \in [-\infty , \infty ]\). For our work, we set \(p = -1\), and Z is the normalizing count of grid sizes ij with \(ij \le B(n)\), as outlined in the original work [15].

There is a lively debate in the literature about the statistical power of these methodologies [26]. We note that these methods incur additional computational overhead from allocating grids, as well as from tuning parameters such as B(n) in the case of MIC and p in GMIC.

In our study, we reviewed the performance of the Pearson Correlation Coefficient (PCC). This statistical measure is ubiquitous in science and engineering for measuring linear relationships between variables.

Pearson correlation coefficient is defined as [27]:

$$\begin{aligned} PCC(\textbf{x, y}) = \frac{ \sum _{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum _{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum _{i=1}^{n}(y_i-\bar{y})^2}} \end{aligned}$$
(7)

where \(\bar{x}\) and \(\bar{y}\) are the means of \(\textbf{x}\) and \(\textbf{y}\).
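To illustrate how these statistics are obtained classically before the QUBO is assembled, the sketch below computes MIC with the minepy library and PCC with scipy on toy data; the parameter choices (e.g., \(\alpha = 0.6\), corresponding to \(B(n) = n^{0.6}\)) are illustrative.

```python
import numpy as np
from minepy import MINE
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = x ** 2 + rng.normal(scale=0.1, size=200)   # nonlinear relationship

# Maximal Information Coefficient; alpha = 0.6 corresponds to B(n) = n**0.6.
mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)
print("MIC:", mine.mic())

# Pearson correlation coefficient captures only the linear part of the association.
print("PCC:", pearsonr(x, y)[0])
```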

2.2 QUBO formulation

In order to formulate our problem as a quadratic unconstrained binary optimization (QUBO) model, we first need to calculate a distance measure between each column vector i of \(\textbf{X}\) and \(\textbf{y}\). We take the absolute value of each distance measure and negate it, since we wish to find a minimum of our optimization problem. These values then become the linear terms along the diagonal of the \(\textbf{Q}\) matrix.

With these linear terms, we encode the \(maximal \ relevancy\) portion of our heuristic; we negate the values because the optimization takes the form of finding the minimum over the domain. By taking the absolute value of the distance function, we treat positive and negative correlation equally, giving greater weight to features with higher relevancy or correlation with the target variable.

Then, we calculate the distance between each pair of column vectors indexed i and j in \(\textbf{X}\). This gives the interactions for the quadratic terms of the binary quadratic model, which become the entries in the upper triangle of the \(\textbf{Q}\) matrix. This encodes the \(minimum \ redundancy\) characteristic of our heuristic: we want combinations of features that are uncorrelated, or more distant from each other, while retaining relevancy to the target variable.

$$\begin{aligned} \textbf{Q}_{ij} = {\left\{ \begin{array}{ll} - \mid {distance(X_i, y)} \mid , &{} \text {if}\ i=j \\ \mid {distance(X_i, X_j)} \mid , &{} \text {if}\ i < j \\ 0, \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

We combine these to obtain the QUBO formulation of our optimization problem. For clarity, we use the \(\mathbf {\omega }\) term here to represent a vector of qubit values. We include a scaling parameter \(\alpha\) that we use to scale the domain of the optimization landscape.

$$\begin{aligned} E({\omega }) = \alpha \sum _{i\le j}{{\omega }}_i\textbf{Q}_{ij}{\omega }_j \qquad {\omega }_i \in \{0, 1\} \end{aligned}$$
(9)

Finally, we impose a penalty term, such that the resulting sample from our objective function has only k qubits turned "on", i.e., with a value of 1. We introduce a scaling parameter for the penalty term, \(\lambda\), to enforce this constraint.

$$\begin{aligned} E({\omega }) = \alpha \sum _{i\le j}{\omega }_i\textbf{Q}_{ij}\mathbf {\omega }_j + \lambda (\sum _i{\omega }_i - k)^2 \end{aligned}$$
(10)

We then use this formulation to follow an annealing schedule and sample solutions that minimize the energy E. In the resulting vector \(\mathbf {\omega }\), which has length M, the constraint ensures that k qubits have value 1 and the rest 0. We then use this vector to filter the data set \(\textbf{X}\) and obtain \(\mathbf {X_k}\), keeping the k columns at the index locations m where \(\omega _m = 1\), for \(m \in \{1, 2, \ldots , M\}\).
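As a minimal sketch of this construction (not the exact implementation used in our experiments), the function below fills the \(\textbf{Q}\) matrix of Eq. (8) from precomputed distance values, expands the penalty of Eq. (10) into linear and quadratic terms using \(\omega _i^2 = \omega _i\) for binary variables, and wraps the result in a dimod binary quadratic model; the array names and default parameter values are placeholders.

```python
import dimod

def build_bqm(relevance, redundancy, k, alpha=1000.0, lam=10.0):
    """relevance: length-M array of |distance(X_i, y)|;
    redundancy: M x M array of |distance(X_i, X_j)| (upper triangle used)."""
    M = len(relevance)
    Q = {}
    for i in range(M):
        # Linear term: negated relevancy (maximal relevancy), plus the
        # lambda * (1 - 2k) contribution from expanding (sum_i w_i - k)^2
        # with w_i^2 = w_i for binary variables.
        Q[(i, i)] = alpha * (-abs(relevance[i])) + lam * (1 - 2 * k)
        for j in range(i + 1, M):
            # Quadratic term: pairwise redundancy, plus the 2 * lambda penalty term.
            Q[(i, j)] = alpha * abs(redundancy[i][j]) + 2 * lam
    # The constant offset lambda * k**2 completes the squared penalty.
    return dimod.BinaryQuadraticModel.from_qubo(Q, offset=lam * k ** 2)
```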

2.3 Data sets

For our experiments, we tested our feature selection algorithms on two data sets, one synthetic, and one drawn from real-world samples.

The first data set that we used was the Friedman 1 data set [28]. Here, the data are generated by the function:

$$\begin{aligned} \textbf{y} = 10 \sin (\pi x_1x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \epsilon \end{aligned}$$
(11)

for each row \(\mathbf {x_i} \in \textbf{X}\), where \(\epsilon\) is random, normally distributed noise, and the remaining features are independent and drawn from a uniform distribution on the interval [0, 1]. We generated 100 instances of this data, with a feature size of 50, for our 70/30 train/test split. We specifically tested this data set in order to study differences in performance between the various mutual-information-based metrics within the QUBO formulation, since the generating function contains a non-linear term and one of the reported strengths of these distance metrics is in measuring non-linear relationships. Another advantage of experimenting with this data set is that the optimal subset of features is known in advance, so we can determine how accurately the feature selection algorithms recover the optimum.
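An equivalent data set can be generated with scikit-learn's make_friedman1 helper; the sketch below mirrors the setup described above, with the noise level chosen for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

# 100 samples, 50 features; only the first five features enter the generating function.
X, y = make_friedman1(n_samples=100, n_features=50, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```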

The next data set that we tested was for vehicle price prediction, using the open-source Auto data set from the UCI machine learning repository [29]. In this data set, we have prices for 205 automobiles, along with other features such as fuel type, engine type, and engine size. We encoded all categorical and nominal features using ordinal encoding, which preserved the attribute count of 26. This data set also includes the target variable price, in the range of \([5118, \ldots , 45400]\).
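A minimal sketch of the encoding step is given below; the file name and column handling are illustrative assumptions rather than the exact pre-processing pipeline used.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical local copy of the UCI Automobile data file.
df = pd.read_csv("imports-85.data", header=None, na_values="?").dropna()

# Ordinally encode the non-numeric columns, preserving the attribute count.
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])
```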

Fig. 1 Correlation plot for the UCI Auto data, using the Pearson correlation coefficient as a distance metric between features. Higher values of positive correlation are indicated in dark blue and negative correlation in red. In the feature subset selection problem, we wish to select a subset of features such that the relevancy with regard to a target variable, in this case price, is maximized, and the feature-to-feature correlation amongst independent variables is minimized

In performing an exploratory analysis of the Auto data, we examined the correlation between the features and the target variable. Reviewing Fig. 1 for the UCI Auto data, we notice several features strongly correlated with the target variable price. In terms of positive correlation, curb weight (PCC = 0.80) and engine size (PCC = 0.84) show a strong linear relationship, while city and highway miles per gallon (PCC = \(-0.66\), PCC = \(-0.69\)) show a strong negative correlation.

2.4 Estimators

During the training/testing regime, we used a 70/30 percentage split between training and test sets. We measured the performance of two different types of predictive models for the underlying regression problem. These models included the following:

Multiple linear regression (LR): We used a multiple linear regression model to estimate predicted values of \(\mathbf {\hat{y}}\) [30]. Multiple Linear Regression is defined as:

$$\begin{aligned} \textbf{y} = \textbf{X} \textbf{w} + \mathbf {\epsilon } \end{aligned}$$
(12)

where \(\textbf{w}\) is the vector of parameters that we wish to optimize such that \(\epsilon = \textbf{y} - \textbf{X}\textbf{w}\) is minimized. We may then apply the trained parameters to new test data \(\textbf{X}\) to obtain \(\mathbf {\hat{y}}\). We use the term multiple linear regression to clarify that we have two or more independent variables for which we would like to find a mapping to a target variable.
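For illustration, the least-squares parameters can be computed directly, for example with numpy's lstsq on toy data (an intercept column is appended explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so that the intercept is part of w.
X1 = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares estimate of w
y_hat = X1 @ w                               # predictions from the fitted parameters
```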

Gradient boosted regression trees (GBR): We evaluated a second regression model, gradient-boosted ensembles of regression trees, to benchmark against LR. The gradient-boosted regression trees algorithm sequentially builds trees from the features of the training data, and the combined trees form an ensemble that outputs a prediction given new data. For further reading, insightful discussion, and definitions of tree-based learning methods, please see [31].

The performance criterion for our trained models was the mean absolute error (MAE). Model validation was performed using cross-validation with 3 randomized held-out folds.
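A sketch of this evaluation loop, using scikit-learn's negated-MAE scoring convention and 3 folds, is shown below; the estimator settings are illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mae(estimator, X, y, folds=3):
    """Average MAE over the cross-validation folds (scores are negated by scikit-learn)."""
    scores = cross_val_score(estimator, X, y, cv=folds,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()

# Example usage on a selected feature subset X_k and target y:
# cv_mae(LinearRegression(), X_k, y)
# cv_mae(GradientBoostingRegressor(random_state=0), X_k, y)
```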

In examining the Friedman 1 data set, we designed a performance metric to test whether or not the FS algorithm selected the first 5 features in the set, which we know to be the optimal subset. We call this metric \(\textit{Subset accuracy}\) and build it from a hit score: the cardinality of the intersection of the set of selected feature indices \(\mathbf {k_{selected}}\) and the set of optimal feature indices \(\mathbf {k_{opt}}\), divided by the cardinality of the optimal feature set.

$$\begin{aligned} \text{hit score} = \frac{card(\mathbf {k_{selected}} \cap \mathbf {k_{opt}})}{card(\mathbf {k_{opt}})} \end{aligned}$$
(13)

We then calculate a length score by taking the absolute difference between the cardinality of \(\mathbf {k_{selected}}\) and the cardinality of \(\mathbf {k_{opt}}\), normalizing by the cardinality of \(\mathbf {k_{opt}}\), and subtracting this value from 1.

$$\begin{aligned} \text{length score} = 1 - \frac{\mid {card}(\mathbf {k_{selected}}) - {card}(\mathbf {k_{opt}})\mid }{card(\mathbf {k_{opt}})} \end{aligned}$$
(14)

We then simply sum the two and divide by two to obtain Subset accuracy:

$$\begin{aligned} \text{Subset accuracy} = \frac{\text{hit score} + \text{length score}}{2} \end{aligned}$$
(15)
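A direct translation of Eqs. (13)–(15) into code is sketched below, assuming the feature indices are given as sets.

```python
def subset_accuracy(k_selected, k_opt):
    """Combine the hit score and length score of Eqs. (13)-(15)."""
    k_selected, k_opt = set(k_selected), set(k_opt)
    hit = len(k_selected & k_opt) / len(k_opt)
    length = 1 - abs(len(k_selected) - len(k_opt)) / len(k_opt)
    return (hit + length) / 2

# Example: the Friedman 1 generating subset is the first five features.
print(subset_accuracy({0, 1, 2, 4, 7}, {0, 1, 2, 3, 4}))  # 0.9
```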

2.5 Baselines, filter and wrapper methods

For each quantum-assisted feature selection method, we bootstrapped each run with 10 result sets, and for each result we queried the quantum processing unit (QPU) for 10000 shots. We evaluated each result set, took the best overall result, and averaged it over the folds of cross-validation. For each data set, we compared the quantum-assisted feature selection methods against the following methods, each cross-validated using 3 folds with replacement:

  • All features (All): We initially fit the estimator over all features in the training set of \(\textbf{X}\) and \(\textbf{y}\) in order to understand the test error and establish a baseline for performance criteria.

  • Greedy Ranked Method (GR): We devised a simple ranking algorithm: for each feature column vector \(\mathbf {x_k} \in \mathbf {X_k}\), we calculated \(MIC(\mathbf {x_k, y})\), sorted the features by this relevancy criterion, and selected the top k, in this case the top fifty percent of ranked features. We used this heuristic since we did not have an a priori intuition as to the best k.

  • Recursive Feature Elimination (RFE): We also tested a wrapper-style feature selection algorithm, Recursive Feature Elimination (RFE), as shown in [32], in order to benchmark our quantum-assisted method against a wrapper method. For further implementation details please see [32]. A brief usage sketch of the GR and RFE baselines follows this list.
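The sketch below illustrates both baselines under stated assumptions: MIC scores for the greedy ranking are computed with minepy, and RFE wraps a linear model via scikit-learn; neither is necessarily the exact configuration used in our experiments.

```python
import numpy as np
from minepy import MINE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def greedy_ranked(X, y, top_fraction=0.5):
    """Rank features by MIC against the target and keep the top fraction."""
    mine = MINE(alpha=0.6, c=15)
    scores = []
    for j in range(X.shape[1]):
        mine.compute_score(X[:, j], y)
        scores.append(mine.mic())
    k = max(1, int(top_fraction * X.shape[1]))
    return np.argsort(scores)[::-1][:k]      # indices of the selected features

def rfe_selected(X, y, k):
    """Wrapper-style baseline: recursive feature elimination around a linear model."""
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X, y)
    return np.flatnonzero(selector.support_)
```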

2.6 Parameters for experimentation

For each data set and each estimator, we manually tuned the parameters \(\alpha\) and \(\lambda\) within the QUBO formulation to values of 1000 and 10, respectively. With this, we were able to obtain reasonable results over our baseline methods. This can be explained by the parameter \(\alpha =1000\) producing a scaling effect, creating a more rugged objective landscape to optimize over, while the parameter \(\lambda =10\) had the effect of constraining the result sets to smaller or larger ranges for the size of k. An interesting detail emerged when testing various settings for \(\lambda\): scaling to different values sometimes led to values of k that improved model fit and predictive performance compared with restricting the output to a pre-defined k. While optimizing these hyper-parameters was not in the scope of this work, we believe that future work could investigate this point in further detail.

2.7 Implementation details

Python code for the experiments was written using the libraries scikit-learn [33], pandas [34], and scipy [35]. Access to the D-Wave quantum annealer was obtained through the Leap software and API and the dimod library. One important point to note is that D-Wave provides a set of software tools to embed the binary quadratic model on the QPU. As the QPUs do not have full connectivity, a minor embedding must be created using chains of physical qubits: groups of physical qubits represent one logical qubit with full connectivity and are coupled together with a chain strength. Chain strength is an additional parameter that can be optimized during the quantum subroutine. For our experiments, we used the automated minor-embedding tools provided by D-Wave for the Advantage 1.1 sampler, which automatically create minor embeddings on the QPU. The Advantage 1.1 sampler, which we used for our experiments, contains 5760 physical qubits, each of which can be connected directly to up to 15 others. We believe that future work could also examine optimization over chain strength parameters and embeddings, which was out of scope for this work, and we refer the interested reader to the D-Wave Ocean documentation for more details [36]. As we are currently in the NISQ era, there are limits on the number of features we can encode in one optimization run. We also note that while D-Wave provides hybrid tools for problems too large to fit on a single QPU, our problem sizes were small enough to fit on current-generation QPUs. Some statistical measures were calculated using the minepy library [37].
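A minimal sketch of submitting a binary quadratic model through Ocean's automated minor-embedding tools is shown below; a toy QUBO stands in for the feature-selection model of Sect. 2.2, and a configured Leap API token is assumed.

```python
import dimod
from dwave.system import DWaveSampler, EmbeddingComposite

# Toy three-variable QUBO standing in for the feature-selection model of Sect. 2.2.
bqm = dimod.BinaryQuadraticModel.from_qubo(
    {(0, 0): -1.0, (1, 1): -1.0, (2, 2): -1.0, (0, 1): 2.0})

sampler = EmbeddingComposite(DWaveSampler())   # automated minor embedding onto the QPU
sampleset = sampler.sample(bqm, num_reads=100)

best = sampleset.first.sample                  # lowest-energy sample found
selected = [v for v, bit in best.items() if bit == 1]
print(selected)
```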

3 Results and discussion

For our experiments, we looked at performance metrics in terms of prediction error (MAE) and feature selection subset size k over cross-validated, held-out test data for each of the two data sets. For the quantum-assisted routines, we took the average over the cross-validation folds of the best sample from a bootstrapped sample set of 10 samples.

Results for the Friedman 1 data set (Table 1) and UCI Auto data set (Table 2) show that the best results were obtained using PCC as the correlation statistic within the QUBO formulation of the quantum-assisted routine. This is reasonable: both data sets map to continuous target variables and, while there may be some nonlinear variables in the data, the results imply that the mapping of inputs to the target variable is mostly linear and efficiently captured under the linear assumptions of PCC.

Table 1 Table of results for the Friedman 1 data set. We include the Subset Accuracy metric (SA) here to understand the ratio of how many optimal features were selected out of the optimal subset \(\mathbf {k_{opt}}\). For this experiment, we see all of the quantum-assisted routines achieving approximately similar performance, with slight gains for QPCC-LR (Quantum-Assisted Pearson Correlation Coefficient with Multiple Linear Regression), which obtains the lowest averaged \(\mathbf {MAE = 2.27}\) and an optimal size of \(k= 5\), with a Subset Accuracy score of 0.9

For the Friedman 1 data set, the quantum-assisted routine using the Pearson correlation coefficient within the QUBO formulation and multiple linear regression as the learning algorithm (QPCC-LR) achieved the best performance, with the lowest averaged Mean Absolute Error, \(MAE = 2.27\), an optimal size of \(k=5\), and a Subset Accuracy score of 0.9, as shown in Table 1. While this predictive algorithm achieved the lowest error, note that the gradient-boosted decision trees using PCC (QPCC-GBR) performed comparably well. The Subset Accuracy scores (\(SA=0.9\)) for both indicate that the feature selection algorithm performed well in selecting the optimal features for prediction, although it did not achieve a perfect score of 1. This could be explained by the feature selection algorithm excluding features that are members of the generating subset but have low correlation with the target variable due to non-linearity or noise.

For the UCI Auto data set, the best score was obtained by the quantum-assisted routine using the Pearson correlation coefficient within the QUBO formulation together with gradient-boosted regression trees as the learning algorithm (QPCC-GBR), achieving the lowest \(MAE = 1471\) with a subset size of \(k=5\), as shown in Table 2. Although the other distance metrics did not outperform PCC, some were comparable, as in the case of MIC applied to LR on the Friedman data set, or MI applied to GBR on the Auto data set. This hints that these metrics may be as robust as PCC for certain applications and data sets, and there may be other data sets on which these statistics outperform PCC, emphasize the minimum redundancy component of the MRMR heuristic, and capture more information from nonlinear relationships. Overall, the quantum-assisted feature selection methods outperformed our baselines of all features, greedy selection, and recursive feature elimination on both data sets.

Table 2 Table of results for the UCI Auto data set. For this experiment, we see the best score for the quantum-assisted routine using the Pearson correlation coefficient with a gradient-boosted decision tree learning algorithm, obtaining the lowest \(MAE = 1471\) with a subset size of \(k=5\)

4 Conclusion

In conclusion, we show that by leveraging quantum-assisted routines within machine learning training and testing regimes, we achieve solutions that beat our defined benchmarks with regard to model performance on cross-validated data. We also show that the quantum-assisted feature selection routines are able to filter subsets of data with high subset accuracy. In addition, we found that quantum-assisted routines may offer the benefit of discovering sizes of k, given some slight tuning of \(\alpha\) and \(\lambda\), that yield performance improvements. Finally, we show that we are able to increase model performance while reducing the input dimensionality of the data given to the predictive models.

Future work could investigate optimizing these hyper-parameters using Bayesian optimization or other methodology in order to discover optimal subset sizes of k automatically. This investigation could also include optimizing the chain strengths of the minor embeddings from the binary quadratic model to the QPU on the quantum annealer.

While we limit the scope of this work to focus on using the quantum annealer as the device for our quantum-assisted routine, this problem could be formulated to run as an input problem Hamiltonian for a variational quantum algorithm on a gate model chip. We also note that in this work we compute distance metrics classically, and there could be some value in further investigation of encoding data onto a gate model QPU and computing distance on the QPU as a step before running a quantum approximate optimization algorithm, to get solution samples of subsets of features. Future work could also explore this point in more detail, as well as investigate the limits of feature selection problem sizes that can currently fit on NISQ-era chips.