1 Introduction

Machine learning (ML) models excel in a wide range of data analytics tasks, such as classification, regression, and data generation. However, most model families scale with the number of input data dimensions, i.e., with the number of features. That is, a model with more features requires more memory and more computational effort for its training. Small and fast models are key for applications with tight resource constraints, e.g., in embedded systems. Therefore, it is a common goal in ML pipelines to reduce the number of features with minimum information loss by a suitable transformation from raw data to training data, a strategy also known as dimensionality reduction (Van Der Maaten et al. 2009).

An important example of such a strategy is feature selection (FS), where the input dimension is reduced by selecting only a subset of all available features without performing additional transformations (Chandrashekar and Sahin 2014). This approach is especially effective when dealing with data from sources that produce many redundant or irrelevant features, which can be eliminated without significantly impacting the output quality. Consider, for example, trying to diagnose a specific disease from a vast array of medical measurements such as body temperature, concentration of various substances in a patient’s blood, or heart rate. Reducing the number of features that are necessary for this diagnosis not only allows for smaller models but might even help experts pin down what causes the disease. Consequently, FS can be seen as a tool to both reduce model complexity and improve ML interpretability.

The main contribution of this paper consists of two parts. First, we propose a novel FS algorithm for selecting a specific number of features using a quadratic unconstrained binary optimization formulation that can be applied to any combination of redundancy and importance measures. Second, we benchmark our proposed algorithm on different data sets using both classical and actual quantum hardware to demonstrate its effectiveness.

The remaining paper is organized as follows. First, in Section 2, we introduce our FS algorithm. Subsequently, we discuss in Section 3 how unconstrained binary optimization problems, on which our FS algorithm is based, can be solved. We relate our contribution to previous research results in Section 4. In Section 5, we perform and evaluate a series of numerical experiments. Finally, we close with a conclusion in Section 6.

2 Method

In the present section, we describe our proposed FS algorithm using a quadratic unconstrained binary optimization (QUBO) problem, which can be solved either by classical methods or with quantum computing. We start with a general problem definition and subsequently prove that the number of features can be selected by tuning a user-defined weighting parameter. Based on these prerequisites, we present our FS algorithm.

2.1 QUBO feature selection

Presume a classification task on a data set \(\mathcal {D}:=\lbrace (\boldsymbol {x}^{i},y^{i})\rbrace _{i\in [N]}\) with n-dimensional features \(\boldsymbol {x}^{i} \in \mathcal {X} \subseteq \mathbb {R}^{n}\) and class labels \(y^{i} \in \mathcal {Y} \subseteq \mathbb {N}\) for all i ∈ [N], where [N] denotes the set \(\lbrace 1,\dots ,N\rbrace \). The problem of FS corresponds to finding a subset S ⊂ [n] of these n features, such that the reduced data set \(\mathcal {D}_{S}=\lbrace ({\boldsymbol {x}^{i}_{S}},y^{i})\rbrace _{i\in [N]}\) with \(\boldsymbol{x}_{S} := (x_{j})_{j\in S}\) leads to comparable performance to the original data for some data-driven task, such as classification. Typically, this subset is found by solving a suitably posed optimization problem, which can also explicitly depend on the classification model.

We propose a model-independent formulation

$$ \boldsymbol{x}^{*} := \underset{\boldsymbol{x}\in\lbrace 0,1\rbrace^{n}}{\arg\min} Q(\boldsymbol{x},\alpha) $$
(1)

to obtain the selected features \(\boldsymbol {x}^{*} \in \mathcal {X}^{*} \subset \mathcal {X}\). We represent the subset S as a binary indicator vector \(\boldsymbol {x}:=(x_{1},\dots ,x_{n})\in \lbrace 0,1\rbrace ^{n}\), such that i ∈ S if and only if xi = 1 for all i ∈ [n]. The objective function reads

$$ Q(\boldsymbol{x},\alpha) := - \alpha \sum\limits_{i=1}^{n} I_{i} x_{i} + (1-\alpha) \sum\limits_{i,j=1}^{n} R_{ij} x_{i} x_{j}, $$
(2)

where the user-defined parameter α ∈ [0,1] balances the influence of the two terms, which we specify in the following as importance term and redundancy term. The importance term contains the elements

$$ I_{i} := I(x_{i};y) \geq 0 $$
(3)

of the importance vector \(\boldsymbol {I}\in \mathbb {R}_{0+}^{n}\). The importance vector represents the mutual information I(xi;y) of the individual features \(x_{1},\dots ,x_{n}\) with the class label y and is therefore a measure for the importance of each feature. In our objective function, the importance is maximized. Furthermore, the redundancy term contains the elements

$$ R_{ij} := I(x_{i};x_{j}) \geq 0 $$
(4)

of the pairwise redundancy matrix \(\boldsymbol {R}\in \mathbb {R}^{n \times n}\), which by definition is symmetric and positive semidefinite. This matrix represents the mutual information (MI) I(xi;xj) among the individual features and therefore measures their redundancy. For i = j, we set Rii = 0, since a feature is not redundant by itself. In our objective function, the redundancy is minimized.

The calculation of mutual information requires explicit knowledge about the joint probability mass function of features and labels and the corresponding marginals, which are in general difficult to estimate empirically for real-valued data. Therefore, we map all available feature values from the data set \(\mathcal {D}\) into B discrete bins. Specifically, for each separate feature dimension i, we take all B + 1 \(\ell/B\)-quantiles for \(\ell \in \lbrace 0,\dots ,B\rbrace \), which we denote by \(q^{\ell }_{i}\). With these, we define bins \({\mathscr{B}}^{\ell }_{i}\) as intervals \([q^{\ell -1}_{i},q^{\ell }_{i})\) for \(\ell\) ∈ [B − 1], and \({{\mathscr{B}}^{B}_{i}}=[q^{B-1}_{i},{q^{B}_{i}}]\). Finally, we set \({b_{i}^{j}}:=\ell \) for the single \(\ell\) that fulfills \({x_{i}^{j}}\in {\mathscr{B}}^{\ell }_{i}\). Since the labels are discrete by definition, no binning is necessary in \(\mathcal {Y}\). This way, we obtain a discretized data set \(\hat {\mathcal {D}}=\lbrace (\boldsymbol {b}^{i},y^{i})\rbrace _{i\in [N]}\) with \({b^{j}_{i}}\in [B]\) for all j ∈ [N] and i ∈ [n].
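A minimal numpy sketch of this binning step (ties between quantiles are not treated specially; the function name is ours for illustration):

```python
import numpy as np

def discretize(X, B=20):
    """Quantile binning of real-valued features: returns integer bin indices in [1, B]."""
    N, n = X.shape
    Xb = np.empty((N, n), dtype=int)
    for i in range(n):
        # the B + 1 quantiles q_i^0, ..., q_i^B serve as bin edges
        edges = np.quantile(X[:, i], np.linspace(0.0, 1.0, B + 1))
        # interior edges split the value range into B bins; the maximum falls into bin B
        Xb[:, i] = np.digitize(X[:, i], edges[1:-1], right=False) + 1
    return Xb
```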

The empirical probability mass function after discretization reads

$$ \hat{p}(\boldsymbol{b},y) := \frac{1}{N}\sum\limits_{j\in[N]} \mathbb{1}\left[\boldsymbol{b}^{j}=\boldsymbol{b} \wedge y^{j}=y\right] $$
(5)

with the indicator function

$$ \mathbb{1}\left[P\right] := \begin{cases} 1 & \text{if } P \text{ is true} \\ 0 & \text{otherwise} \end{cases} $$
(6)

defined for logical statements P. Consequently, we can approximate the mutual information terms in Eqs. 3 and 4 as

$$ I(x_{i};y) \approx \sum\limits_{b\in[B]}\sum\limits_{y\in\mathcal{Y}}\hat{p}_{X_{i},Y}(b,y)\log\left( \frac{\hat{p}_{X_{i},Y}(b,y)}{\hat{p}_{X_{i}}(b)\hat{p}_{Y}(y)}\right) $$
(7)

and

$$ I(x_{i};x_{j}) \approx \sum\limits_{b\in[B]} \sum\limits_{b^{\prime}\in[B]}\hat{p}_{X_{i},X_{j}}(b,b^{\prime})\log\left( \frac{\hat{p}_{X_{i},X_{j}}(b,b^{\prime})}{\hat{p}_{X_{i}}(b)\hat{p}_{X_{j}}(b^{\prime})}\right), $$
(8)

respectively, where we make use of the marginals

$$ \begin{array}{@{}rcl@{}} \hat{p}_{X_{i},X_{j}}(b_{i},b_{j}) := \sum\limits_{y\in\mathcal{Y}, b_{k}\in[B] \forall k \neq i,j} \hat{p}(\boldsymbol{b},y), \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} \hat{p}_{X_{i},Y}(b_{i},y) := \sum\limits_{b_{k}\in[B]\forall k \neq i} \hat{p}(\boldsymbol{b},y), \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} \hat{p}_{X_{i}}(b_{i}) := \sum\limits_{y\in\mathcal{Y}, b_{k}\in[B] \forall k \neq i} \hat{p}(\boldsymbol{b},y), \end{array} $$
(11)

and

$$ \begin{array}{@{}rcl@{}} \hat{p}_{Y}(y) := \sum\limits_{b_{k}\in[B] \forall k} \hat{p}(\boldsymbol{b},y) \end{array} $$
(12)

for the probability mass functions of feature subsets and labels. Discretization allows us to approximate the MI values while greatly simplifying the estimation procedure, since no assumption on the underlying probability distribution of the data is required. Moreover, the estimation is consistent, i.e., when the number of bins approaches infinity, we will recover the true MI between continuous features (Mandros et al. 2020, Theorem 3.2). To simplify our notation, we omit the dependence of Rij and Ii on B.
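Given the binned data, I and R can be estimated with a plug-in estimator; a sketch using scikit-learn's mutual_info_score, which uses the natural logarithm (the constant factor does not change the minimizer of Eq. 1):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def importance_and_redundancy(Xb, y):
    """Plug-in estimates of the importance vector (Eq. 3) and redundancy matrix (Eq. 4)
    from binned features Xb of shape (N, n) and discrete labels y."""
    n = Xb.shape[1]
    I = np.array([mutual_info_score(Xb[:, i], y) for i in range(n)])
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            R[i, j] = R[j, i] = mutual_info_score(Xb[:, i], Xb[:, j])
    return I, R  # R_ii = 0, matching the definition above
```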

Formally, Eq. 1 represents a QUBO problem. For this reason, we call our method QUBO feature selection, or QFS for short. A QUBO objective function, Eq. 2, is typically written in a quadratic form

$$ Q(\boldsymbol{x},\alpha) = \boldsymbol{x}^{\intercal} \boldsymbol{Q}(\alpha) \boldsymbol{x} $$
(13)

with a QUBO matrix Q(α). The elements of this matrix read

$$ Q_{ij}(\alpha) = R_{ij}-\alpha (R_{ij} + \delta_{ij} I_{i} ), $$
(14)

where \(\delta_{ij}\) denotes the Kronecker delta. The solution of this QUBO instance, Eq. 1, represents the optimal feature subset. We provide a short review of QUBOs and their solution strategies in Section 3. A complete pipeline of how FS is performed according to our proposed framework is shown in Fig. 1.
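For illustration, a minimal numpy sketch of assembling Q(α) from precomputed R and I according to Eqs. 13 and 14:

```python
import numpy as np

def qubo_matrix(R, I, alpha):
    """Q(alpha) with entries Q_ij = R_ij - alpha * (R_ij + delta_ij * I_i), Eq. 14."""
    return R - alpha * (R + np.diag(I))

def objective(Q, x):
    """Objective value x^T Q(alpha) x of a binary selection vector x, Eq. 13."""
    return float(x @ Q @ x)
```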

Fig. 1
figure 1

Our proposed Quantum Feature Selection pipeline: From a given data set, the importance vector I, Eq. 3, and the redundancy matrix R, Eq. 4, are calculated. They are combined by interpolation with a factor α after the sign of I has been flipped, which results in a QUBO matrix, Eq. 14. The corresponding QUBO problem, Eq. 1, is solved either through quantum computing or classical solvers. The resulting binary solution vector x is a bit mask that indicates the selected features

2.2 Controlling the number of selected features

For FS, it is of particular interest to be able to select a specific number of features. Formally, this can be realized by adding a constraint to Eq. 1 such that the selection of k features is enforced. The resulting constrained optimization problem reads

$$ \boldsymbol{x}^{*} := \underset{\underset{\text{s. t.} ~\|{\boldsymbol{x}\|}_{1}=k}{x_{i}\in\lbrace 0,1\rbrace \forall i\in [n]}} {\arg\min} Q(\boldsymbol{x},\alpha) $$
(15)

and is formally not a QUBO in contrast to Eq. 1. To come back to QUBO form, a straightforward approach is to add a penalty term such as \(\lambda \left (\left ({\sum }_{i}x_{i}\right )-k\right )^{2}\) to Q(x,α), which is only equal to zero for a selection of exactly k features. Here, λ represents a strictly positive penalty factor, commonly called a Lagrange multiplier. The problem can then be solved to obtain the desired solution such that \({\|\boldsymbol {x}^{*}\|}_{1}=k\).
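For concreteness, a sketch of how such a penalty would be folded into a dense QUBO matrix (this is the penalty-based alternative, not part of our method): expanding the square and using \(x_i^2 = x_i\) adds λ to every off-diagonal entry, λ(1 − 2k) to every diagonal entry, and a constant λk² that can be dropped.

```python
import numpy as np

def add_cardinality_penalty(Q, k, lam):
    """Add lam * (sum_i x_i - k)**2 to a QUBO matrix Q (the constant lam * k**2 is dropped)."""
    n = Q.shape[0]
    penalty = lam * (np.ones((n, n)) - np.eye(n)) + lam * (1 - 2 * k) * np.eye(n)
    return Q + penalty
```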

However, two challenges arise from this approach. First, a suitable choice of λ is not clear and depends on the magnitude of Q(x,α). If it is chosen too small, the imposed constraint might be ignored for certain solutions. On the other hand, if it is chosen too large, it may lead to a very large value range and, consequently, to loss of precision. In summary, having both very small and very large elements in the QUBO matrix limits the amount of scaling available to amplify meaningful differences between loss values. Furthermore, it is not clear in the first place whether a feasible solution to the constrained problem can be found at all.

Due to these difficulties, we propose an alternative strategy to specify the number of selected features. Instead of resorting to penalty terms, we use the fact that the choice of α itself can be used to control the number of features present in the solution. This presumption can be motivated by considering the extremal values of α ∈ [0,1]. If we set α = 0, all diagonal entries of Q(0) become 0 and we put full emphasis on redundancy. Trivially, both the empty set of features and any single feature is least redundant, so that the optimal selection is either the empty set or a single feature {i} with i ∈ [n]. Conversely, if α = 1, the problem becomes linear with only negative coefficients, leading to a selection of all features as the optimal solution. Consequently, by iteratively varying α from 0 to 1 in sufficiently small steps, we observe that \(\lVert \boldsymbol {x}^{*}\rVert _{1}\) increases monotonically in steps of one from 0 to n. This observation suggests that we can tune α to obtain a subset of any desired size \(k\in \lbrace 0,1,\dots ,n\rbrace \). As we show with Proposition 1, this assumption is indeed true.

Proposition 1

For all Q(⋅,α) defined as in Eq. 13 and \(k\in \lbrace 0,1,\dots ,n\rbrace \), there is an α ∈ [0,1] such that \(\boldsymbol {x}^{*}\in \arg \min \limits _{\boldsymbol {x}}Q(\boldsymbol {x},\alpha )\) and \({\|\boldsymbol {x}^{*}\|}_{1}=k\).

The proof can be found in Appendix A. This result shows that we do not need additional constraints on the QUBO instance to control the number of features present in the global optimum.

2.3 QFS algorithm

From the result of Proposition 1, we can devise an algorithm that, if provided with R, I and k, returns an α such that an optimal feature subset vector \(\boldsymbol{x}^{*}\) of Q(⋅,α) has exactly k non-zero entries. For this purpose, we introduce

$$ Q^{*}_{k}(\alpha) := \underset{\underset{\text{s. t.} ~{\|\boldsymbol{x}\|}_{1}=k}{\boldsymbol{x}\in\lbrace 0,1\rbrace^{n}}}{\min} Q(\boldsymbol{x},\alpha) $$
(16)

with 0 ≤ k ≤ n, i.e., the minimal function value of Q(x,α), Eq. 13, for a given α when the number of ones in the solution is restricted to k. Furthermore, we denote the minimum with respect to k (i.e., the global minimum) by

$$ Q^{*}(\alpha) := \min_{k} Q^{*}_{k}(\alpha). $$
(17)

Assuming we have an oracle for Q∗(α) that, given any Q(⋅,α), returns a global optimum, an appropriate value for α can be found in \(\mathcal {O}(\log n)\) steps through binary search. The value of α is not necessarily unique.

In addition to the procedure described above, we introduce a threshold 𝜖 ≥ 0, such that Qii(α), Eq. 14, is set to some small positive value μ > 0 whenever αIi < 𝜖, for each i ∈ [n]. That is, we perform the transition Qij(α)↦Qij(α,𝜖,μ) with

$$ Q_{ij}(\alpha,\epsilon, \mu) = \begin{cases} \mu & \text{if }i=j \land \alpha I_{i}<\epsilon \\ Q_{ij}(\alpha) & \text{otherwise} \end{cases} $$
(18)

for arbitrary but constant values of 𝜖 and μ. In our experiments, we observed that, when the importance value of feature i is very close to zero, it has virtually no influence on the function value, and the solvers we used tended to include or exclude it randomly. As we seek to minimize the number of necessary features, we add an artificial weight μ to exclude these features from the optimal solution and avoid randomness. The exact value of μ is not decisive as long as it is positive, which ensures that the respective feature cannot be part of an optimal solution (Glover et al. 2018, Lemma 1.0).

The proposed algorithm is sketched in Algorithm 1. This algorithm is the main contribution of this manuscript.

Algorithm 1
figure d

Our proposed QFS Algorithm: Binary search that, given an integer k, finds a value α, such that the optimal solution of a feature selection QUBO problem with matrix elements Qij(α,𝜖,μ), Eq. 18, contains k features.
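Below is a compact Python sketch of Algorithm 1. The brute-force oracle stands in for small instances only; in our experiments, qbsolv or quantum hardware (Section 3) solves the QUBO instead. The defaults for 𝜖 and μ follow the choices reported in Section 5.

```python
import numpy as np
from itertools import product

def solve_qubo_brute_force(Q):
    """Exhaustive QUBO oracle, feasible only for small n (placeholder for qbsolv, QA, or VQE)."""
    n = Q.shape[0]
    best_x, best_val = None, np.inf
    for bits in product((0, 1), repeat=n):
        x = np.array(bits)
        val = float(x @ Q @ x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x

def qfs(R, I, k, eps=1e-8, mu=None, tol=1e-9, solver=solve_qubo_brute_force):
    """Binary search over alpha until the QUBO optimum selects exactly k features."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        alpha = 0.5 * (lo + hi)
        Q = R - alpha * (R + np.diag(I))                        # Eq. 14
        d = np.diag(Q).copy()
        d[alpha * I < eps] = np.max(Q) if mu is None else mu    # Eq. 18
        Q[np.diag_indices_from(Q)] = d
        x = solver(Q)
        if x.sum() == k:
            return alpha, x
        if x.sum() < k:
            lo = alpha            # too few features: put more weight on importance
        else:
            hi = alpha
    return alpha, x               # tolerance reached; return the last iterate
```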

3 Solving QUBOs

An integral part of our proposed QFS algorithm, Algorithm 1, is the solution of QUBOs, Eq. 1. QUBOs of the form

$$ \underset{\boldsymbol{x}\in\lbrace 0,1\rbrace^{n}}{\min} \boldsymbol{x}^{\intercal} \boldsymbol{Q} \boldsymbol{x} $$
(19)

with symmetric \(\boldsymbol {Q} \in \mathbb {R}^{n \times n}\) as in Eq. 13 are a popular class of optimization problems, known to be NP-hard (Pardalos and Jha 1992). Numerous practical optimization problems have been embedded into QUBO form, ranging from finance and economics (Laughhunn 1970; Hammer and Shlifer 1971) over satisfiability (Kochenberger et al. 2005) to ML tasks such as clustering (Kumar et al. 2018; Bauckhage et al. 2018), vector quantization (Bauckhage et al. 2020), support vector machines (Mücke et al. 2019; Date et al. 2020), and probabilistic graphical models (Mücke et al. 2019), to name a few.

3.1 Classical solvers

Motivated by its relevance for practical problems, a wide range of classical methods to solve QUBOs have been developed, both exactly and approximately. A comprehensive overview of applications and solution methods can be found, e.g., in Kochenberger et al. (2014). Notable among the heuristic approaches are evolutionary algorithms and simulated annealing, for which highly efficient special-purpose hardware has been developed (Mücke et al. 2019; Matsubara et al. 2020) that can quickly find good approximate solutions to QUBO instances with several thousand variables. Another heuristic solver is qbsolv (Booth et al. 2017), which finds approximate solutions of a QUBO instance by iteratively partitioning it into smaller sub-problems, which are solved through a variation of tabu search (Glover and Laguna 1998). The sub-problems are created by “clamping” certain bits of a current solution, i.e., treating them as fixed and only optimizing over the remaining bits. The bits to be optimized are selected according to their impact on the objective function value, i.e., its increase when negating them in the current best solution.
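As a usage illustration, the sketch below converts a symmetric QUBO matrix Q (e.g., from Eq. 14) into the upper-triangular dictionary form consumed by qbsolv and related Ocean tools and solves it approximately; the dwave_qbsolv import and the num_repeats argument reflect our assumptions about that package's interface.

```python
import numpy as np
from dwave_qbsolv import QBSolv  # assumed interface of the classical qbsolv package

def to_qubo_dict(Q):
    """Upper-triangular dictionary {(i, j): coefficient} of a symmetric QUBO matrix,
    so that x^T Q x = sum_{i <= j} dict[(i, j)] * x_i * x_j."""
    n = Q.shape[0]
    return {(i, j): (Q[i, i] if i == j else Q[i, j] + Q[j, i])
            for i in range(n) for j in range(i, n)}

# approximate solution via iterative sub-problem decomposition and tabu search
response = QBSolv().sample_qubo(to_qubo_dict(Q), num_repeats=50)
sample = list(response.samples())[0]                       # best sample as {variable: bit}
x_star = np.array([sample[i] for i in range(Q.shape[0])])
```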

3.2 Quantum solvers

In recent years, quantum computing has opened up a promising approach to solving QUBO instances, which — among its many other applications — makes it interesting for performing FS. For this purpose, any QUBO problem can be encoded in form of a Hamiltonian

$$ \hat{H}={\sum}_{\underset{i \neq j}{i,j=0}}^{n} a_{ij}\hat{\sigma}_{i}\hat{\sigma}_{j} + {\sum}_{i=0}^{n} b_{i}\hat{\sigma}_{i} + c, $$
(20)

where \(\hat {\sigma }_{i}\) denotes a Pauli-z matrix acting on qubit i with eigenvalues ± 1 and corresponding eigenstates |±i〉. The coefficients \(a_{ij},b_{i},c\in \mathbb {R}\) can be found by performing the transformation \(x_{i} \mapsto (1-\hat {\sigma }_{i})/2\) of the objective function in Eq. 19 with \(\hat {\sigma }_{i}^{2}=1\). Since the Pauli spin matrices of different qubits commute, the minimum eigenstate of the Hamiltonian |Ψ〉 with \(\hat {H} |{\Psi }\rangle = E |{\Psi }\rangle \) can be written in terms of |Ψ〉 = |ψ1〉⊗⋯ ⊗|ψn〉, where |ψi〉∈{|+i〉,|−i〉}∀i ∈ [1,n]. Therefore, the eigenvalue \(E = \min \limits _{\boldsymbol {x}\in \lbrace 0,1\rbrace ^{n}} \boldsymbol {x}^{\intercal } \boldsymbol {Q} \boldsymbol {x}\) represents the minimum objective function value of the QUBO. The corresponding solution vector \(\boldsymbol {x}^{*} = \arg \min \limits _{\boldsymbol {x}\in \lbrace 0,1\rbrace ^{n}} \boldsymbol {x}^{\intercal } \boldsymbol {Q} \boldsymbol {x} \) can be obtained from the eigenstate |Ψ〉 with the assignment \(x_{i}^{*} = \vert \langle {-_{i}|{\Psi }\rangle }\vert ^{2}\) based on a projective measurement of each qubit (and is not necessarily unique). Summarized, the QUBO problem can be transformed into the problem of finding the minimum eigenstate (or ground state) |Ψ〉 of \(\hat {H}\).
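The coefficient bookkeeping of this substitution can be sketched in a few lines of numpy; the returned arrays correspond to the couplings a_ij, the fields b_i, and the offset c in Eq. 20.

```python
import numpy as np

def qubo_to_ising(Q):
    """Map x^T Q x with x_i in {0, 1} onto Eq. 20 via x_i -> (1 - sigma_i) / 2.

    Returns couplings a (zero diagonal), local fields b, and constant offset c such that
    x^T Q x = sum_{i != j} a_ij s_i s_j + sum_i b_i s_i + c for spins s_i in {+1, -1}."""
    d = np.diag(Q).copy()
    off = Q - np.diag(d)                                    # off-diagonal part of Q
    a = off / 4.0                                           # coefficients of s_i s_j (i != j)
    b = -d / 2.0 - (off.sum(axis=0) + off.sum(axis=1)) / 4.0
    c = d.sum() / 2.0 + off.sum() / 4.0
    return a, b, c
```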

However, finding the ground state of a Hamiltonian is in general also a non-trivial problem, and possible solution strategies depend on the properties of the quantum hardware and the shape of \(\hat {H}\). Two common approaches are the variational quantum eigensolver (VQE) (Peruzzo et al. 2014), which is suitable for quantum gate computers, and quantum annealing (QA) (Kadowaki and Nishimori 1998; Morita and Nishimori 2008), which is suitable for quantum annealers. VQE uses a hybrid quantum-classical computational approach (McClean et al. 2016) to minimize the expected energy \(\langle {\Psi (\boldsymbol {\theta })}|\hat {H}|{\Psi (\boldsymbol {\theta })}\rangle \). For this purpose, a parametric ansatz \(\hat {U}(\boldsymbol {\theta })\) is prepared on a quantum gate computer such that \(|{\Psi (\boldsymbol {\theta })}\rangle = \hat {U}(\boldsymbol {\theta })|{0}\rangle \). The circuit parameters 𝜃 are learned with a classical optimization in order to find an estimate for the ground state |Ψ〉≈|Ψ(𝜃)〉. In contrast, QA exploits the adiabatic theorem (Farhi et al. 2000) by preparing the ground state |Ψ0〉 of a simple mixing Hamiltonian \(\hat {H}_{0}\) and then slowly transferring the system into the ground state |Ψ〉 of the target Hamiltonian \(\hat {H}\) through adiabatic time evolution (Gruber 1999). A hybrid approach can be realized by splitting the initial QUBO into smaller sub-problems and solving each with QA, e.g., by using the qbsolv strategy explained above.

Since both algorithms are heuristic, typically multiple samples (or “shots”) of the solution are obtained from repeated measurements.

4 Related research

There are numerous approaches of defining optimal feature subsets and finding or approximating them (Chandrashekar and Sahin 2014).

Wrapper methods directly use the performance of classification or regression models as a criterion for selecting features (John et al. 1994). As the model must be re-trained on every candidate subset, this method is very resource-intensive. Finding the optimal subset requires brute-forcing all \(2^{n}\) possible subsets, which is generally intractable for large feature sets and non-trivial models. Instead, heuristic optimization schemes are often used, such as greedy search or evolutionary algorithms (Leardi et al. 1992; Siedlecki and Sklansky 1993).

Filter methods use a measure of relevance, e.g., correlation or MI with the label or target variable, for ranking features and discarding those below a certain threshold. While those methods are very easy to compute and work well in practice, they often do not consider redundancy between selected features, leading to subsets that could potentially be much smaller. When considering redundancy, such as pairwise MI between selected features, as an additional criterion to be minimized, the optimization problem becomes non-linear and NP-hard.

In Rodriguez-Lujan et al. (2010), this problem is formulated as a quadratic programming (QP) task, which is solved approximately by means of dimension reduction. The resulting solution is a real-valued weight vector that is used for ranking the features. This last step is purely heuristic, as the QP task is merely a relaxation of the corresponding QUBO problem with binary weight variables which we discuss in this article. QUBOs become prospectively tractable through quantum computation, which is why we do not resort to approximations or relaxations in our approach to make our method feasible.

A QUBO formulation based on redundancy and importance has recently appeared in the literature (Tanahashi et al. 2018; Otgonbaatar and Datcu 2021); however, the authors rely on a penalty term with Lagrange multiplier λ for restricting the solution to k features. The choice of λ is not trivial and can drastically increase the dynamic range of the QUBO coefficients if chosen too large, as discussed above. In our approach, we observe that we can weigh redundancy and importance against each other in order to control the number of features present in the optimal solution, rendering any additional constraints obsolete.

Another approach relies on quantum gate computing to apply Grover's algorithm to an oracle that yields the improvement in accuracy by adding or removing single features (He et al. 2018). This is merely a quantum version of a simple sequential wrapper approach, which improves the theoretical runtime for greedily selecting the next feature to be inserted or removed, as Grover-based search finds the minimum or maximum of a function over n candidates with \(\mathcal{O}(\sqrt{n})\) oracle queries, a quadratic speedup over classical search. The solution quality is the same as for a classical greedy algorithm, or potentially worse, as Grover's algorithm is probabilistic.

5 Experiments

In Section 2, we have presented our novel QFS algorithm, Algorithm 1. The current section contains a study of four different experiments to evaluate the performance of QFS.

For this purpose, we use six data sets, both synthetic and taken from real-world data sources. All data sets are listed in Table 1, and a detailed description can be found in Appendix B. For discretizing the data as described in Section 2.1, we choose B = 20. Furthermore, we set 𝜖 = 10− 8 and \(\mu =\max \limits _{i,j\in [n]}Q_{ij}(\alpha )\) in Algorithm 1. These parameters are determined empirically by careful testing.

Table 1 Data sets used for our numerical experiments. For each data set, we list the number of features n, the number of classes c, and the number of samples S. A detailed description can be found in Appendix B

Our first experiment in Section 5.1 serves as a proof of principle, in which we evaluate whether QFS is able to find any useful features at all. This is realized by selecting 30 features from the mnist data set through QFS and training separate 1-vs-all classifiers (on all digits). To evaluate the usefulness of the selected features, we compare the performance of these classifiers to those trained on 30 random features and all available features, respectively. The second experiment in Section 5.2 is a wider empirical comparison of various combinations of commonly used FS methods and ML models with the goal of showing that QFS is competitive. In Section 5.3, we apply QFS in a more application-oriented setting by using the selected features as a means of lossy data compression, interpreting the reduced feature space as a latent representation and training a convolutional neural network to reconstruct the original features. Finally, we use quantum hardware in Section 5.4 to solve two exemplary feature selection QUBO instances for QFS. This experiment demonstrates that our method can indeed be used with current NISQ devices.

5.1 Experiment 1: feature quality

The first experiment serves as a proof of principle, in which we verify that QFS is able to find informative features (Fig. 2).

Fig. 2
figure 2

Experiment 1: feature subsets found through Quantum Feature Selection on all separate digits of the mnist data set. The black pixels represent the selected features. The digits are ordered from left to right, starting with 0 on the left

5.1.1 Setup

For this purpose, we consider the mnist data set and run Algorithm 1 ten times with k = 30 such that we use about 3.8% of the original 784 features (i.e., pixels). To solve the QUBO instances classically, we use qbsolv (Booth et al. 2017). As redundancy, we use the pairwise MI matrix over all features, Eq. 4. As importance for digit d, we calculate the MI between the features xi and a binary variable indicating whether a sample's label equals d, Eq. 3. This yields ten feature selections, each tailored towards one digit, which we use to perform a 1-vs-all classification. To quantify whether the selected features from our method are informative or not, we train a Random Forest classifier on these features and determine its accuracy. For comparison, we also determine the accuracy of a Random Forest classifier that has been trained on the whole feature set and a set of randomly selected features (uniformly sampled from the set of k-element subsets of [n] for each digit), respectively. Specifically, the Random Forest is composed of 100 estimators, each a Decision Tree of maximal depth five and a maximum of five features considered when searching for the best split. These restrictions serve to limit the model size, as is a common objective in applications which require FS. We use the Python implementation provided by scikit-learn (Pedregosa et al. 2011).
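A sketch of this evaluation step with the stated hyperparameters; the names X, y, digit, and qfs_mask are placeholders for the loaded mnist data and the mask returned by Algorithm 1.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: (N, 784) pixel values, y: (N,) digit labels as numpy arrays (assumed),
# qfs_mask: boolean mask of the 30 features selected for the current digit.
clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=5)
labels = (y == digit).astype(int)                  # 1-vs-all target for the current digit
scores = cross_val_score(clf, X[:, qfs_mask], labels, cv=10)
print(f"digit {digit}: {scores.mean():.4f} +/- {scores.std():.4f}")
```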

5.1.2 Results

The selected features (pixels) per digit are shown in Fig. 2. We perform 10-fold cross validation and report mean and standard deviation of the classification accuracy, which is visualized in Fig. 3. For “Random subsets,” we report the cross-validated accuracy averaged over five random subsets, so mean and standard deviation are reported over 50 runs in total.

Fig. 3
figure 3

Experiment 1: 10-fold cross-validated accuracy of binary Random Forest classifiers of each separate digit of the mnist data set. The classifiers are trained on (1) all 784 features, (2) 30 randomly sampled features, and (3) 30 features selected by Quantum Feature Selection. The accuracy (10-fold CV) using random features is averaged over 5 different random subsets. The standard deviation (one sigma) is represented by the smaller error bars on top of the bars, which show the mean

We observe that the Random Forest model with the constrained estimators achieves the best results on all digits when trained on the features found through QFS. This confirms that our method indeed finds informative features that are useful for classification. Interestingly, the models trained on the QFS subsets not only outperform the models trained on random subsets, but also those using all features. This is probably due to the restricted number of features per split in the base estimators, which leads to a higher chance of picking non-informative features when using all available pixel values.

5.2 Experiment 2: cross-model comparison with FS methods

In the second experiment, we contrast our method of QFS to other feature selection methods by comparing the accuracy of various classification models trained on the respective feature subsets in analogy to the previous experiment from Section 5.1.

5.2.1 Setup

Specifically, we apply Algorithm 1 to five different data sets to obtain feature subsets of fixed sizes k. We consider 30 features (3.8%) for mnist, five features (14.7%) for ionosphere, 20 features (4%) for madelon, 20 features (5%) for synth_100, and 5 features (23.8%) for waveform. For madelon and synth_100, we use the known number of informative features, while for the remaining data sets we chose the feature subset sizes arbitrarily in varying percentage ranges of the total number of features.

To evaluate the quality of the selected features, we compare QFS to three other heuristic FS methods:

  1. 1.

    Ranking obtained from the Euclidean norm over the coefficients of n 1-vs-all Logistic Regression (LR) models.

  2. 2.

    Ranking obtained from the impurity-based feature importances given by an Extra Trees Classifier (ET) with 100 estimators.

  3. 3.

    Recursive Feature Elimination (RFE) (Guyon et al. 2002) performed on a Decision Tree of maximal depth 10.

The resulting feature subsets are then used to train five classification models:

  1. 1.

    Neural Network

  2. 2.

    1-vs-all Logistic Regression

  3. 3.

    Decision Tree

  4. 4.

    Random Forest (of 100 decision trees)

  5. 5.

    Naive Bayes classifier (with Gaussian prior)

The Neural Network has a single hidden layer containing \(\lfloor \sqrt {k}+0.5\rfloor \) neurons to make the dependency between the number of parameters and selected features k linear. For the activation function, we use ReLU. Both the Neural Network and the Logistic Regression are limited to 1000 learning iterations. Again, we use Python implementations provided by scikit-learn (Pedregosa et al. 2011) for all models and FS methods.
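For reference, a sketch of the three baseline selection methods with the hyperparameters stated above (remaining settings are scikit-learn defaults; carrying the 1000-iteration limit over to the ranking Logistic Regression and the helper function itself are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE

def baseline_selections(X, y, k):
    """Return the indices of the k features chosen by each baseline method."""
    # (1) ranking by the Euclidean norm of the 1-vs-all Logistic Regression coefficients
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    lr_idx = np.argsort(-np.linalg.norm(lr.coef_, axis=0))[:k]
    # (2) ranking by the impurity-based importances of an Extra Trees classifier
    et = ExtraTreesClassifier(n_estimators=100).fit(X, y)
    et_idx = np.argsort(-et.feature_importances_)[:k]
    # (3) Recursive Feature Elimination wrapped around a depth-10 Decision Tree
    rfe = RFE(DecisionTreeClassifier(max_depth=10), n_features_to_select=k).fit(X, y)
    return lr_idx, et_idx, np.flatnonzero(rfe.support_)
```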

On every model, we perform a 10-fold cross validation and report mean and standard deviation of the classification accuracy.

5.2.2 Results

The results are visualized in Fig. 4. The results of this experiment show that QFS, in general, compares favorably among FS methods. The RFE method using decision trees often leads to better accuracies, but comes at the cost of much higher computation time, owing to the fact that it is a wrapper method which requires the model to be re-trained in every iteration.

Fig. 4
figure 4

Experiment 2: 10-fold cross-validated accuracy of various classifiers using feature subsets produced by various feature selection methods on three different data sets. The standard deviation (one sigma) is represented by the smaller bars on top of the bars showing the mean

For the data sets madelon and synth_100, we know the ground-truth informative features, which allows us to evaluate the distance between the optimal feature subset and the subset found by each method. To this end, we use the edit distance between pairs of feature subsets, which is the number of features that need to be swapped in order to turn one subset into the other. Alternatively, it is the Hamming distance between the binary feature indicator vectors, divided by two. We represent each FS method used in this experiment as a node in an undirected graph and each edit distance between the feature vectors they produced as a weighted edge. Nodes of distance 0 are represented as cluster nodes. The resulting graphs are shown in Fig. 5.
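A minimal sketch of this distance for two boolean feature masks of equal cardinality:

```python
import numpy as np

def edit_distance(mask_a, mask_b):
    """Number of features to swap to turn one selection into the other,
    i.e., half the Hamming distance between the binary indicator vectors."""
    return int(np.sum(mask_a != mask_b)) // 2
```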

Fig. 5
figure 5

Experiment 2: edit distances between feature subsets found through the different feature selection methods on the data sets (a) madelon and (b) synth_100 for which the ground-truth informative features are known. Each node represents a FS method, and each edge weight is the edit distance, i.e., the number of features that need to be swapped in order to convert one selection into the other. We show the ground truth (GT) as well as Logistic Regression ranking (LR), Extra Tree classifier ranking (ET), Recursive Feature Elimination (RFE), and our method (Quantum Feature Selection). Methods within the dotted circles yield the same feature subset

QFS is able to find all informative features for synth_100, and is closest to ground truth on madelon among all other methods except for the Extra Trees classifier ranking, which is able to find the ground-truth features in both cases. This result shows that MI, when used as measure of redundancy and importance in QFS, produces useful feature subsets on both data sets where ground truth is known. Moreover, FS on these data sets seems to work very well with the Extra Trees classifier ranking method, which indicates that impurity-based measures are particularly effective here.

5.3 Experiment 3: application: lossy compression with autoencoder

The previous experiments have confirmed that QFS picks features of a data set which are important and carry little redundancy according to our criteria of choice. From a different perspective, this can be interpreted as removing unimportant and redundant features, which leads to much smaller data sets. Consequently, QFS can be used as a type of lossy compression by computing an optimal feature subset S and discarding all features i ∉ S. From this compressed representation, we can reconstruct the original data points approximately by means of some ML model (Kingma and Welling 2013). This process is shown schematically in Fig. 6. In this third experiment, we evaluate the lossy compression empirically.

Fig. 6
figure 6

Experiment 3: compression and decompression pipeline using Quantum Feature Selection with k = 25 and a convolutional decoder. The features are extracted by applying the feature selection mask to the image, which yields the compressed representation (compression rate of 25/784 ≈ 3.19%). Then, the decoder is trained by minimizing the mean squared error between the reconstruction and the original image. The feature mask shown above is the actual feature mask used in the experiment

5.3.1 Setup

To realize this experiment, we use Algorithm 1 to perform QFS on the mnist data set with k = 25 (i.e., 3.19% of all features). The resulting subset S of pixel positions is interpreted as a latent space, i.e., a compressed data representation like one we would obtain from principal component analysis (PCA) or an autoencoder. While PCA is based on an eigendecomposition that is used to project data onto axes of maximal variance, the subspace S induced by the feature selection fulfills other optimality criteria based on redundancy and importance.

By feeding this latent representation into a convolutional neural network (CNN), we can reconstruct the original images by projecting back to a size of 28 × 28 and minimizing the difference between reconstruction and original. To this end, our CNN architecture consists of a linear input layer with k inputs and 392 outputs. The output is reshaped to 8 channels of size 7 × 7. Using two sequential 2D transposed convolution operations interspersed with ReLU activations, the images are first inflated to 16 × 14 × 14 and finally to 1 × 28 × 28. A sigmoid is applied to the output in order to obtain pixel values between 0 and 1. The model weights are trained by minimizing the mean squared error (MSE) between the original samples x and the reconstructions,

$$ \frac{1}{784}\|{\boldsymbol{x}-f_{\boldsymbol{\theta}}(\boldsymbol{x}_{S})}\|_{2}^{2} ~\rightarrow ~\min_{\boldsymbol{\theta}}, $$
(21)

where f𝜃 is the model function with weights 𝜃, and xS the sub-vector of x containing only the features in S found through QFS. We train the model for 1000 epochs with batches of size 250, using the Adam optimizer (Kingma and Ba 2014) from PyTorch (Paszke et al. 2019) with a 1cycle learning rate scheduler (Smith and Topin 2018) with maximum learning rate 0.01.
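A sketch of this decoder and its training setup in PyTorch; the kernel size, stride, and padding of the transposed convolutions as well as the steps_per_epoch value (training-set size divided by the batch size) are assumptions chosen to reproduce the stated tensor shapes:

```python
import torch
from torch import nn

class Decoder(nn.Module):
    def __init__(self, k=25):
        super().__init__()
        self.fc = nn.Linear(k, 392)                                          # k -> 392 = 8 * 7 * 7
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(8, 16, kernel_size=4, stride=2, padding=1),   # 8x7x7 -> 16x14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),   # 16x14x14 -> 1x28x28
            nn.Sigmoid(),                                                    # pixel values in [0, 1]
        )

    def forward(self, x_s):                      # x_s: (batch, k) selected pixel values
        h = self.fc(x_s).view(-1, 8, 7, 7)
        return self.deconv(h)

model = Decoder(k=25)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=1000, steps_per_epoch=240)  # assumes 60000 samples / batch size 250
```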

5.3.2 Results

The CNN achieves a MSE of 21.7389, which corresponds to an average squared pixel deviation of 0.0277. Figure 7 shows 20 random mnist samples on the left, and their respective reconstructions using the procedure described above.

Fig. 7
figure 7

Experiment 3: visual comparison of mnist samples reconstructed from 25 pixels at fixed positions selected through Quantum Feature Selection. The reconstruction is implemented by a convolutional neural network. The original mnist samples are shown in (a), and their corresponding reconstructions in (b)

The reconstructed samples are visually very similar to the originals, which suggests that the pixel positions found through QFS carry useful information about the samples that helps to deduce the values of nearby pixels.

5.4 Experiment 4: QFS on quantum hardware

So far, we have only used classical QUBO solvers to perform QFS. In this final experiment, we consider QFS with actual quantum hardware as a QUBO solver, using both a quantum annealer and a quantum gate computer.

5.4.1 Setup

Based on QFS, we construct QUBO instances for three data sets, ionosphere, waveform, and synth_10 as before, but solve them using QA on a D-Wave quantum annealer and VQE on an IBM gate quantum computer as described in Section 3.2.

To conduct this experiment, we first perform an entirely classical run of Algorithm 1 to obtain values for α for a predefined number of selected features k. In principle, this search for α can be done on quantum hardware as well. However, we can reduce the quantum computation time significantly by pre-computing α classically. This is crucial since access to the quantum hardware is time-consuming, especially for the IBM gate quantum computer. For this reason, we also only consider the synth_10 data set for the VQE algorithm. The resulting classically obtained values of α are listed in Table 2 together with the chosen number of selected features k. Using these values, we assemble a QUBO instance for each data set and solve these QUBO instances using both hardware approaches.

Table 2 Experiment 4: For each of the three data sets of interest, we list the number of features n, the chosen number of selected features k, and the resulting value of α from a classical evaluation of Algorithm 1 used for the quantum experiment

The D-Wave quantum annealer Advantage 5.1 is accessed via D-Wave’s cloud service Leap. It operates on 5627 qubits (Update 2021). We use the QA implementation provided by ocean with the DWaveSampler and default parameters. We evaluate the QUBO instances for all three data sets. In total, we obtain 1024 samples per QUBO instance, each representing an estimate of the solution. Additionally, we perform the same experiment using simulated annealing, again using D-Wave’s Python implementation contained in ocean with default parameters.
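A hedged sketch of the corresponding sampling calls with D-Wave's Ocean tools; wrapping the DWaveSampler in an EmbeddingComposite for minor-embedding and using the neal package for simulated annealing are our assumptions, since the text above only fixes the samplers and the number of reads.

```python
import neal
from dwave.system import DWaveSampler, EmbeddingComposite

# Q_dict: the QUBO in dictionary form {(i, j): coefficient} for the pre-computed alpha.
qa_sampler = EmbeddingComposite(DWaveSampler())            # minor-embedding onto the annealer graph
qa_result = qa_sampler.sample_qubo(Q_dict, num_reads=1024)

sa_sampler = neal.SimulatedAnnealingSampler()              # classical simulated annealing baseline
sa_result = sa_sampler.sample_qubo(Q_dict, num_reads=1024)

best = qa_result.first.sample                              # lowest-energy sample as {variable: bit}
```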

The IBM quantum gate computer ibmq_ehningen, on the other hand, operates on a Falcon r5.11 processor with 27 qubits. It is accessible via IBM’s cloud service IBM Quantum. We use Qiskit Runtime’s default VQE implementation (ANIS et al. 2021) with the simultaneous perturbation stochastic approximation (SPSA) optimizer (Spall 1998). We let the optimizer run for 32 iterations with Qiskit’s default parameters. As our parametric ansatz, we chose a Pauli two-design (Nakata et al. 2017) with four layers. Since the computations on the quantum gate hardware are time-consuming, we only evaluate the QUBO instances for one of the three data sets, the synth_10 data set. Again, we obtain 1024 samples from the resulting circuit.

5.4.2 Results

The results for QA are shown in Fig. 8. On the left, we show all 1024 samples sorted in ascending order of energy, such that the lowest measured energies are on the left. Mean and standard deviation are reported over the sorted sequences of 16 runs. The globally optimal energies, which we found by brute force, are shown as horizontal lines. On the right, we show for each solution bit (corresponding to feature indices for QFS) the number of times the corresponding bit was measured as 1 across all shots. Again, we report mean and standard deviation over 16 runs. The color of each bar indicates whether the corresponding global optimum of the respective bit is 0 or 1. The sequence of optimal bits corresponds to the optimal feature selection for our application. Figure 9 is analogous to Fig. 8, showing the results obtained through simulated annealing.

Fig. 8
figure 8

Experiment 4: histograms of samples from the D-Wave quantum annealer performing Quantum Feature Selection on the data sets (a) synth_10, (b) waveform and (c) ionosphere described in Table 1. For each run, 1024 samples were performed and the energy and occurrences of all solutions recorded. Left: energy values of all shots sorted in ascending order, reported as mean and standard deviation over 16 runs. The globally optimal energies (found by brute force) are shown as horizontal lines. Right: index of each solution bit (corresponding to feature index for Quantum Feature Selection) versus the number of times the corresponding bit was measured to be 1 over all shots, again reported as mean and standard deviation over all 16 runs. The color of each bar indicates whether the corresponding global optimum of the respective bit is 0 or 1. This global optimum represents the optimal selection of features as given by Quantum Feature Selection

Fig. 9
figure 9

Experiment 4: same results as in Fig. 8, but using simulated annealing instead of the D-Wave machine. Optimal bits are found more frequently compared to Quantum Annealing, which is probably due to additional noise introduced by the quantum hardware device

The histograms show that there is a clear correspondence between feature optimality (bar color) and the number of occurrences, which indicates that QA is able to find the global optimum in a certain fraction of samples. With increasing number of qubits, this correlation gets noticeably less pronounced. To be precise, the optimum was found in 10.78 ± 5.12% of samples for synth_10, in 0.18 ± 0.18% for waveform, and only a single time across all 16 runs for ionosphere. In contrast, simulated annealing finds the correct bits with higher probability, even for data of higher dimension: The optima are found in 100.00 ± 0.00% of shots for synth_10, 20.39 ± 1.40% for waveform, and 21.04 ± 1.02% for ionosphere. This result indicates that the use of NISQ hardware compromises the solution quality, possibly due to loss of precision when loading the QUBO parameters onto the quantum annealer, or additional noise introduced by read-out errors.

The result for VQE is shown in Fig. 10 in analogy to Fig. 8. Mean and standard deviation are obtained from 5 runs. We find no clear correspondence between globally optimal bits and number of occurrences. The global optimum was only found in 0.14 ± 0.08% of measurements. We assume that hardware noise, as well as the low number of optimization steps that were used due to long run times, lead to this performance. In particular, we expect that VQE could perform better for longer run times and less noisy hardware. It is important to understand that in contrast to the D-Wave results, where each shot represents one approximation run to find the underlying QFS optimum, the 1024 VQE samples were all taken from one optimized circuit.

Fig. 10
figure 10

Experiment 4: histograms of samples from the IBM quantum gate computer performing Quantum Feature Selection on the synth_10 data set in analogy to Fig. 8. Measurements of all qubits from the resulting circuit are treated in the same way as the samples from the D-Wave quantum annealer. While the optimal energy (horizontal line) is reached for some samples, the overall probability to observe this global optimum is significantly lower compared with the quantum annealer. This may be due to insufficient convergence or hardware noise

In summary, we find that near-term quantum devices can in principle be used for QFS, especially for low data dimensions and on special-purpose hardware like quantum annealers that are designed to solve QUBO problems.

6 Conclusion

In this article, we present a novel algorithm for performing feature selection based on a generalized QUBO embedding, which can be solved on both classical and quantum hardware. To this end, we use mutual information as a basis for measures of importance and redundancy, which we balance using an interpolation factor α. We show theoretically and empirically that our method allows the selection of feature subsets of any desired size k without resorting to constraints on the solution space.

To demonstrate our framework’s effectiveness, we have performed a range of experiments, comparing common feature selection methods and the resulting performance of different ML models. Furthermore, we have also realized a practical application for lossy data compression.

One of our experiments has been run on actual quantum hardware, which further demonstrates that our algorithm is viable and NISQ-compatible. Our experiments are conducted on rather low-dimensional problems; however, this experimental setup is dictated by the currently available hardware, and we expect our algorithm to scale in accordance with future quantum computing developments.

Our framework can easily be modified by changing measures of importance and redundancy. Other choices instead of MI include entropy, Pearson correlation, and other statistical or information-theoretic measures. Even expert knowledge can be incorporated, if available. As Proposition 1 is valid for general I and R with non-negative entries, the proof holds for any combination of importance and redundancy measures, and Algorithm 1 can be applied accordingly.