1 Introduction

Machine learning (ML) models excel in a wide range of data analytics tasks, such as classification, regression, and data generation. However, most model families scale with the number of input data dimensions, i.e., with the number of features. That is, a model with more features requires more memory and more computational effort for its training. Small and fast models are key for applications with tight resource constraints, e.g., in embedded systems. Therefore, it is a common goal in ML pipelines to reduce the number of features with minimum information loss by a suitable transformation from raw data to training data, a strategy also known as dimensionality reduction (Van Der Maaten et al. 2009).

An important example of such a strategy is feature selection (FS), where the input dimension is reduced by selecting only a subset of all available features without performing additional transformations (Chandrashekar and Sahin 2014). This approach is especially effective when dealing with data from sources that produce many redundant or irrelevant features, which can be eliminated without significantly impacting the output quality. Consider, for example, trying to diagnose a specific disease from a vast array of medical measurements such as body temperature, concentration of various substances in a patient’s blood, or heart rate. Reducing the number of features that are necessary for this diagnosis not only allows for smaller models but might even help experts pin down what causes the disease. Consequently, FS can be seen as a tool to both reduce model complexity and improve ML interpretability.

The main contribution of this paper consists of two parts. First, we propose a novel FS algorithm for selecting a specific number of features using a quadratic unconstrained binary optimization formulation that can be applied to any combination of redundancy and importance measures. Second, we benchmark our proposed algorithm on different data sets using both classical and actual quantum hardware to demonstrate its effectiveness.

The remaining paper is organized as follows. First, in Section 2, we introduce our FS algorithm. Subsequently, we discuss in Section 3 how unconstrained binary optimization problems, on which our FS algorithm is based, can be solved. We relate our contribution to previous research results in Section 4. In Section 5, we perform and evaluate a series of numerical experiments. Finally, we close with a conclusion in Section 6.

2 Method

In the present section, we describe our proposed FS algorithm using a quadratic unconstrained binary optimization (QUBO) problem, which can be solved either by classical methods or with quantum computing. We start with a general problem definition and subsequently prove that the number of features can be selected by tuning a user-defined weighting parameter. Based on these prerequisites, we present our FS algorithm.

2.1 QUBO feature selection

Presume a classification task on a data set \(\mathcal {D}:=\lbrace (\boldsymbol {x}^{i},y^{i})\rbrace _{i\in [N]}\) with n-dimensional features \(\boldsymbol {x}^{i} \in \mathcal {X} \subseteq \mathbb {R}^{n}\) and class labels \(y^{i} \in \mathcal {Y} \subseteq \mathbb {N}\) for all i ∈ [N], where [N] denotes the set \(\lbrace 1,\dots ,N\rbrace \). The problem of FS corresponds to finding a subset S ⊂ [n] of these n features, such that the reduced data set \(\mathcal {D}_{S}=\lbrace ({\boldsymbol {x}^{i}_{S}},y^{i})\rbrace _{i\in [N]}\) with \(\boldsymbol{x}_{S} := (x_{j})_{j\in S}\) leads to comparable performance to the original data for some data-driven task, such as classification. Typically, this subset is found by solving a suitably posed optimization problem, which can also explicitly depend on the classification model.

We propose a model-independent formulation

$$ \boldsymbol{x}^{*} := \underset{\boldsymbol{x}\in\lbrace 0,1\rbrace^{n}}{\arg\min} Q(\boldsymbol{x},\alpha) $$
(1)

to obtain the selected features \(\boldsymbol {x}^{*} \in \mathcal {X}^{*} \subset \mathcal {X}\). We represent the subset S as a binary indicator vector \(\boldsymbol {x}:=(x_{1},\dots ,x_{n})\in \lbrace 0,1\rbrace ^{n}\), such that i ∈ S if and only if xi = 1 for all i ∈ [n]. The objective function reads

$$ Q(\boldsymbol{x},\alpha) := - \alpha \sum\limits_{i=1}^{n} I_{i} x_{i} + (1-\alpha) \sum\limits_{i,j=1}^{n} R_{ij} x_{i} x_{j}, $$
(2)

where the user-defined parameter α ∈ [0,1] balances the influence of the two terms, which we specify in the following as importance term and redundancy term. The importance term contains the elements

$$ I_{i} := I(x_{i};y) \geq 0 $$
(3)

of the importance vector \(\boldsymbol {I}\in \mathbb {R}_{0+}^{n}\). The importance vector represents the mutual information I(xi;y) of the individual features \(x_{1},\dots ,x_{n}\) with the class label y and is therefore a measure for the importance of each feature. In our objective function, the importance is maximized. Furthermore, the redundancy term contains the elements

$$ R_{ij} := I(x_{i};x_{j}) \geq 0 $$
(4)

of the pairwise redundancy matrix \(\boldsymbol {R}\in \mathbb {R}^{n \times n}\), which by definition is symmetric and positive semidefinite. This matrix represents the mutual information (MI) I(xi;xj) among the individual features and therefore measures their redundancy. For i = j, we set Rii = 0, since a feature is not redundant by itself. In our objective function, the redundancy is minimized.

The calculation of mutual information requires explicit knowledge about the joint probability mass function of features and labels and the corresponding marginals, which are in general difficult to estimate empirically for real-valued data. Therefore, we map all available feature values from the data set \(\mathcal {D}\) into B discrete bins. Specifically, for each separate feature dimension i, we take all B + 1 \(\ell/B\)-quantiles for \(\ell \in \lbrace 0,\dots ,B\rbrace \), which we denote by \(q^{\ell }_{i}\). With these, we define bins \({\mathscr{B}}^{\ell }_{i}\) as intervals \([q^{\ell -1}_{i},q^{\ell }_{i})\) for \(\ell\) ∈ [B − 1], and \({{\mathscr{B}}^{B}_{i}}=[q^{B-1}_{i},{q^{B}_{i}}]\). Finally, we set \({b_{i}^{j}}:=\ell \) for the single \(\ell\) that fulfills \({x_{i}^{j}}\in {\mathscr{B}}^{\ell }_{i}\). Since the labels are discrete by definition, no binning is necessary in \(\mathcal {Y}\). This way, we obtain a discretized data set \(\hat {\mathcal {D}}=\lbrace (\boldsymbol {b}^{i},y^{i})\rbrace _{i\in [N]}\) with \({b^{j}_{i}}\in [B]\) for all j ∈ [N] and i ∈ [n].
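A minimal numpy sketch of this binning step (ties between quantiles are not treated specially; the function name is ours for illustration):

```python
import numpy as np

def discretize(X, B=20):
    """Quantile binning of real-valued features: returns integer bin indices in [1, B]."""
    N, n = X.shape
    Xb = np.empty((N, n), dtype=int)
    for i in range(n):
        # the B + 1 quantiles q_i^0, ..., q_i^B serve as bin edges
        edges = np.quantile(X[:, i], np.linspace(0.0, 1.0, B + 1))
        # interior edges split the value range into B bins; the maximum falls into bin B
        Xb[:, i] = np.digitize(X[:, i], edges[1:-1], right=False) + 1
    return Xb
```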

The empirical probability mass function after discretization reads

$$ \hat{p}(\boldsymbol{b},y) := \frac{1}{N}\sum\limits_{j\in[N]} \mathbb{1}\left[\boldsymbol{b}^{j}=\boldsymbol{b} \wedge y^{j}=y\right] $$
(5)

with the indicator function

$$ \mathbb{1}\left[P\right] := \begin{cases} 1 & \text{if } P \text{ is true} \\ 0 & \text{otherwise} \end{cases} $$
(6)

defined for logical statements P. Consequently, we can approximate the mutual information terms in Eqs. 3 and 4 as

$$ I(x_{i};y) \approx \sum\limits_{b\in[B]}\sum\limits_{y\in\mathcal{Y}}\hat{p}_{X_{i},Y}(b,y)\log\left( \frac{\hat{p}_{X_{i},Y}(b,y)}{\hat{p}_{X_{i}}(b)\hat{p}_{Y}(y)}\right) $$
(7)

and

$$ I(x_{i};x_{j}) \approx \sum\limits_{b\in[B]} \sum\limits_{b^{\prime}\in[B]}\hat{p}_{X_{i},X_{j}}(b,b^{\prime})\log\left( \frac{\hat{p}_{X_{i},X_{j}}(b,b^{\prime})}{\hat{p}_{X_{i}}(b)\hat{p}_{X_{j}}(b^{\prime})}\right), $$
(8)

respectively, where we make use of the marginals

$$ \begin{array}{@{}rcl@{}} \hat{p}_{X_{i},X_{j}}(b_{i},b_{j}) := \sum\limits_{y\in\mathcal{Y}, b_{k}\in[B] \forall k \neq i,j} \hat{p}(\boldsymbol{b},y), \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} \hat{p}_{X_{i},Y}(b_{i},y) := \sum\limits_{b_{k}\in[B]\forall k \neq i} \hat{p}(\boldsymbol{b},y), \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} \hat{p}_{X_{i}}(b_{i}) := \sum\limits_{y\in\mathcal{Y}, b_{k}\in[B] \forall k \neq i} \hat{p}(\boldsymbol{b},y), \end{array} $$
(11)

and

$$ \begin{array}{@{}rcl@{}} \hat{p}_{Y}(y) := \sum\limits_{b_{k}\in[B] \forall k} \hat{p}(\boldsymbol{b},y) \end{array} $$
(12)

for the probability mass functions of feature subsets and labels. Discretization allows us to approximate the MI values while greatly simplifying the estimation procedure, since no assumption on the underlying probability distribution of the data is required. Moreover, the estimation is consistent, i.e., when the number of bins approaches infinity, we will recover the true MI between continuous features (Mandros et al. 2020, Theorem 3.2). To simplify our notation, we omit the dependence of Rij and Ii on B.
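Given the binned data, I and R can be estimated with a plug-in estimator; a sketch using scikit-learn's mutual_info_score, which uses the natural logarithm (the constant factor does not change the minimizer of Eq. 1):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def importance_and_redundancy(Xb, y):
    """Plug-in estimates of the importance vector (Eq. 3) and redundancy matrix (Eq. 4)
    from binned features Xb of shape (N, n) and discrete labels y."""
    n = Xb.shape[1]
    I = np.array([mutual_info_score(Xb[:, i], y) for i in range(n)])
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            R[i, j] = R[j, i] = mutual_info_score(Xb[:, i], Xb[:, j])
    return I, R  # R_ii = 0, matching the definition above
```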

Formally, Eq. 1 represents a QUBO problem. For this reason, we call our method QUBO feature selection, or QFS for short. A QUBO objective function, Eq. 2, is typically written in a quadratic form

$$ Q(\boldsymbol{x},\alpha) = \boldsymbol{x}^{\intercal} \boldsymbol{Q}(\alpha) \boldsymbol{x} $$
(13)

with a QUBO matrix Q(α). The elements of this matrix read

$$ Q_{ij}(\alpha) = R_{ij}-\alpha (R_{ij} + \delta_{ij} I_{i} ), $$
(14)

where \(\delta_{ij}\) denotes the Kronecker delta. The solution of this QUBO instance, Eq. 1, represents the optimal feature subset. We provide a short review of QUBOs and their solution strategies in Section 3. A complete pipeline of how FS is performed according to our proposed framework is shown in Fig. 1.
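For illustration, a minimal numpy sketch of assembling Q(α) from precomputed R and I according to Eqs. 13 and 14:

```python
import numpy as np

def qubo_matrix(R, I, alpha):
    """Q(alpha) with entries Q_ij = R_ij - alpha * (R_ij + delta_ij * I_i), Eq. 14."""
    return R - alpha * (R + np.diag(I))

def objective(Q, x):
    """Objective value x^T Q(alpha) x of a binary selection vector x, Eq. 13."""
    return float(x @ Q @ x)
```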

Fig. 1
figure 1

Our proposed Quantum Feature Selection pipeline: From a given data set, the importance vector I, Eq. 3, and the redundancy matrix R, Eq. 4, are calculated. They are combined by interpolation with a factor α after the sign of I has been flipped, which results in a QUBO matrix, Eq. 14. The corresponding QUBO problem, Eq. 1, is solved either through quantum computing or classical solvers. The resulting binary solution vector x is a bit mask that indicates the selected features

2.2 Controlling the number of selected features

For FS, it is of particular interest to be able to select a specific number of features. Formally, this can be realized by adding a constraint to Eq. 1 such that the selection of k features is enforced. The resulting constrained optimization problem reads

$$ \boldsymbol{x}^{*} := \underset{\underset{\text{s. t.} ~\|{\boldsymbol{x}\|}_{1}=k}{x_{i}\in\lbrace 0,1\rbrace \forall i\in [n]}} {\arg\min} Q(\boldsymbol{x},\alpha) $$
(15)

and is formally not a QUBO in contrast to Eq. 1. To come back to QUBO form, a straightforward approach is to add a penalty term such as \(\lambda \left (\left ({\sum }_{i}x_{i}\right )-k\right )^{2}\) to Q(x,α), which is only equal to zero for a selection of exactly k features. Here, λ represents a strictly positive penalty factor, commonly called a Lagrange multiplier. The problem can then be solved to obtain the desired solution such that \({\|\boldsymbol {x}^{*}\|}_{1}=k\).
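For concreteness, a sketch of how such a penalty would be folded into a dense QUBO matrix (this is the penalty-based alternative, not part of our method): expanding the square and using \(x_i^2 = x_i\) adds λ to every off-diagonal entry, λ(1 − 2k) to every diagonal entry, and a constant λk² that can be dropped.

```python
import numpy as np

def add_cardinality_penalty(Q, k, lam):
    """Add lam * (sum_i x_i - k)**2 to a QUBO matrix Q (the constant lam * k**2 is dropped)."""
    n = Q.shape[0]
    penalty = lam * (np.ones((n, n)) - np.eye(n)) + lam * (1 - 2 * k) * np.eye(n)
    return Q + penalty
```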

However, two challenges arise from this approach. First, a suitable choice of λ is not clear and depends on the magnitude of Q(x,α). If it is chosen too small, the imposed constraint might be ignored for certain solutions. On the other hand, if it is chosen too large, it may lead to a very large value range and, consequently, to loss of precision. In summary, having both very small and very large elements in the QUBO matrix limits the amount of scaling available to amplify meaningful differences between loss values. Furthermore, it is not clear in the first place whether a feasible solution to the constrained problem can be found at all.

Due to these difficulties, we propose an alternative strategy to specify the number of selected features. Instead of resorting to penalty terms, we use the fact that the choice of α itself can be used to control the number of features present in the solution. This presumption can be motivated by considering the extremal values of α ∈ [0,1]. If we set α = 0, all diagonal entries of Q(0) become 0 and we put full emphasis on redundancy. Trivially, both the empty set of features and any single feature is least redundant, so that the optimal selection is either the empty set or a single feature {i} with i ∈ [n]. Conversely, if α = 1, the problem becomes linear with only negative coefficients, leading to a selection of all features as the optimal solution. Consequently, by iteratively varying α from 0 to 1 in sufficiently small steps, we observe that \(\lVert \boldsymbol {x}^{*}\rVert _{1}\) increases monotonically in steps of one from 0 to n. This observation suggests that we can tune α to obtain a subset of any desired size \(k\in \lbrace 0,1,\dots ,n\rbrace \). As we show with Proposition 1, this assumption is indeed true.

Proposition 1

For all Q(⋅,α) defined as in Eq. 13 and \(k\in \lbrace 0,1,\dots ,n\rbrace \), there is an α ∈ [0,1] such that \(\boldsymbol {x}^{*}\in \arg \min \limits _{\boldsymbol {x}}Q(\boldsymbol {x},\alpha )\) and \({\|\boldsymbol {x}^{*}\|}_{1}=k\).

The proof can be found in Appendix A. This result shows that we do not need additional constraints on the QUBO instance to control the number of features present in the global optimum.

2.3 QFS algorithm

From the result of Proposition 1, we can devise an algorithm that, if provided with R, I and k, returns an α such that an optimal feature subset vector \(\boldsymbol{x}^{*}\) of Q(⋅,α) has exactly k non-zero entries. For this purpose, we introduce

$$ Q^{*}_{k}(\alpha) := \underset{\underset{\text{s. t.} ~{\|\boldsymbol{x}\|}_{1}=k}{\boldsymbol{x}\in\lbrace 0,1\rbrace^{n}}}{\min} Q(\boldsymbol{x},\alpha) $$
(16)

with 0 ≤ k ≤ n, i.e., the minimal function value of Q(x,α), Eq. 13, for a given α when the number of ones in the solution is restricted to k. Furthermore, we denote the minimum with respect to k (i.e., the global minimum) by

$$ Q^{*}(\alpha) := \min_{k} Q^{*}_{k}(\alpha). $$
(17)

Assuming we have an oracle for Q∗(α) that, given any Q(⋅,α), returns a global optimum, an appropriate value for α can be found in \(\mathcal {O}(\log n)\) steps through binary search. The value of α is not necessarily unique.

In addition to the procedure described above, we introduce a threshold 𝜖 ≥ 0, such that Qii(α), Eq. 14, is set to some small positive value μ > 0 whenever αIi < 𝜖, for each i ∈ [n]. That is, we perform the transition Qij(α)↦Qij(α,𝜖,μ) with

$$ Q_{ij}(\alpha,\epsilon, \mu) = \begin{cases} \mu & \text{if }i=j \land \alpha I_{i}<\epsilon \\ Q_{ij}(\alpha) & \text{otherwise} \end{cases} $$
(18)

for arbitrary but constant values of 𝜖 and μ. In our experiments, we observed that, when the importance value of feature i is very close to zero, it has virtually no influence on the function value, and the solvers we used tended to include or exclude it randomly. As we seek to minimize the number of necessary features, we add an artificial weight μ to exclude these features from the optimal solution and avoid randomness. The exact value of μ is not decisive as long as it is positive, which ensures that the respective feature cannot be part of an optimal solution (Glover et al. 2018, Lemma 1.0).

The proposed algorithm is sketched in Algorithm 1. This algorithm is the main contribution of this manuscript.

Algorithm 1
figure d

Our proposed QFS Algorithm: Binary search that, given an integer k, finds a value α, such that the optimal solution of a feature selection QUBO problem with matrix elements Qij(α,𝜖,μ), Eq. 18, contains k features.
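Below is a compact Python sketch of Algorithm 1. The brute-force oracle stands in for small instances only; in our experiments, qbsolv or quantum hardware (Section 3) solves the QUBO instead. The defaults for 𝜖 and μ follow the choices reported in Section 5.

```python
import numpy as np
from itertools import product

def solve_qubo_brute_force(Q):
    """Exhaustive QUBO oracle, feasible only for small n (placeholder for qbsolv, QA, or VQE)."""
    n = Q.shape[0]
    best_x, best_val = None, np.inf
    for bits in product((0, 1), repeat=n):
        x = np.array(bits)
        val = float(x @ Q @ x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x

def qfs(R, I, k, eps=1e-8, mu=None, tol=1e-9, solver=solve_qubo_brute_force):
    """Binary search over alpha until the QUBO optimum selects exactly k features."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        alpha = 0.5 * (lo + hi)
        Q = R - alpha * (R + np.diag(I))                        # Eq. 14
        d = np.diag(Q).copy()
        d[alpha * I < eps] = np.max(Q) if mu is None else mu    # Eq. 18
        Q[np.diag_indices_from(Q)] = d
        x = solver(Q)
        if x.sum() == k:
            return alpha, x
        if x.sum() < k:
            lo = alpha            # too few features: put more weight on importance
        else:
            hi = alpha
    return alpha, x               # tolerance reached; return the last iterate
```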

3 Solving QUBOs

An integral part of our proposed QFS algorithm, Algorithm 1, is the solution of QUBOs, Eq. 1. QUBOs of the form

$$ \underset{\boldsymbol{x}\in\lbrace 0,1\rbrace^{n}}{\min} \boldsymbol{x}^{\intercal} \boldsymbol{Q} \boldsymbol{x} $$
(19)

with symmetric \(\boldsymbol {Q} \in \mathbb {R}^{n \times n}\) as in Eq. 13 are a popular class of optimization problems, known to be NP-hard (Pardalos and Jha 1992). Numerous practical optimization problems have been embedded into QUBO form, ranging from finance and economics (Laughhunn 1970; Hammer and Shlifer 1971) over satisfiability (Kochenberger et al. 2005) to ML tasks such as clustering (Kumar et al. 2018; Bauckhage et al. 2018), vector quantization (Bauckhage et al. 2020), support vector machines (Mücke et al. 2019; Date et al. 2020), and probabilistic graphical models (Mücke et al. 2019), to name a few.

3.1 Classical solvers

Motivated by its relevance for practical problems, a wide range of classical methods to solve QUBOs have been developed, both exactly and approximately. A comprehensive overview of applications and solution methods can be found, e.g., in Kochenberger et al. (2014). Notable among the heuristic approaches are evolutionary algorithms and simulated annealing, for which highly efficient special-purpose hardware has been developed (Mücke et al. 2019; Matsubara et al. 2020) that can quickly find good approximate solutions to QUBO instances with several thousand variables. Another heuristic solver is qbsolv (Booth et al. 2017), which finds approximate solutions of a QUBO instance by iteratively partitioning it into smaller sub-problems, which are solved through a variation of tabu search (Glover and Laguna 1998). The sub-problems are created by “clamping” certain bits of a current solution, i.e., treating them as fixed and only optimizing over the remaining bits. The bits to be optimized are selected according to their impact on the objective function value, i.e., its increase when negating them in the current best solution.
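As a usage illustration, the sketch below converts a symmetric QUBO matrix Q (e.g., from Eq. 14) into the upper-triangular dictionary form consumed by qbsolv and related Ocean tools and solves it approximately; the dwave_qbsolv import and the num_repeats argument reflect our assumptions about that package's interface.

```python
import numpy as np
from dwave_qbsolv import QBSolv  # assumed interface of the classical qbsolv package

def to_qubo_dict(Q):
    """Upper-triangular dictionary {(i, j): coefficient} of a symmetric QUBO matrix,
    so that x^T Q x = sum_{i <= j} dict[(i, j)] * x_i * x_j."""
    n = Q.shape[0]
    return {(i, j): (Q[i, i] if i == j else Q[i, j] + Q[j, i])
            for i in range(n) for j in range(i, n)}

# approximate solution via iterative sub-problem decomposition and tabu search
response = QBSolv().sample_qubo(to_qubo_dict(Q), num_repeats=50)
sample = list(response.samples())[0]                       # best sample as {variable: bit}
x_star = np.array([sample[i] for i in range(Q.shape[0])])
```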

3.2 Quantum solvers

In recent years, quantum computing has opened up a promising approach to solving QUBO instances, which — among its many other applications — makes it interesting for performing FS. For this purpose, any QUBO problem can be encoded in form of a Hamiltonian

$$ \hat{H}={\sum}_{\underset{i \neq j}{i,j=0}}^{n} a_{ij}\hat{\sigma}_{i}\hat{\sigma}_{j} + {\sum}_{i=0}^{n} b_{i}\hat{\sigma}_{i} + c, $$
(20)

where \(\hat {\sigma }_{i}\) denotes a Pauli-z matrix acting on qubit i with eigenvalues ± 1 and corresponding eigenstates |±i〉. The coefficients \(a_{ij},b_{i},c\in \mathbb {R}\) can be found by performing the transformation \(x_{i} \mapsto (1-\hat {\sigma }_{i})/2\) of the objective function in Eq. 19 with \(\hat {\sigma }_{i}^{2}=1\). Since the Pauli spin matrices of different qubits commute, the minimum eigenstate of the Hamiltonian |Ψ〉 with \(\hat {H} |{\Psi }\rangle = E |{\Psi }\rangle \) can be written in terms of |Ψ〉 = |ψ1〉⊗⋯ ⊗|ψn〉, where |ψi〉∈{|+i〉,|−i〉}∀i ∈ [1,n]. Therefore, the eigenvalue \(E = \min \limits _{\boldsymbol {x}\in \lbrace 0,1\rbrace ^{n}} \boldsymbol {x}^{\intercal } \boldsymbol {Q} \boldsymbol {x}\) represents the minimum objective function value of the QUBO. The corresponding solution vector \(\boldsymbol {x}^{*} = \arg \min \limits _{\boldsymbol {x}\in \lbrace 0,1\rbrace ^{n}} \boldsymbol {x}^{\intercal } \boldsymbol {Q} \boldsymbol {x} \) can be obtained from the eigenstate |Ψ〉 with the assignment \(x_{i}^{*} = \vert \langle {-_{i}|{\Psi }\rangle }\vert ^{2}\) based on a projective measurement of each qubit (and is not necessarily unique). Summarized, the QUBO problem can be transformed into the problem of finding the minimum eigenstate (or ground state) |Ψ〉 of \(\hat {H}\).
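The coefficient bookkeeping of this substitution can be sketched in a few lines of numpy; the returned arrays correspond to the couplings a_ij, the fields b_i, and the offset c in Eq. 20.

```python
import numpy as np

def qubo_to_ising(Q):
    """Map x^T Q x with x_i in {0, 1} onto Eq. 20 via x_i -> (1 - sigma_i) / 2.

    Returns couplings a (zero diagonal), local fields b, and constant offset c such that
    x^T Q x = sum_{i != j} a_ij s_i s_j + sum_i b_i s_i + c for spins s_i in {+1, -1}."""
    d = np.diag(Q).copy()
    off = Q - np.diag(d)                                    # off-diagonal part of Q
    a = off / 4.0                                           # coefficients of s_i s_j (i != j)
    b = -d / 2.0 - (off.sum(axis=0) + off.sum(axis=1)) / 4.0
    c = d.sum() / 2.0 + off.sum() / 4.0
    return a, b, c
```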

However, finding the ground state of a Hamiltonian is in general also a non-trivial problem, and possible solution strategies depend on the properties of the quantum hardware and the shape of \(\hat {H}\). Two common approaches are the variational quantum eigensolver (VQE) (Peruzzo et al. 2014), which is suitable for quantum gate computers, and quantum annealing (QA) (Kadowaki and Nishimori 1998; Morita and Nishimori 2008), which is suitable for quantum annealers. VQE uses a hybrid quantum-classical computational approach (McClean et al. 2016) to minimize the expected energy \(\langle {\Psi (\boldsymbol {\theta })}|\hat {H}|{\Psi (\boldsymbol {\theta })}\rangle \). For this purpose, a parametric ansatz \(\hat {U}(\boldsymbol {\theta })\) is prepared on a quantum gate computer such that \(|{\Psi (\boldsymbol {\theta })}\rangle = \hat {U}(\boldsymbol {\theta })|{0}\rangle \). The circuit parameters 𝜃 are learned with a classical optimization in order to find an estimate for the ground state |Ψ〉≈|Ψ(𝜃)〉. In contrast, QA exploits the adiabatic theorem (Farhi et al. 2000) by preparing the ground state |Ψ0〉 of a simple mixing Hamiltonian \(\hat {H}_{0}\) and then slowly transferring the system into the ground state |Ψ〉 of the target Hamiltonian \(\hat {H}\) through adiabatic time evolution (Gruber 1999). A hybrid approach can be realized by splitting the initial QUBO into smaller sub-problems and solving each with QA, e.g., by using the qbsolv strategy explained above.

Since both algorithms are heuristic, typically multiple samples (or “shots”) of the solution are obtained from repeated measurements.

4 Related research

There are numerous approaches of defining optimal feature subsets and finding or approximating them (Chandrashekar and Sahin 2014).

Wrapper methods directly use the performance of classification or regression models as a criterion for selecting features (John et al. 1994). As the model must be re-trained on every candidate subset, this method is very resource-intensive. Finding the optimal subset requires brute-forcing all \(2^{n}\) possible subsets, which is generally intractable for large feature sets and non-trivial models. Instead, heuristic optimization schemes are often used, such as greedy search or evolutionary algorithms (Leardi et al. 1992; Siedlecki and Sklansky 1993).

Filter methods use a measure of relevance, e.g., correlation or MI with the label or target variable, for ranking features and discarding those below a certain threshold. While those methods are very easy to compute and work well in practice, they often do not consider redundancy between selected features, leading to subsets that could potentially be much smaller. When considering redundancy, such as pairwise MI between selected features, as an additional criterion to be minimized, the optimization problem becomes non-linear and NP-hard.

In Rodriguez-Lujan et al. (2010), this problem is formulated as a quadratic programming (QP) task, which is solved approximately by means of dimension reduction. The resulting solution is a real-valued weight vector that is used for ranking the features. This last step is purely heuristic, as the QP task is merely a relaxation of the corresponding QUBO problem with binary weight variables which we discuss in this article. QUBOs become prospectively tractable through quantum computation, which is why we do not resort to approximations or relaxations in our approach to make our method feasible.

A QUBO formulation based on redundancy and importance has recently appeared in the literature (Tanahashi et al. 2018; Otgonbaatar and Datcu 2021); however, the authors rely on a penalty term with Lagrange multiplier λ for restricting the solution to k features. The choice of λ is not trivial and can drastically increase the dynamic range of the QUBO coefficients if chosen too large, as discussed above. In our approach, we observe that we can weigh redundancy and importance against each other in order to control the number of features present in the optimal solution, rendering any additional constraints obsolete.

Another approach relies on quantum gate computing to apply Grover's algorithm to an oracle that yields the improvement in accuracy by adding or removing single features (He et al. 2018). This is merely a quantum version of a simple sequential wrapper approach, which improves the theoretical runtime for greedily selecting the next feature to be inserted or removed, as Grover-based search finds the minimum or maximum of a function over n candidates with \(\mathcal{O}(\sqrt{n})\) oracle queries, a quadratic speedup over classical search. The solution quality is the same as for a classical greedy algorithm, or potentially worse, as Grover's algorithm is probabilistic.

5 Experiments

In Section 2, we have presented our novel QFS algorithm, Algorithm 1. The current section contains a study of four different experiments to evaluate the performance of QFS.

For this purpose, we use six data sets, both synthetic and taken from real-world data sources. All data sets are listed in Table 1, and a detailed description can be found in Appendix B. For discretizing the data as described in Section 2.1, we choose B = 20. Furthermore, we set 𝜖 = 10− 8 and \(\mu =\max \limits _{i,j\in [n]}Q_{ij}(\alpha )\) in Algorithm 1. These parameters are determined empirically by careful testing.

Table 1 Data sets used for our numerical experiments. For each data set, we list the number of features n, the number of classes c, and the number of samples S. A detailed description can be found in Appendix B

Our first experiment in Section 5.1 serves as a proof of principle, in which we evaluate whether QFS is able to find any useful features at all. This is realized by selecting 30 features from the mnist data set through QFS and training separate 1-vs-all classifiers (on all digits). To evaluate the usefulness of the selected features, we compare the performance of these classifiers to those trained on 30 random features and all available features, respectively. The second experiment in Section 5.2 is a wider empirical comparison of various combinations of commonly used FS methods and ML models with the goal of showing that QFS is competitive. In Section 5.3, we apply QFS in a more application-oriented setting by using the selected features as a means of lossy data compression, interpreting the reduced feature space as a latent representation and training a convolutional neural network to reconstruct the original features. Finally, we use quantum hardware in Section 5.4 to solve two exemplary feature selection QUBO instances for QFS. This experiment demonstrates that our method can indeed be used with current NISQ devices.

5.1 Experiment 1: feature quality

The first experiment serves as a proof of principle, in which we verify that QFS is able to find informative features (Fig. 2).

Fig. 2
figure 2

Experiment 1: feature subsets found through Quantum Feature Selection on all separate digits of the mnist data set. The black pixels represent the selected features. The digits are ordered from left to right, starting with 0 on the left

5.1.1 Setup

For this purpose, we consider the mnist data set and run Algorithm 1 ten times with k = 30 such that we use about 3.8% of the original 784 features (i.e., pixels). To solve the QUBO instances classically, we use qbsolv (Booth et al. 2017). As redundancy, we use the pairwise MI matrix over all features, Eq. 4. As importance for digit d, we calculate the MI between the features xi and a binary variable indicating whether a sample's label equals d, Eq. 3. This yields ten feature selections, each tailored towards one digit, which we use to perform a 1-vs-all classification. To quantify whether the selected features from our method are informative or not, we train a Random Forest classifier on these features and determine its accuracy. For comparison, we also determine the accuracy of a Random Forest classifier that has been trained on the whole feature set and a set of randomly selected features (uniformly sampled from the set of k-element subsets of [n] for each digit), respectively. Specifically, the Random Forest is composed of 100 estimators, each a Decision Tree of maximal depth five and a maximum of five features considered when searching for the best split. These restrictions serve to limit the model size, as is a common objective in applications which require FS. We use the Python implementation provided by scikit-learn (Pedregosa et al. 2011).
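A sketch of this evaluation step with the stated hyperparameters; the names X, y, digit, and qfs_mask are placeholders for the loaded mnist data and the mask returned by Algorithm 1.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X: (N, 784) pixel values, y: (N,) digit labels as numpy arrays (assumed),
# qfs_mask: boolean mask of the 30 features selected for the current digit.
clf = RandomForestClassifier(n_estimators=100, max_depth=5, max_features=5)
labels = (y == digit).astype(int)                  # 1-vs-all target for the current digit
scores = cross_val_score(clf, X[:, qfs_mask], labels, cv=10)
print(f"digit {digit}: {scores.mean():.4f} +/- {scores.std():.4f}")
```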

5.1.2 Results

The selected features (pixels) per digit are shown in Fig. 2. We perform 10-fold cross validation and report mean and standard deviation of the classification accuracy, which is visualized in Fig. 3. For “Random subsets,” we report the cross-validated accuracy averaged over five random subsets, so mean and standard deviation are reported over 50 runs in total.

Fig. 3
figure 3

Experiment 1: 10-fold cross-validated accuracy of binary Random Forest classifiers of each separate digit of the mnist data set. The classifiers are trained on (1) all 784 features, (2) 30 randomly sampled features, and (3) 30 features selected by Quantum Feature Selection. The accuracy (10-fold CV) using random features is averaged over 5 different random subsets. The standard deviation (one sigma) is represented by the smaller error bars on top of the bars, which show the mean

We observe that the Random Forest model with the constrained estimators achieves the best results on all digits when trained on the features found through QFS. This confirms that our method indeed finds informative features that are useful for classification. Interestingly, the models trained on the QFS subsets not only outperform the models trained on random subsets, but also those using all features. This is probably due to the restricted number of features per split in the base estimators, which leads to a higher chance of picking non-informative features when using all available pixel values.

5.2 Experiment 2: cross-model comparison with FS methods

In the second experiment, we contrast our method of QFS to other feature selection methods by comparing the accuracy of various classification models trained on the respective feature subsets in analogy to the previous experiment from Section 5.1.

5.2.1 Setup

Specifically, we apply Algorithm 1 to five different data sets to obtain feature subsets of fixed sizes k. We consider 30 features (3.8%) for mnist, five features (14.7%) for ionosphere, 20 features (4%) for madelon, 20 features (5%) for synth_100, and 5 features (23.8%) for waveform. For madelon and synth_100, we use the known number of informative features, while for the remaining data sets we chose the feature subset sizes arbitrarily in varying percentage ranges of the total number of features.

To evaluate the quality of the selected features, we compare QFS to three other heuristic FS methods:

  1. 1.

    Ranking obtained from the Euclidean norm over the coefficients of n 1-vs-all Logistic Regression (LR) models.

  2. 2.

    Ranking obtained from the impurity-based feature importances given by an Extra Trees Classifier (ET) with 100 estimators.

  3. 3.

    Recursive Feature Elimination (RFE) (Guyon et al. 2002) performed on a Decision Tree of maximal depth 10.

The resulting feature subsets are then used to train five classification models:

  1. 1.

    Neural Network

  2. 2.

    1-vs-all Logistic Regression

  3. 3.

    Decision Tree

  4. 4.

    Random Forest (of 100 decision trees)

  5. 5.

    Naive Bayes classifier (with Gaussian prior)

The Neural Network has a single hidden layer containing \(\lfloor \sqrt {k}+0.5\rfloor \) neurons to make the dependency between the number of parameters and selected features k linear. For the activation function, we use ReLU. Both the Neural Network and the Logistic Regression are limited to 1000 learning iterations. Again, we use Python implementations provided by scikit-learn (Pedregosa et al. 2011) for all models and FS methods.
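For reference, a sketch of the three baseline selection methods with the hyperparameters stated above (remaining settings are scikit-learn defaults; carrying the 1000-iteration limit over to the ranking Logistic Regression and the helper function itself are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE

def baseline_selections(X, y, k):
    """Return the indices of the k features chosen by each baseline method."""
    # (1) ranking by the Euclidean norm of the 1-vs-all Logistic Regression coefficients
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    lr_idx = np.argsort(-np.linalg.norm(lr.coef_, axis=0))[:k]
    # (2) ranking by the impurity-based importances of an Extra Trees classifier
    et = ExtraTreesClassifier(n_estimators=100).fit(X, y)
    et_idx = np.argsort(-et.feature_importances_)[:k]
    # (3) Recursive Feature Elimination wrapped around a depth-10 Decision Tree
    rfe = RFE(DecisionTreeClassifier(max_depth=10), n_features_to_select=k).fit(X, y)
    return lr_idx, et_idx, np.flatnonzero(rfe.support_)
```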

On every model, we perform a 10-fold cross validation and report mean and standard deviation of the classification accuracy.

5.2.2 Results

The results are visualized in Fig. 4. The results of this experiment show that QFS, in general, compares favorably among FS methods. The RFE method using decision trees often leads to better accuracies, but comes at the cost of much higher computation time, owing to the fact that it is a wrapper method which requires the model to be re-trained in every iteration.

Fig. 4
figure 4

Experiment 2: 10-fold cross-validated accuracy of various classifiers using feature subsets produced by various feature selection methods on three different data sets. The standard deviation (one sigma) is represented by the smaller bars on top of the bars showing the mean

For the data sets madelon and synth_100, we know the ground-truth informative features, which allows us to evaluate the distance between the optimal feature subset and the subset found by each method. To this end, we use the edit distance between pairs of feature subsets, which is the number of features that need to be swapped in order to turn one subset into the other. Alternatively, it is the Hamming distance between the binary feature indicator vectors, divided by two. We represent each FS method used in this experiment as a node in an undirected graph and each edit distance between the feature vectors they produced as a weighted edge. Nodes of distance 0 are represented as cluster nodes. The resulting graphs are shown in Fig. 5.
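A minimal sketch of this distance for two boolean feature masks of equal cardinality:

```python
import numpy as np

def edit_distance(mask_a, mask_b):
    """Number of features to swap to turn one selection into the other,
    i.e., half the Hamming distance between the binary indicator vectors."""
    return int(np.sum(mask_a != mask_b)) // 2
```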

Fig. 5
figure 5

Experiment 2: edit distances between feature subsets found through the different feature selection methods on the data sets (a) madelon and (b) synth_100 for which the ground-truth informative features are known. Each node represents a FS method, and each edge weight is the edit distance, i.e., the number of features that need to be swapped in order to convert one selection into the other. We show the ground truth (GT) as well as Logistic Regression ranking (LR), Extra Tree classifier ranking (ET), Recursive Feature Elimination (RFE), and our method (Quantum Feature Selection). Methods within the dotted circles yield the same feature subset

QFS is able to find all informative features for synth_100, and is closest to ground truth on madelon among all other methods except for the Extra Trees classifier ranking, which is able to find the ground-truth features in both cases. This result shows that MI, when used as measure of redundancy and importance in QFS, produces useful feature subsets on both data sets where ground truth is known. Moreover, FS on these data sets seems to work very well with the Extra Trees classifier ranking method, which indicates that impurity-based measures are particularly effective here.

5.3 Experiment 3: application: lossy compression with autoencoder

The previous experiments have confirmed that QFS picks features of a data set which are important and carry little redundancy according to our criteria of choice. From a different perspective, this can be interpreted as removing unimportant and redundant features, which leads to much smaller data sets. Consequently, QFS can be used as a type of lossy compression by computing an optimal feature subset S and discarding all features i ∉ S. From this compressed representation, we can reconstruct the original data points approximately by means of some ML model (Kingma and Welling 2013). This process is shown schematically in Fig. 6. In this third experiment, we evaluate the lossy compression empirically.

Fig. 6
figure 6

Experiment 3: compression and decompression pipeline using Quantum Feature Selection with k = 25 and a convolutional decoder. The features are extracted by applying the feature selection mask to the image, which yields the compressed representation (compression rate of 25/784 ≈ 3.19%). Then, the decoder is trained by minimizing the mean squared error between the reconstruction and the original image. The feature mask shown above is the actual feature mask used in the experiment

5.3.1 Setup

To realize this experiment, we use Algorithm 1 to perform QFS on the mnist data set with k = 25 (i.e., 3.19% of all features). The resulting subset S of pixel positions is interpreted as a latent space, i.e., a compressed data representation like one we would obtain from principal component analysis (PCA) or an autoencoder. While PCA is based on an eigendecomposition that is used to project data onto axes of maximal variance, the subspace S induced by the feature selection fulfills other optimality criteria based on redundancy and importance.

By feeding this latent representation into a convolutional neural network (CNN), we can reconstruct the original images by projecting back to a size of 28 × 28 and minimizing the difference between reconstruction and original. To this end, our CNN architecture consists of a linear input layer with k inputs and 392 outputs. The output is reshaped to 8 channels of size 7 × 7. Using two sequential 2D transposed convolution operations interspersed with ReLU activations, the images are first inflated to 16 × 14 × 14 and finally to 1 × 28 × 28. A sigmoid is applied to the output in order to obtain pixel values between 0 and 1. The model weights are trained by minimizing the mean squared error (MSE) between the original samples x and the reconstructions,

$$ \frac{1}{784}\|{\boldsymbol{x}-f_{\boldsymbol{\theta}}(\boldsymbol{x}_{S})}\|_{2}^{2} ~\rightarrow ~\min_{\boldsymbol{\theta}}, $$
(21)

where f𝜃 is the model function with weights 𝜃, and xS the sub-vector of x containing only the features in S found through QFS. We train the model for 1000 epochs with batches of size 250, using the Adam optimizer (Kingma and Ba 2014) from PyTorch (Paszke et al. 2019) with a 1cycle learning rate scheduler (Smith and Topin 2018) with maximum learning rate 0.01.
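A sketch of this decoder and its training setup in PyTorch; the kernel size, stride, and padding of the transposed convolutions as well as the steps_per_epoch value (training-set size divided by the batch size) are assumptions chosen to reproduce the stated tensor shapes:

```python
import torch
from torch import nn

class Decoder(nn.Module):
    def __init__(self, k=25):
        super().__init__()
        self.fc = nn.Linear(k, 392)                                          # k -> 392 = 8 * 7 * 7
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(8, 16, kernel_size=4, stride=2, padding=1),   # 8x7x7 -> 16x14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),   # 16x14x14 -> 1x28x28
            nn.Sigmoid(),                                                    # pixel values in [0, 1]
        )

    def forward(self, x_s):                      # x_s: (batch, k) selected pixel values
        h = self.fc(x_s).view(-1, 8, 7, 7)
        return self.deconv(h)

model = Decoder(k=25)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=1000, steps_per_epoch=240)  # assumes 60000 samples / batch size 250
```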

5.3.2 Results

The CNN achieves a MSE of 21.7389, which corresponds to an average squared pixel deviation of 0.0277. Figure 7 shows 20 random mnist samples on the left, and their respective reconstructions using the procedure described above.

Fig. 7
figure 7

Experiment 3: visual comparison of mnist samples reconstructed from 25 pixels at fixed positions selected through Quantum Feature Selection. The reconstruction is implemented by a convolutional neural network. The original mnist samples are shown in (a), and their corresponding reconstructions in (b)

The reconstructed samples are visually very similar to the originals, which suggests that the pixel positions found through QFS carry useful information about the samples that helps to deduce the values of nearby pixels.

5.4 Experiment 4: QFS on quantum hardware

So far, we have only used classical QUBO solvers to perform QFS. In this final experiment, we consider QFS with actual quantum hardware as a QUBO solver, using both a quantum annealer and a quantum gate computer.

5.4.1 Setup

Based on QFS, we construct QUBO instances for three data sets, ionosphere, waveform, and synth_10 as before, but solve them using QA on a D-Wave quantum annealer and VQE on an IBM gate quantum computer as described in Section 3.2.

To conduct this experiment, we first perform an entirely classical run of Algorithm 1 to obtain values for α for a predefined number of selected features k. In principle, this search for α can be done on quantum hardware as well. However, we can reduce the quantum computation time significantly by pre-computing α classically. This is crucial since access to the quantum hardware is time-consuming, especially for the IBM gate quantum computer. For this reason, we also only consider the synth_10 data set for the VQE algorithm. The resulting classically obtained values of α are listed in Table 2 together with the chosen number of selected features k. Using these values, we assemble a QUBO instance for each data set and solve these QUBO instances using both hardware approaches.

Table 2 Experiment 4: For each of the three data sets of interest, we list the number of features n, the chosen number of selected features k, and the resulting value of α from a classical evaluation of Algorithm 1 used for the quantum experiment

The D-Wave quantum annealer Advantage 5.1 is accessed via D-Wave’s cloud service Leap. It operates on 5627 qubits (Update 2021). We use the QA implementation provided by ocean with the DWaveSampler and default parameters. We evaluate the QUBO instances for all three data sets. In total, we obtain 1024 samples per QUBO instance, each representing an estimate of the solution. Additionally, we perform the same experiment using simulated annealing, again using D-Wave’s Python implementation contained in ocean with default parameters.
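A hedged sketch of the corresponding sampling calls with D-Wave's Ocean tools; wrapping the DWaveSampler in an EmbeddingComposite for minor-embedding and using the neal package for simulated annealing are our assumptions, since the text above only fixes the samplers and the number of reads.

```python
import neal
from dwave.system import DWaveSampler, EmbeddingComposite

# Q_dict: the QUBO in dictionary form {(i, j): coefficient} for the pre-computed alpha.
qa_sampler = EmbeddingComposite(DWaveSampler())            # minor-embedding onto the annealer graph
qa_result = qa_sampler.sample_qubo(Q_dict, num_reads=1024)

sa_sampler = neal.SimulatedAnnealingSampler()              # classical simulated annealing baseline
sa_result = sa_sampler.sample_qubo(Q_dict, num_reads=1024)

best = qa_result.first.sample                              # lowest-energy sample as {variable: bit}
```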

The IBM quantum gate computer ibmq_ehningen, on the other hand, operates on a Falcon r5.11 processor with 27 qubits. It is accessible via IBM’s cloud service IBM Quantum. We use Qiskit Runtime’s default VQE implementation (ANIS et al. 2021) with the simultaneous perturbation stochastic approximation (SPSA) optimizer (Spall 1998). We let the optimizer run for 32 iterations with Qiskit’s default parameters. As our parametric ansatz, we chose a Pauli two-design (Nakata et al. 2017) with four layers. Since the computations on the quantum gate hardware are time-consuming, we only evaluate the QUBO instances for one of the three data sets, the synth_10 data set. Again, we obtain 1024 samples from the resulting circuit.

5.4.2 Results

The results for QA are shown in Fig. 8. On the left, we show all 1024 samples sorted in ascending order of energy, such that the lowest measured energies are on the left. Mean and standard deviation are reported over the sorted sequences of 16 runs. The globally optimal energies, which we found by brute force, are shown as horizontal lines. On the right, we show for each solution bit (corresponding to feature indices for QFS) the number of times the corresponding bit was measured as 1 across all shots. Again, we report mean and standard deviation over 16 runs. The color of each bar indicates whether the corresponding global optimum of the respective bit is 0 or 1. The sequence of optimal bits corresponds to the optimal feature selection for our application. Figure 9 is analogous to Fig. 8, showing the results obtained through simulated annealing.

Fig. 8
figure 8

Experiment 4: histograms of samples from the D-Wave quantum annealer performing Quantum Feature Selection on the data sets (a) synth_10, (b) waveform and (c) ionosphere described in Table 1. For each run, 1024 samples were performed and the energy and occurrences of all solutions recorded. Left: energy values of all shots sorted in ascending order, reported as mean and standard deviation over 16 runs. The globally optimal energies (found by brute force) are shown as horizontal lines. Right: index of each solution bit (corresponding to feature index for Quantum Feature Selection) versus the number of times the corresponding bit was measured to be 1 over all shots, again reported as mean and standard deviation over all 16 runs. The color of each bar indicates whether the corresponding global optimum of the respective bit is 0 or 1. This global optimum represents the optimal selection of features as given by Quantum Feature Selection

Fig. 9
figure 9

Experiment 4: same results as in Fig. 8, but using simulated annealing instead of the D-Wave machine. Optimal bits are found more frequently compared to Quantum Annealing, which is probably due to additional noise introduced by the quantum hardware device

The histograms show that there is a clear correspondence between feature optimality (bar color) and the number of occurrences, which indicates that QA is able to find the global optimum in a certain fraction of samples. With increasing number of qubits, this correlation gets noticeably less pronounced. To be precise, the optimum was found in 10.78 ± 5.12% of samples for synth_10, in 0.18 ± 0.18% for waveform, and only a single time across all 16 runs for ionosphere. In contrast, simulated annealing finds the correct bits with higher probability, even for data of higher dimension: The optima are found in 100.00 ± 0.00% of shots for synth_10, 20.39 ± 1.40% for waveform, and 21.04 ± 1.02% for ionosphere. This result indicates that the use of NISQ hardware compromises the solution quality, possibly due to loss of precision when loading the QUBO parameters onto the quantum annealer, or additional noise introduced by read-out errors.

The result for VQE is shown in Fig. 10 in analogy to Fig. 8. Mean and standard deviation are obtained from 5 runs. We find no clear correspondence between globally optimal bits and number of occurrences. The global optimum was only found in 0.14 ± 0.08% of measurements. We assume that hardware noise, as well as the low number of optimization steps that were used due to long run times, lead to this performance. In particular, we expect that VQE could perform better for longer run times and less noisy hardware. It is important to understand that in contrast to the D-Wave results, where each shot represents one approximation run to find the underlying QFS optimum, the 1024 VQE samples were all taken from one optimized circuit.

Fig. 10
figure 10

Experiment 4: histograms of samples from the IBM quantum gate computer performing Quantum Feature Selection on the synth_10 data set in analogy to Fig. 8. Measurements of all qubits from the resulting circuit are treated in the same way as the samples from the D-Wave quantum annealer. While the optimal energy (horizontal line) is reached for some samples, the overall probability to observe this global optimum is significantly lower compared with the quantum annealer. This may be due to insufficient convergence or hardware noise

In summary, we find that near-term quantum devices can in principle be used for QFS, especially for low data dimensions and on special-purpose hardware like quantum annealers that are designed to solve QUBO problems.

6 Conclusion

In this article, we present a novel algorithm for performing feature selection based on a generalized QUBO embedding, which can be solved on both classical and quantum hardware. To this end, we use mutual information as a basis for measures of importance and redundancy, which we balance using an interpolation factor α. We show theoretically and empirically that our method allows the selection of feature subsets of any desired size k without resorting to constraints on the solution space.

To demonstrate our framework’s effectiveness, we have performed a range of experiments, comparing common feature selection methods and the resulting performance of different ML models. Furthermore, we have also realized a practical application for lossy data compression.

One of our experiments has been run on actual quantum hardware, which further demonstrates that our algorithm is viable and NISQ-compatible. Our experiments are conducted on rather low-dimensional problems; however, this experimental setup is dictated by the currently available hardware, and we expect our algorithm to scale in accordance with future quantum computing developments.

Our framework can easily be modified by changing measures of importance and redundancy. Other choices instead of MI include entropy, Pearson correlation, and other statistical or information-theoretic measures. Even expert knowledge can be incorporated, if available. As Proposition 1 is valid for general I and R with non-negative entries, the proof holds for any combination of importance and redundancy measures, and Algorithm 1 can be applied accordingly.