Introduction

In many areas, such as text processing, biological information analysis, and combinatorial chemistry, data are often represented as high-dimensional feature vectors, yet only a small subset of the features is typically necessary for subsequent learning and classification tasks. Dimensionality reduction to a low-dimensional space is therefore preferred, and it can be achieved by either feature selection or feature extraction (Guyon & Elisseeff 2003). In contrast to feature extraction, feature selection aims to find the most representative or discriminative subset of the original feature space according to some criterion while maintaining the original representation of the features. In recent years, feature selection has attracted much research attention and has been widely used in a variety of applications (Yu et al. 2014; Ma et al. 2012b).

According to the availability of labels for the training data, feature selection can be classified into supervised feature selection (Kira et al. 1992; Nie et al. 2008; Zhao et al. 2010) and unsupervised feature selection (He et al. 2005; Zhao & Liu 2007; Yang et al. 2011; Peng et al. 2005). Supervised feature selection selects features according to the label information of each training sample. Unsupervised methods, which cannot access label information directly, typically select the features that best preserve the data similarity or the manifold structure of the data.

Research on feature selection mainly focuses on search strategies and measurement criteria. The search strategies can be divided into three categories: exhaustive search, sequential search, and random search. Exhaustive search aims to find the optimal solution among all possible subsets; however, it is NP-hard and thus impractical to run. Sequential search methods, such as sequential forward selection and sequential backward elimination (Kohavi & John 1997), start from an empty set or from the set of all candidates as the initial subset and successively add features to, or eliminate features from, the subset one by one. The major drawback of traditional sequential search methods is that they rely heavily on the search route. Although sequential methods do not guarantee the global optimality of the selected subset, they are widely used because of their simplicity and relatively low computational cost, even for large-scale data. Plus-l-minus-r (l-r) (Devijver 1982), a slightly more reliable sequential search method, considers deleting features that were previously selected and selecting features that were previously deleted; however, it only partially removes the dependence on the search route and introduces additional parameters. Random search methods, such as random hill climbing and its extension, sequential floating search (Jain & Zongker 1997), introduce randomized steps into the search and select features from the candidates with some probability per feature.
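For illustration, the following is a minimal Python sketch of greedy sequential forward selection; the `score_subset` callback (for example, cross-validated accuracy of a downstream classifier) is a hypothetical user-supplied criterion and is not prescribed by the methods surveyed above.

```python
import numpy as np

def sequential_forward_selection(score_subset, n_features, n_select):
    """Greedy sequential forward selection (sketch).

    score_subset: hypothetical user-supplied callable mapping a list of
        feature indices to a quality score (higher is better).
    n_features: total number of candidate features.
    n_select: number of features to keep.
    """
    selected, remaining = [], list(range(n_features))
    while len(selected) < n_select:
        # Evaluate each remaining candidate added to the current subset.
        scores = [score_subset(selected + [f]) for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```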

The measurement criterion is also an important research direction in feature selection. Data variance (Duda et al. 2001) scores each feature by its variance along the corresponding dimension. This criterion finds features that are useful for representing the data; however, such features may not be useful for preserving discriminative information. Laplacian score (He et al. 2005) is a locality graph-based unsupervised feature selection algorithm whose score reflects the locality preserving power of each feature.
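As a concrete (if simple) illustration of the variance criterion, the sketch below scores each feature of an \( m \times N \) data matrix, following the samples-as-columns convention used later in this paper; the function name is ours and only illustrative.

```python
import numpy as np

def variance_scores(X):
    """Score each feature by its variance across samples (Duda et al. 2001).

    X: m x N data matrix with one sample per column.
    Returns the per-feature variances; under this criterion the features
    with the largest variance would be ranked highest.
    """
    return np.var(X, axis=1)

# Example: indices of the n highest-variance features.
# idx = np.argsort(variance_scores(X))[::-1][:n]
```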

Recently, Wright et al. presented a Sparse Representation-based Classification (SRC) method (Wright et al. 2009). Sparse representation-based feature extraction has since become an active research direction. Qiao et al. (2010) present a Sparsity Preserving Projections (SPP) method, which aims to preserve the sparse reconstructive relationship of the data. Zhang et al. (2012) present a graph optimization method for dimensionality reduction with sparsity constraints, which can be viewed as an extension of SPP. Clemmensen et al. (2011) provide a sparse linear discriminant analysis with a sparseness constraint on the projection vectors.

To the best of our knowledge, feature selection with a direct connection to SRC has not yet emerged. In this paper, we use SRC as a measurement criterion to design an unsupervised feature selection algorithm called sparsity preserving score (SPS). The formulated objective function, which is essentially a discrete optimization problem, seeks a binary linear transformation such that the sparse representation coefficients are preserved in the low-dimensional space. When the sparse representation is fixed, our theoretical analysis guarantees that the objective function can be solved in closed form, yielding the optimal solution. SPS simply ranks each feature by the Frobenius norm of the sparse linear reconstruction residual in the space of the selected features.

Background

Unsupervised feature selection criterion

Let \( x_i \in R^{m \times 1} \) be the \( i \)th training sample and \( X = [x_1, x_2, \dots, x_N] \in R^{m \times N} \) be the matrix composed of all training samples. The unsupervised criterion to select \( m' \) (\( m' < m \)) features is defined as

$$ \min_A \mathrm{loss}\left(X, X U^A\right) + \mu\,\Omega\left(U^A\right) $$

where \( A \) is the set of indices of the selected features, \( U^A \) is the corresponding \( m \times m \) feature selection matrix, and \( XU^A \) is the reconstruction from the reduced space \( R^{m' \times N} \) back to the original space \( R^{m \times N} \). \( \mathrm{loss}(\cdot) \) is the loss function, and \( \mu\,\Omega(U^A) \) is the regularization term with parameter \( \mu \).

Sparse representation

Given a test sample \( y \), we represent \( y \) in an overcomplete dictionary whose basis vectors are the training samples themselves, i.e., \( y = X\beta \). If the system of linear equations is underdetermined, this representation is naturally sparse. The sparsest solution can be sought by solving the following \( \ell_1 \) optimization problem (Donoho 2006; Candès et al. 2006):

$$ \widehat{\beta} = \arg\min_{\beta} \left\| \beta \right\|_1, \quad \mathrm{s.t.}\ \ y = X\beta $$
(1)

This problem can be solved in polynomial time by standard linear programming algorithms (Chen et al. 2001).
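For illustration only, the \( \ell_1 \) problem in (1) can be recast as a linear program by splitting \( \beta \) into non-negative parts; the sketch below uses SciPy's generic LP solver as a stand-in for the specialized solvers cited above.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||beta||_1 s.t. y = X beta via linear programming (sketch).

    With beta = u - v and u, v >= 0, the objective ||beta||_1 becomes
    sum(u) + sum(v) subject to [X, -X] @ [u; v] = y.
    """
    n = X.shape[1]
    c = np.ones(2 * n)                 # minimize sum(u) + sum(v)
    A_eq = np.hstack([X, -X])          # equality constraint [X, -X][u; v] = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v                       # recovered sparse coefficients
```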

Methods

We formulate our strategy to select \( n \) (\( n < m \)) features as follows: given a set of unlabeled training samples \( x_i \in R^{m \times 1} \), \( i = 1, \dots, N \), learn a feature selection matrix \( P \in R^{n \times m} \) such that \( P \) is optimal according to our objective function. For the task of feature selection, \( P \) is required to be a special 0–1 binary matrix that satisfies two constraints: (1) each row of \( P \) has one and only one non-zero entry, which equals 1, and (2) each column of \( P \) has at most one non-zero entry. Accordingly, the sum of the entries in each row equals 1 and the sum of the entries in each column is less than or equal to 1. For testing, \( x_i' = U^T x_i \) is the new representation of \( x_i \), where \( x_i'(k) = x_i(k) \) if the \( k \)th feature is selected and \( x_i'(k) = 0 \) otherwise.
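The following is a minimal sketch of this constraint structure, assuming \( P \in R^{n \times m} \) so that \( P x_i \) keeps exactly the selected entries of \( x_i \); the helper name is ours and only illustrative.

```python
import numpy as np

def selection_matrix(selected_idx, m):
    """Build the 0-1 feature selection matrix P for given feature indices (sketch).

    selected_idx: indices of the n selected features.
    m: original feature dimension.
    """
    n = len(selected_idx)
    P = np.zeros((n, m))
    P[np.arange(n), selected_idx] = 1.0
    # Each row sums to 1; each column sums to at most 1.
    assert np.all(P.sum(axis=1) == 1) and np.all(P.sum(axis=0) <= 1)
    return P

# Example: P = selection_matrix([0, 3, 7], m=10); P @ x keeps features 0, 3, 7.
```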

We define the following objective function to minimize the sparse linear reconstruction residual, measuring sparsity by the \( \ell_1 \)-norm of the coefficients.

$$ \begin{array}{rl} \min_{P,\{\beta_i\}_{i=1}^{N}} & J\left(P,\beta_i\right) := \sum_{i=1}^{N} \left\| P x_i - P D_i \beta_i \right\|_F^2 + \lambda \left\| \beta_i \right\|_1 \\ \mathrm{s.t.} & \sum_{j=1}^{m} P\left(i,j\right) = 1 \\ & \sum_{i=1}^{n} P\left(i,j\right) \le 1 \\ & P\left(i,j\right) = 0\ \mathrm{or}\ 1 \end{array} $$
(2)

Here, \( D_i = [x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N] \in R^{m \times (N-1)} \) is the collection of training samples without the \( i \)th sample, \( \beta_i \) is the sparse representation coefficient vector of \( x_i \) over \( D_i \), and \( \lambda \) is a scalar parameter. The two terms in the objective of (2) are the approximation term and the sparsity constraint in the selected feature space, respectively. Problem (2) is a joint optimization over \( P \) and the \( \beta_i \) (\( i = 1, \dots, N \)).

Since \( P \) and the \( \beta_i \) (\( i = 1, \dots, N \)) depend on each other, this problem cannot be solved directly. We therefore update each variable alternately while keeping the others fixed.

By fixing \( \beta_i \) (\( i = 1, \dots, N \)), removing the terms irrelevant to \( P \), and rewriting the first term of (2) in matrix form, the optimization problem (2) reduces to

$$ \begin{array}{rl} \min_P & \mathrm{trace}\left\{ P \Gamma \Gamma^T P^T \right\} \\ \mathrm{s.t.} & \sum_{j=1}^{m} P\left(i,j\right) = 1 \\ & \sum_{i=1}^{n} P\left(i,j\right) \le 1 \\ & P\left(i,j\right) = 0\ \mathrm{or}\ 1 \end{array} $$
(3)

where \( \Gamma = [\gamma_1, \dots, \gamma_N] \) and \( \gamma_i = x_i - D_i \beta_i \).

Under the constraints in (3), suppose \( P(i, k_i) = 1 \); then

$$ \begin{array}{rl} \mathrm{trace}\left\{ P \Gamma \Gamma^T P^T \right\} & = \sum_{i=1}^{n} P\left(i,:\right) \Gamma \Gamma^T P\left(i,:\right)^T \\ & = \sum_{i=1}^{n} \left\{ P\left(i,:\right)\Gamma \right\} \left\{ P\left(i,:\right)\Gamma \right\}^T \\ & = \sum_{i=1}^{n} \sum_{j=1}^{N} \left\{ \Gamma\left(k_i, j\right) \right\}^2 \end{array} $$
(4)

The optimization problem in (3) is converted into computing the sparsity preserving score of each feature, which is defined as

$$ \mathrm{Score}(i) = \sum_{j=1}^{N} \left\{ \Gamma\left(i, j\right) \right\}^2, \quad i = 1, \dots, m $$
(5)

We then rank the scores and select the \( n \) features with the smallest values of \( \mathrm{Score}(i) \), \( i = 1, \dots, m \). Without loss of generality, suppose the \( n \) selected features are indexed by \( k_i^{*} \), \( i = 1, \dots, n \). We can then construct the matrix \( P \) as

$$ P\left(i,j\right) = \left\{ \begin{array}{ll} 1, & j = k_i^{*} \\ 0, & \mathrm{otherwise} \end{array} \right. $$
(6)
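A minimal sketch of the scoring and selection step in (4)–(6), assuming the sparse coefficient vectors \( \beta_i \) have already been computed (for example, by the procedure described next); the function and variable names are illustrative only.

```python
import numpy as np

def sparsity_preserving_selection(X, betas, n_select):
    """Compute Score(i) from the residual matrix Gamma and select features (sketch).

    X: m x N data matrix, one sample per column.
    betas: list of N coefficient vectors; betas[i] has length N-1 and is
        defined over D_i = X with the i-th column removed.
    """
    m, N = X.shape
    Gamma = np.zeros((m, N))
    for i in range(N):
        D_i = np.delete(X, i, axis=1)           # training samples without x_i
        Gamma[:, i] = X[:, i] - D_i @ betas[i]  # gamma_i = x_i - D_i beta_i
    scores = np.sum(Gamma ** 2, axis=1)         # Score(i), Eq. (5)
    selected = np.argsort(scores)[:n_select]    # n smallest scores
    P = np.zeros((n_select, m))                 # Eq. (6)
    P[np.arange(n_select), selected] = 1.0
    return selected, P, scores
```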

By fixing \( P \) and removing the terms irrelevant to \( \beta_i \) (\( i = 1, \dots, N \)), the optimization problem (2) reduces to the following \( \ell_1 \) optimization problem

$$ \min_{\{\beta_i\}_{i=1}^{N}} \sum_{i=1}^{N} \left\| P x_i - P D_i \beta_i \right\|_F^2 + \lambda \left\| \beta_i \right\|_1 $$
(7)

The iterative procedure is given in Algorithm 1. The initial solution of \( \beta_i \) can be calculated directly in the original feature space, and it serves as a good initial solution for the iterative algorithm (Yang et al. 2013).

Note that since the \( P \) obtained in the first iteration is a 0–1 matrix, the features corresponding to \( j \ne k_i^{*} \) have values equal to zero in the second iteration. It is therefore meaningless to compute the coefficient vectors \( \beta_i \) over features whose values are zero; in other words, \( P \) becomes stable after the first iteration. We therefore give a non-iterative version of Algorithm 1, namely Algorithm 2, in which we compute \( \beta_i \) in the original space as

$$ \min_{\beta_i} \left\| x_i - D_i \beta_i \right\|_F^2 + \lambda \left\| \beta_i \right\|_1 $$
(8)

Standard convex optimization techniques, or the truncated Newton interior-point method (TNIPM) of Kim et al. (2007), can be used to solve for \( \beta_i \). In our experiments, we directly use the source code provided by the authors of (Kim et al. 2007).
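If the TNIPM code is unavailable, (8) is a standard Lasso problem and can be approximated with an off-the-shelf solver; the sketch below uses scikit-learn's Lasso as a stand-in, noting that its objective is scaled by \( 1/(2m) \) relative to (8), so its alpha only roughly corresponds to \( \lambda \).

```python
import numpy as np
from sklearn.linear_model import Lasso

def solve_beta(x_i, D_i, lam):
    """Approximate problem (8) with scikit-learn's Lasso (sketch).

    scikit-learn minimizes (1/(2*m)) * ||x_i - D_i beta||^2 + alpha * ||beta||_1,
    so alpha = lam / (2 * m) roughly matches the lambda in (8).
    """
    m = D_i.shape[0]
    model = Lasso(alpha=lam / (2.0 * m), fit_intercept=False, max_iter=10000)
    model.fit(D_i, x_i)
    return model.coef_
```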

Algorithm 1: Iterative procedure for sparsity preserving score

Algorithm 2: Non-iterative procedure for sparsity preserving score
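As a concrete illustration of Algorithm 2, the following is a minimal end-to-end sketch under the assumptions above (samples stored as columns of X, scikit-learn's Lasso as a stand-in for TNIPM, and an arbitrary illustrative value of \( \lambda \)); it is a sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sps_select(X, n_select, lam=0.01):
    """Non-iterative sparsity preserving score (sketch of Algorithm 2).

    X: m x N data matrix, one sample per column.
    Returns the indices of the n_select features with the smallest scores.
    """
    m, N = X.shape
    Gamma = np.zeros((m, N))
    for i in range(N):
        D_i = np.delete(X, i, axis=1)                     # dictionary without x_i
        lasso = Lasso(alpha=lam / (2.0 * m), fit_intercept=False, max_iter=10000)
        lasso.fit(D_i, X[:, i])                           # sparse coding, Eq. (8)
        Gamma[:, i] = X[:, i] - D_i @ lasso.coef_         # reconstruction residual
    scores = np.sum(Gamma ** 2, axis=1)                   # Score(i), Eq. (5)
    return np.argsort(scores)[:n_select]                  # n smallest scores
```

Applying `sps_select` to a training matrix returns the selected feature indices, from which the matrix \( P \) in (6) can be built as sketched earlier.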

Results and discussion

Several experiments on the Yale and ORL face datasets are carried out to demonstrate the efficiency and effectiveness of our algorithm. In our experiments, the samples are not pre-processed. Since our algorithm is unsupervised, we compare Algorithm 2 with four other representative unsupervised feature selection algorithms: data variance, Laplacian score, feature selection for multi-cluster data (MCFS) (Cai et al. 2010), and spectral feature selection (SPEC) (Zhao & Liu 2007) with all the eigenvectors of the graph Laplacian. In all tests, the number of nearest neighbors in Laplacian score, MCFS, and SPEC is set to half the number of training images per person.

For both datasets, we choose the first five and six images per person, respectively, for training and use the rest for testing. After feature selection, recognition is performed by the L2-distance-based 1-nearest-neighbor classifier. Table 1 reports the top performance as well as the corresponding number of selected features, and Fig. 1 illustrates the recognition rate as a function of the number of selected features. As shown in Table 1, our algorithm reaches the highest or a comparable recognition rate at the lowest dimension of the selected feature space. From Fig. 1, we can see that with only a very small number of features, SPS achieves significantly better recognition rates than the other methods. This can be interpreted from two aspects: (1) SPS jointly selects features and obtains the optimal solution for a binary transformation matrix, whereas the other methods only add features one by one; SPS thus accounts for the interaction and dependency among features. (2) Features selected to preserve the sparse reconstructive relationship are capable of enhancing recognition performance.

Table 1 The comparison of the top recognition rates and the corresponding number of features selected
Fig. 1 Recognition results of the feature selection methods with respect to the number of selected features on (a-1, a-2) Yale and (b-1, b-2) ORL

We also randomly choose five and six images per person, respectively, for training and use the rest for testing. Since the training set is randomly chosen, we repeat this experiment ten times and report the average result. The average top performances are reported in Table 2. The results further verify that SPS selects a more informative, sparsity-preserving feature subset.

Table 2 The comparison of average top recognition rates

Conclusions

This paper addresses the problem of how to select features that preserve the sparse reconstructive relationship of the data. In theory, we prove that when the sparse representation vectors are fixed, the selected feature subset is the optimal solution and can be obtained in closed form. Experiments on the ORL and Yale face image databases demonstrate that the proposed sparsity preserving score is more effective than data variance, Laplacian score, MCFS, and SPEC.