A supervised discriminant data representation: application to pattern classification

The performance of machine learning and pattern recognition algorithms generally depends on data representation. For this reason, much of the current effort in deploying machine learning algorithms goes into the design of preprocessing frameworks and data transformations that support effective learning. The method proposed in this work is a hybrid linear feature extraction scheme for supervised multi-class classification problems. Inspired by two recent linear discriminant methods, robust sparse linear discriminant analysis (RSLDA) and inter-class sparsity-based discriminative least square regression (ICS_DLSR), we propose a unifying criterion that retains the advantages of both. The resulting transformation relies on sparsity-promoting techniques both to select the features that most accurately represent the data and to preserve the row-sparsity consistency property of samples from the same class. The linear transformation and the orthogonal matrix are estimated using an iterative alternating minimization scheme based on the steepest descent gradient method and different initialization schemes. The proposed framework is generic in the sense that it allows the combination and tuning of other linear discriminant embedding methods. In experiments conducted on several datasets including faces, objects, and digits, the proposed method outperformed competing methods in most cases.


Introduction
Modern systems of interest based on computer vision, such as driver assistance systems, healthcare, or surveillance systems, may be characterized as high-dimensional systems generally embedded onto low-dimensional manifolds that preserve the intrinsic properties of the original data. Learning good representations of the data that extract and organize the discriminative information is of great interest. It may reduce the memory and computational requirements and, more importantly, tends to improve the performance of classifiers or other predictors. This explains why representation learning is becoming a hot research topic (e.g., [15,16,21,22,34,41-43]).
Among the various ways to learn representations, this work focuses on feature selection and feature extraction. In general, a feature may be relevant, irrelevant, or redundant. A feature is called irrelevant if it does not contribute to the improvement of the predictive model and may degrade the classification accuracy when included in the classification process. Relevant features are those that contribute to a better prediction model and thus to higher classification accuracy; these are the features that the model should extract and select among all others. A redundant feature does not improve the classification performance of the model, because the information it carries is already provided by other features.
F. Dornaika, fadi.dornaika@ehu.eus
Feature extraction can be performed using linear or nonlinear methods. Most feature extraction methods look for a linear transformation that maps the original features to another space from which latent variables can be obtained. In these methods, feature ranking or selection can be imposed by adding an $\ell_{2,1}$-norm constraint on the projection matrix in the global criterion, which has been shown to yield robust and jointly sparse projection matrices [33,35]. For example, algorithms such as robust sparse linear discriminant analysis (RSLDA) [35] have been proposed to address many limitations of the classical linear discriminant analysis (LDA) algorithm [27]; there, the $\ell_{2,1}$ norm is imposed on the regularization term in the global criterion to ensure that the method performs feature ranking and weighting. The $\ell_{2,1}$ norm has also been adopted to construct a supervised model in which the margins of samples from the same class are greatly reduced, while the margins of samples from different classes are significantly increased [37].
Current research focuses on linear projection models that perform feature selection (implicit ranking) and extraction simultaneously [35,44]. In this sense, low-rank representation (LRR) methods provide effective subspace learning representations that have been shown to be more robust by exploring global representative structural information between samples. However, conventional LRR approaches fail to provide a low-dimensional projection of training samples with supervised information. This problem was recently addressed in [13] by a method that integrates the properties of LRR with supervised dimensionality reduction techniques to simultaneously obtain an optimal low-rank subspace and a discriminative projection. Other methods instead use least squares regression frameworks to achieve discriminative feature extraction [37], or extract the optimal projection matrix using the elastic net penalty and fuzzy set theory with the locality-preserving projection technique [31].
In this paper, we present a unified and hybrid discriminant embedding method that retains the strengths of two recent discriminant methods: (i) robust sparse linear discriminant analysis (RSLDA) [35] and (ii) inter-class sparsity-based discriminative least square regression (ICS_DLSR) [37]. The former promotes linear discriminant analysis with implicit feature selection, and the latter promotes inter-class sparsity, which means that the projected features share a common sparse structure for the samples in each class.
The main contributions are as follows. First, we derive a novel objective function for estimating the linear transformation, which is shown to refine the solution of RSLDA (projection matrix Q). Second, we provide an optimization algorithm in which the linear transformation is estimated by a gradient descent method. This allows the initial projection matrix to be set to a hybrid combination of the transformation matrices obtained from the ICS_DLSR [37] and RSLDA [35] methods. Finally, we propose two initialization procedures for the linear transformation, which lead to two variants of the proposed algorithm.
Our approach takes advantage of two powerful discriminative methods at two levels: (1) the initialization of the hybrid linear transformation and (2) the refinement by the proposed single criterion. The proposed method obtains a well-constructed projection space that yields high classification accuracy. It can also be used to tune an already obtained projection matrix. The approach is generic in the sense that any hybrid initial projection matrix can be fed into our algorithm, which then returns a more discriminative projection matrix and thus higher classification performance.
The main difference with existing work on discriminant data representation is the joint use of three key points. First, the projection matrix contains two different types of discriminative features, namely inter-class sparsity and robust LDA. Second, the optimization of the proposed global criterion initializes the projection matrix with a hybrid solution. Finally, a gradient descent approach is used in the optimization scheme to tune this hybrid solution for the projection matrix.
The experiments conducted show that the proposed method resulted in an improvement in classification accuracy in the majority of tested cases and was able to outperform several competing methods. The rest of the paper is organized as follows. Section 2 describes related work and presents the notations used. Section 3 presents the criterion and solution details of the proposed method along with two initialization procedures. The obtained experimental results are presented in Sect. 4. Finally, Sect. 5 concludes the paper.

Related work and notations
In this section, we describe some related works and briefly introduce the gradient descent method and how we use it to obtain a better embedding space by selecting the most relevant features of the data. In addition, we show how the $\ell_{2,1}$ norm [38] and the inter-class sparsity constraint have been used for feature selection and helped in discrimination [26], and we enumerate some recent methods that embed such a constraint in their global criterion to ensure that the method performs feature selection [9,18].

Notations
We start by introducing the notations used in this paper. The training set is denoted by $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}$, where $d$ is the dimension of each sample.
Each sample $x_j \in \mathbb{R}^{d}$ is a column vector with $d$ features. The number of training samples is denoted by $N$, and $C$ is the total number of classes. The Frobenius norm of a matrix $Z \in \mathbb{R}^{d \times N}$ is obtained through $\|Z\|_F = \sqrt{\sum_{i=1}^{d} \sum_{j=1}^{N} z_{ij}^2} = \sqrt{\mathrm{Tr}(Z^T Z)}$, and the $\ell_2$ norm of the vector $z = [z_1, z_2, \ldots, z_d]$ is obtained as follows: $\|z\|_2 = \sqrt{\sum_{i=1}^{d} z_i^2}$. Table 1 shows the main notations used in our paper.

Related work
Recently, many feature extraction methods have been proposed. Some of these methods have built-in constraints that rank or select the features through their projection matrices. Feature selection or ranking is an increasingly active problem in machine learning, since using all data features very often does not lead to high classification performance. Feature selection aims to efficiently select the most relevant features, those that best describe the data and improve discrimination [25,32,39,40]. Feature extraction, on the other hand, aims to derive new sets of features from the original ones. The derived features are usually more discriminative than the original ones. The best known method to tackle the curse of high dimensionality is principal component analysis (PCA) [24]. PCA is an unsupervised feature extraction method that transforms the features of the original data and projects them into a low-dimensional space. The well-known supervised linear discriminant analysis (LDA) [8,27] method requires label information for the data. LDA and its variants are among the most widely used discriminative algorithms in machine learning. LDA estimates a projection matrix such that the projected space maximizes the variance between classes and minimizes the variance within classes. The projection axis $w$ would be the solution to the Fisher criterion [14]:
$$\min_{w} \; w^T (S_w - \lambda S_b)\, w \quad \text{s.t.} \quad w^T w = 1 \tag{1}$$
where $\lambda$ is a small positive constant that balances the effect of the two scatter matrices (within-class scatter matrix $S_w$ and between-class scatter matrix $S_b$), which could be calculated as:
$$S_b = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T \tag{2}$$
$$S_w = \sum_{i=1}^{C} \sum_{x_j \in \text{class } i} (x_j - \mu_i)(x_j - \mu_i)^T \tag{3}$$
where $\mu$ and $\mu_i$ are the mean of all data samples and the mean of the samples of the $i$th class, respectively, and $n_i$ is the number of samples in the $i$th class. Many variants of LDA have been proposed and continue to be proposed (e.g. [5,45,46]), as linear discriminant analysis offers good interpretability of the data.
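As an illustration of the scatter matrices used by LDA, the following NumPy sketch computes $S_w$, $S_b$, and the difference matrix $S_w - \lambda S_b$. It assumes the unnormalized textbook definitions and a column-per-sample data layout; the function names `scatter_matrices` and `lda_difference_matrix` and the default value of `lam` are illustrative choices, not taken from the paper.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_w, S_b) for data X (d x N, one sample per column) and labels y (N,).

    Assumption: unnormalized scatter matrices (no 1/N factor).
    """
    d = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)           # global mean, shape (d, 1)
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[:, y == c]                         # samples of class c
        mu_c = Xc.mean(axis=1, keepdims=True)     # class mean
        centered = Xc - mu_c
        S_w += centered @ centered.T              # within-class scatter
        diff = mu_c - mu
        S_b += Xc.shape[1] * (diff @ diff.T)      # between-class scatter
    return S_w, S_b

def lda_difference_matrix(X, y, lam=1e-3):
    """Difference matrix S = S_w - lam * S_b (lam is a hypothetical default)."""
    S_w, S_b = scatter_matrices(X, y)
    return S_w - lam * S_b
```

With these definitions, $S_w + S_b$ equals the total scatter of the centered data, which gives a quick sanity check for an implementation.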
RSLDA [35] was proposed to address many limitations of classical LDA [27]. RSLDA mainly adds $\ell_{2,1}$ regularization on the projection matrix. This regularization term is added to the global criterion to ensure that the method performs feature ranking and weighting. RSLDA minimizes the following criterion:
$$\min_{P,Q,E} \; \mathrm{Tr}(Q^T S Q) + \lambda_1 \|Q\|_{2,1} + \lambda_2 \|E\|_1 \quad \text{s.t.} \quad X = P Q^T X + E, \; P^T P = I \tag{4}$$
where $Q \in \mathbb{R}^{d \times d}$ and $P \in \mathbb{R}^{d \times d}$ denote the projection matrix and the orthogonal matrix, respectively. $E \in \mathbb{R}^{d \times N}$ is the error matrix. $S \in \mathbb{R}^{d \times d}$ is the difference matrix $(S_w - \lambda S_b)$, and $\lambda_1$ and $\lambda_2$ are two regularization parameters that balance the importance of the different terms. In the criterion of RSLDA, the $\ell_{2,1}$ norm is imposed on the projection matrix to achieve feature selection.

Review of inter-class sparsity least square regression:
In [37], the authors propose the inter-class sparsity-based discriminative least square regression (ICS_DLSR). This method provides a linear mapping to the soft label space, where the dimension of the latent space is set to the number of classes. It constructs a model in which the margins of samples from the same class are greatly reduced, while the margins of samples from different classes are increased. This is achieved by adding a class-wise row-sparsity constraint on the transformed features. ICS_DLSR minimizes the following problem:
$$\min_{Q,E} \; \frac{1}{2}\|Y + E - Q^T X\|_F^2 + \frac{\lambda_1}{2}\|Q\|_F^2 + \lambda_2 \sum_{i=1}^{C} \|Q^T X_i\|_{2,1} + \lambda_3 \|E\|_{2,1} \tag{5}$$
where $X \in \mathbb{R}^{d \times N}$ is the training set with $N$ samples from $C$ classes and $X_i$ collects the samples of the $i$th class. $Y \in \mathbb{R}^{C \times N}$ is a binary label matrix corresponding to the training set. $Q$ is the projection matrix and $E$ denotes the errors. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are three regularization parameters. A similar method is described in [26], where the $\ell_{2,1}$-norm is applied to the transformation of the original linear discriminant analysis.

Proposed method
In this section, we present our problem formulation and the steps applied to solve it. Our method is a linear projection method for feature extraction that aims at finding a more discriminative projection matrix. Two variants of the method are proposed, which differ in the initialization step. The proposed method inherits feature ranking by using the solution of RSLDA as the initial estimate for the sought transformation. The next step is to fine-tune this initial guess by minimizing the proposed criterion with a gradient descent method, which yields the required solution for the projection matrix Q.
The gradient descent algorithm is one of the simplest and most efficient algorithms for solving unconstrained optimization problems. In our algorithm, we have used the gradient descent approach to compute the projection matrix Q and find the solution.

Formulation
The main goal of our approach is to obtain both the projection matrix $Q \in \mathbb{R}^{d \times d}$ and the orthogonal matrix $P \in \mathbb{R}^{d \times d}$ using a single criterion. The main contribution consists of the following objective function:
$$\min_{Q,P} \; \mathrm{Tr}(Q^T S Q) + \lambda_1 \sum_{i=1}^{C} \|Q^T X_i\|_{2,1} + \lambda_2 \|X - P Q^T X\|_F^2 \quad \text{s.t.} \quad P^T P = I \tag{6}$$
where $X_i \in \mathbb{R}^{d \times n_i}$ is the data matrix belonging to the $i$th class, $n_i$ is the number of training samples in the $i$th class, and $C$ is the number of classes. The first term in Eq. (6) is the LDA criterion, where $S$ is the LDA scatter matrix, calculated as $S = S_w - \lambda S_b$, with $S_b$ the between-class scatter matrix and $S_w$ the within-class scatter matrix. These two matrices are given by Eqs. (2) and (3), respectively. This first term provides the first type of data discrimination in the projected space. The second term ensures that the transformed features of the same class share a common sparse structure in the projected space; it provides the second type of data discrimination. In addition, a variant of the PCA constraint is introduced to guarantee that the original data are well recovered. This third term improves the quality of the sample discrimination, as shown in [35]. $\lambda_1$ and $\lambda_2$ are two trade-off parameters that control the importance of the different terms.
It is known that the $\ell_{2,1}$-norm of a matrix $Z$ can be written as:
$$\|Z\|_{2,1} = \mathrm{Tr}(Z^T D Z) \tag{7}$$
where $D$ is a diagonal matrix given by:
$$D_{jj} = \frac{1}{\|Z^{(j)}\|_2} \tag{8}$$
where $Z^{(j)}$ represents the $j$th row of $Z$. By substituting the second term of the criterion by its trace form shown in Eq. (7), with $D_i$ denoting the diagonal matrix of Eq. (8) for $Z = Q^T X_i$, problem (6) can be viewed as:
$$\min_{Q,P} f(Q,P) \;\; \text{s.t.} \;\; P^T P = I, \qquad f(Q,P) = \mathrm{Tr}(Q^T S Q) + \lambda_1 \sum_{i=1}^{C} \mathrm{Tr}(X_i^T Q D_i Q^T X_i) + \lambda_2 \|X - P Q^T X\|_F^2 \tag{9}$$
Equation (9) represents the criterion for the proposed method. Minimizing its first term targets a projection matrix that ensures class discrimination in the sense of linear discriminant analysis (LDA). The second term is introduced to obtain class sparsity: with it, the transformed features from each class obtain a common sparse structure. Finally, the third term is a variant of the principal component analysis constraint [10]; it maintains the energy-preserving property of PCA and contributes to the robustness of the learned representation.
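The trace form of the $\ell_{2,1}$-norm can be checked numerically. The sketch below assumes the convention $D_{jj} = 1 / \|Z^{(j)}\|_2$, under which $\mathrm{Tr}(Z^T D Z)$ equals $\|Z\|_{2,1}$ exactly (some papers use $1 / (2\|Z^{(j)}\|_2)$ instead, which introduces a factor of 2). The function names and the optional `eps` guard are illustrative, not taken from the paper.

```python
import numpy as np

def l21_norm(Z):
    """Sum of the Euclidean norms of the rows of Z."""
    return np.linalg.norm(Z, axis=1).sum()

def reweighting_matrix(Z, eps=0.0):
    """Diagonal D with D_jj = 1 / (||Z^(j)||_2 + eps); eps guards against zero rows."""
    row_norms = np.linalg.norm(Z, axis=1)
    return np.diag(1.0 / (row_norms + eps))

def interclass_sparsity(Q, X, y):
    """Class-wise row-sparsity term: sum over classes of ||Q^T X_i||_{2,1}."""
    return sum(l21_norm(Q.T @ X[:, y == c]) for c in np.unique(y))
```

Rows of $Q^T X_i$ that are driven to zero for every sample of class $i$ contribute nothing to this term, which is what encourages a common sparse structure within each class.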
To find a solution for the proposed method, we use the gradient descent algorithm, an iterative procedure for minimizing a function. It uses the first derivative of the objective function given in Eq. (9): knowing the gradient at the current point, one moves in the opposite direction to decrease the objective and approach a solution. The use of the gradient algorithm has several advantages, the most important of which are:
• It has less computational complexity compared to other methods. Finding the solution by gradient descent is usually less computationally expensive, which results in a faster model.
• It leads to accurate solutions. The gradient algorithm leads to a more accurate solution to the minimization problem than the closed-form solution.

Solution steps to the proposed method
To solve the problem formulated above, we adopted the alternating direction method of multipliers (ADMM) [1] and calculated each variable while the other variables are fixed, as follows:
• Calculate the orthogonal matrix P: P can be calculated by fixing the variable Q and solving the following problem:
$$\min_{P} \; \|X - P Q^T X\|_F^2 \quad \text{s.t.} \quad P^T P = I \tag{10}$$
Using $P^T P = I$ and the fact that the squared norm of a matrix $A$ is given by $\|A\|_F^2 = \mathrm{Tr}(A^T A)$, problem (10) is equivalent to the following maximization:
$$\max_{P} \; \mathrm{Tr}(P^T X X^T Q) \quad \text{s.t.} \quad P^T P = I \tag{11}$$
One can find a solution to problem (11) by performing the singular value decomposition of $X X^T Q$. Suppose the SVD is given by $X X^T Q = U \Sigma V^T$. Then, P is obtained as [46]:
$$P = U V^T$$
• Calculate the projection matrix Q: Gradient descent is an iterative optimization technique that minimizes a function by moving in the direction of steepest descent in each iteration. In machine learning and classification, it is used to iteratively update the parameters of the desired model. We adopted the gradient descent method to compute Q in each iteration of the proposed method as follows. The orthogonal matrix P is fixed. Let us consider the trace form of the criterion of our problem:
$$f(Q) = \mathrm{Tr}(Q^T S Q) + \lambda_1 \sum_{i=1}^{C} \mathrm{Tr}(X_i^T Q D_i Q^T X_i) + \lambda_2 \mathrm{Tr}\big((X - P Q^T X)^T (X - P Q^T X)\big)$$
We calculate the gradient of the objective function w.r.t. Q, with the matrices $D_i$ held fixed, as follows:
$$G = \frac{\partial f}{\partial Q} = 2 S Q + 2 \lambda_1 \sum_{i=1}^{C} X_i X_i^T Q D_i + 2 \lambda_2 \big(X X^T Q - X X^T P\big) \tag{13}$$
Using the gradient matrix, we can update Q by:
$$Q_{t+1} = Q_t - \alpha\, G \tag{14}$$
where $Q_{t+1}$ and $Q_t$ denote the projection matrix Q in iteration $t+1$ and iteration $t$, respectively. The step length (learning rate) is given by $\alpha$.
• Update $D_i$: We update $D_i$ $(i = 1, \ldots, C)$ by:
$$D_i(j,j) = \frac{1}{\|(Q^T X_i)^{(j)}\|_2 + \varepsilon} \tag{15}$$
where $\varepsilon$ is a small positive scalar and $(Q^T X_i)^{(j)}$ represents the $j$th row vector of the matrix $Q^T X_i$.
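The P and Q updates above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the criterion is taken to be $\mathrm{Tr}(Q^T S Q) + \lambda_1 \sum_i \mathrm{Tr}(X_i^T Q D_i Q^T X_i) + \lambda_2 \|X - P Q^T X\|_F^2$ with the $D_i$ held fixed, $S$ is assumed symmetric, and the names `update_P`, `surrogate`, and `grad_Q` are hypothetical.

```python
import numpy as np

def update_P(X, Q):
    """Orthogonal P maximizing Tr(P^T X X^T Q): P = U V^T from the SVD of X X^T Q."""
    U, _, Vt = np.linalg.svd(X @ X.T @ Q)
    return U @ Vt

def surrogate(Q, P, X, S, class_blocks, D_list, lam1, lam2):
    """Trace-form criterion with the diagonal matrices D_i held fixed."""
    val = np.trace(Q.T @ S @ Q)
    for Xi, Di in zip(class_blocks, D_list):
        val += lam1 * np.trace(Xi.T @ Q @ Di @ Q.T @ Xi)
    return val + lam2 * np.linalg.norm(X - P @ Q.T @ X, "fro") ** 2

def grad_Q(Q, P, X, S, class_blocks, D_list, lam1, lam2):
    """Gradient of `surrogate` w.r.t. Q (assumes S symmetric and P^T P = I)."""
    G = 2.0 * S @ Q
    for Xi, Di in zip(class_blocks, D_list):
        G += 2.0 * lam1 * (Xi @ Xi.T) @ Q @ Di
    XXt = X @ X.T
    return G + 2.0 * lam2 * (XXt @ Q - XXt @ P)
```

A finite-difference comparison between `surrogate` and `grad_Q` is a cheap way to validate the analytical gradient before running the full iteration.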
Algorithm 1 summarizes our proposed method and describes the main steps for solving problem (6). In each iteration, the gradient matrix G is calculated using Eq. (13) and then, with P fixed, $Q^{(t+1)}$ is updated using Eq. (14).
The projection of the training and test samples is carried out using the estimated projection matrix Q. This is given by $z_{train} = Q^T x_{train}$ and $z_{test} = Q^T x_{test}$, where $x_{train}$ is a training data sample and $x_{test}$ is a test data sample.
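Putting the steps together, one possible sketch of the alternating scheme and of the final projection is given below. It is an illustration under assumptions, not a transcription of Algorithm 1: the stopping rule is a fixed number of iterations, the reweighting uses $D_i(j,j) = 1 / (\|(Q^T X_i)^{(j)}\|_2 + \varepsilon)$, and the function names and default values (`lam1`, `lam2`, `alpha`, `n_iter`, `eps`) are hypothetical.

```python
import numpy as np

def sda_g(X, y, S, Q0, lam1=0.1, lam2=0.1, alpha=1e-5, n_iter=50, eps=1e-8):
    """Alternate the P update (SVD), one gradient step on Q, and the D_i update.

    X: (d, N) data, y: (N,) labels, S: (d, d) symmetric difference matrix,
    Q0: initial projection matrix (for example an RSLDA or hybrid solution).
    """
    blocks = [X[:, y == c] for c in np.unique(y)]
    XXt = X @ X.T
    Q = Q0.copy()

    def reweight(Q):
        # One diagonal matrix per class, built from the rows of Q^T X_i
        return [np.diag(1.0 / (np.linalg.norm(Q.T @ Xi, axis=1) + eps)) for Xi in blocks]

    D = reweight(Q)
    P = None
    for _ in range(n_iter):
        # P step: orthogonal Procrustes solution from the SVD of X X^T Q
        U, _, Vt = np.linalg.svd(XXt @ Q, full_matrices=False)
        P = U @ Vt
        # Q step: one gradient descent update with P and D_i fixed
        G = 2.0 * S @ Q + 2.0 * lam2 * (XXt @ Q - XXt @ P)
        for Xi, Di in zip(blocks, D):
            G += 2.0 * lam1 * (Xi @ Xi.T) @ Q @ Di
        Q = Q - alpha * G
        # D step: refresh the reweighting matrices
        D = reweight(Q)
    return Q, P

def project(Q, X):
    """Project samples stored as columns of X: z = Q^T x."""
    return Q.T @ X
```

The step length `alpha` has to be small relative to the scale of the data for the gradient step to decrease the criterion, which is consistent with the small values of $\alpha$ used in the experiments.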

Initialization of projection matrix Q
The linear transformation Q needs a good initial estimate, since it is estimated by a gradient descent update rule. In this section, we present two initialization procedures that lead to two variants of the proposed algorithm.

Using RSLDA algorithm
In this variant, the initial estimate $Q^{(0)}$ for the linear projection matrix Q is given by the solution of the RSLDA [35] method (solved by a separate ADMM optimization). RSLDA provides implicit feature selection by imposing the $\ell_{2,1}$ norm on the sought projection matrix. Moreover, the introduction of the error matrix helps in tracking and modeling the random noise. These terms allow RSLDA to obtain a discriminative and efficient transformation. The solution of our proposed method is computed using the gradient approach, which requires a good initial estimate to ensure good performance. By adopting the solution derived from the RSLDA method, our proposed variant inherits the advantages of this method. Figure 1 describes the initialization process using the projection matrix provided by RSLDA.
Fig. 1 Graphical illustration of the proposed data representation framework. The transformations provided by RSLDA and ICS_DLSR are combined into a hybrid projection matrix $Q_{Hybrid}$, which serves as input to our proposed algorithm. After minimizing the global criterion given in Eq. (9), we obtain the optimal projection matrices Q and P. The final data representation is obtained using the projection matrix Q
Fig. 2 Combined initialization using the linear embeddings derived from ICS_DLSR and RSLDA

Hybrid combination of projection matrices obtained from the two embedding methods RSLDA and ICS_DLSR
In the second variant of our proposed algorithm, the initial projection matrix $Q^{(0)}$ is set to a hybrid combination of the transformation matrices obtained by the two embedding methods RSLDA [35] and ICS_DLSR [37]. Let the number of rows of the hybrid transform $Q^{(0)}$ be $d$. The number of columns (projection axes), on the other hand, can be set to any arbitrary value. Without loss of generality, to be consistent with linear methods, we assume that the total number of columns of $Q^{(0)}$ is $d$. Thus, $Q^{(0)} \in \mathbb{R}^{d \times d}$.
According to [37], the linear transformation $Q_{ICS\_DLSR}$ obtained by the ICS_DLSR algorithm belongs to $\mathbb{R}^{d \times C}$, where $d$ and $C$ represent the dimension of the features and the number of classes, respectively. On the other hand, the RSLDA method [35] has its own linear transformation $Q_{RSLDA} \in \mathbb{R}^{d \times d}$. The sought initial hybrid projection matrix $Q^{(0)}$ used in our algorithm is denoted by $Q_{Hybrid}$. It is constructed by taking all $C$ columns of $Q_{ICS\_DLSR}$, to which the first $d - C$ columns of $Q_{RSLDA}$ are appended. The resulting projection matrix $Q_{Hybrid}$ belongs to $\mathbb{R}^{d \times d}$. The strategy for the hybrid initialization is shown in Fig. 2.
In the above construction of the hybrid matrix $Q_{Hybrid}$, the number of projection axes for each projection type is fixed to $C$ and $d - C$ for ICS_DLSR and RSLDA, respectively. We emphasize that these dimensions can be changed.
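The hybrid construction amounts to a column-wise concatenation. The following sketch assumes that both input matrices have already been estimated elsewhere; the function name `hybrid_init` and the argument names are illustrative.

```python
import numpy as np

def hybrid_init(Q_ics_dlsr, Q_rslda):
    """Stack all C columns of Q_ics_dlsr (d x C) with the first d - C columns of Q_rslda."""
    d, C = Q_ics_dlsr.shape
    if Q_rslda.shape[0] != d:
        raise ValueError("both transformations must have d rows")
    if C > d or Q_rslda.shape[1] < d - C:
        raise ValueError("not enough columns to build a d x d hybrid matrix")
    return np.hstack([Q_ics_dlsr, Q_rslda[:, : d - C]])
```

Changing the split between the two sources only requires slicing a different number of columns from each matrix before stacking.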
In our experiments, according to Table 2, we can see that the value of C that represents the number of classes varies between 10 and 50 for the datasets used. The number of features for each dataset, d, is also shown in the same table.

Computational complexity
In this section, the computational complexity of the proposed method is analyzed (see Algorithm 1). From [35], we know that the computational complexity of the RSLDA method is $O(\tau(d^2 N + 4d^3))$, where $\tau$ denotes the number of iterations of RSLDA, $N$ denotes the number of samples, and $d$ denotes their dimension.
Computational cost of the first variant. The matrices Q and P have to be computed. The orthogonal matrix P requires a singular value decomposition. The computational cost of the decomposition of a $d \times N$ matrix would be $O(N^3)$. Q is computed in the second step of the procedure, which requires the computation of the corresponding gradient matrix; since these two steps consist only of simple matrix operations, they have a low computational cost and can therefore be ignored. Likewise, the update of $D_i$ in Eq. (15) is a simple matrix operation with very low cost. Thus, the cost of one iteration of Algorithm 1 is $O(N^3)$. Assuming that $\tau'$ represents the number of iterations of the proposed iterative scheme (Algorithm 1), the cost of Algorithm 1 is $O(\tau' N^3)$. In the first variant of our proposed method, the RSLDA method is used to initialize the projection matrix Q before feeding it to our algorithm. Therefore, the complexity of the RSLDA method should be added to the complexity of our proposed method.
In summary, the total cost of the first variant is the sum of the RSLDA cost and the cost of our proposed method, which equals $O(\tau(d^2 N + 4d^3) + \tau' N^3)$.

Performance study
To test both variants of our proposed method, we conducted experiments on several datasets including faces, objects, and handwritten digits. Detailed information on these datasets is presented in this section. We then present the experimental setup and the results obtained.

Datasets
In our work, we conducted our experiments on the following five public datasets in addition to a large-scale dataset: the USPS digits dataset, the Honda dataset, the COIL20 object dataset, the Extended Yale B face dataset, the FEI dataset, and the large-scale MNIST dataset consisting of 60,000 images. In addition, the synthetic Tetra dataset [29,30] is used. This dataset consists of 400 data points belonging to four classes. The data points are in $\mathbb{R}^3$, and this dataset presents the challenge associated with low inter-cluster distances. Table 2 presents a summary of the information concerning the datasets used in our paper.

Results
As already reported, the proposed method has two variants, namely:
• Supervised discriminant analysis using gradient technique (SDA_G_1): In this variant, the initial projection matrix $Q^{(0)}$ is set to the output of the RSLDA [35] algorithm, as presented in Sect. 3.3.1.
• Supervised discriminant analysis using gradient via combined initialization (SDA_G_2): The second variant of the proposed method initializes the projection matrix $Q^{(0)}$ as a hybrid combination of the solutions derived from the RSLDA [35] and ICS_DLSR [37] methods. The construction of the initial transformation is shown in Fig. 2 and detailed in Sect. 3.3.2.
The proposed variants have been compared with the following methods: K-nearest neighbors (KNN) [12], support vector machines (SVM) [3] (the linear SVM was implemented using the LIBSVM library), linear discriminant analysis (LDA) [27], local discriminant embedding (LDE) [4], PCE [20] (an unsupervised method), ICS_DLSR [37], and robust sparse LDA (RSLDA) [35]. All experiments for all compared methods were conducted under the same conditions to guarantee a fair comparison. For each compared embedding method, the whole dataset is randomly split into a training part and a test part.
First, for each compared method, a projection matrix is estimated from the training part. Then, training and test data are projected onto the new space using the already computed transformation. Finally, the classification of the test data is performed using the nearest-neighbor classifier (NN) [6].
Fig. 3 Some images of datasets
Different sizes of training sets were used. Moreover, for a given percentage of training data, the whole evaluation is repeated 10 times. That means that we adopt 10 random splits for every configuration and report the average recognition rate (rate of correct classification of test part) over these 10 random splits.
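The evaluation protocol described above (random splits, projection, nearest-neighbor classification, averaging over repetitions) can be sketched as follows. This is an illustrative outline under assumptions: the split is drawn per class, `fit` stands for any hypothetical routine that returns a projection matrix from the training part, and the function names are not taken from the paper.

```python
import numpy as np

def nn_classify(Z_train, y_train, Z_test):
    """1-nearest-neighbor labels for test columns, using Euclidean distance."""
    d2 = (
        np.sum(Z_test ** 2, axis=0)[:, None]
        + np.sum(Z_train ** 2, axis=0)[None, :]
        - 2.0 * Z_test.T @ Z_train
    )
    return y_train[np.argmin(d2, axis=1)]

def average_recognition_rate(X, y, fit, train_fraction=0.5, n_splits=10, seed=0):
    """Average test accuracy over random per-class train/test splits."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_splits):
        train_mask = np.zeros(y.size, dtype=bool)
        for c in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == c))
            n_train = max(1, int(round(train_fraction * idx.size)))
            train_mask[idx[:n_train]] = True
        Q = fit(X[:, train_mask], y[train_mask])          # projection from training part only
        Z_train = Q.T @ X[:, train_mask]
        Z_test = Q.T @ X[:, ~train_mask]
        pred = nn_classify(Z_train, y[train_mask], Z_test)
        rates.append(np.mean(pred == y[~train_mask]))
    return float(np.mean(rates))
```

Estimating the projection from the training part only, as done here, keeps the test samples unseen until classification.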
We used PCA [24] as a preprocessing technique for dimensionality reduction, preserving 100% of the data's energy. The step length $\alpha$ should be set to a small value. In our experiments, this value was chosen in $\{10^{-7}, 10^{-5}\}$.
The obtained results are summarized in Table 3. This table reports the classification performance of the proposed variants and of the competing methods on the USPS, Honda, FEI, and COIL20 datasets. The results are obtained with different training and testing percentages of the data, using the nearest-neighbor classifier.
Table 4 presents the classification performance obtained on the Extended Yale B dataset for various training percentages, corresponding to different numbers of samples used in the training process. More competing methods are included in Table 4, namely ELDE, SULDA, and MPDA, to enrich the comparison. To further extend the comparison on the Extended Yale B dataset, we have also added methods based on low-rank representations: low-rank linear regression (LRLR) [2], low-rank ridge regression (LRRR) [2], sparse low-rank regression (SLRR) [2], and low-rank preserving projection via graph regularized reconstruction (LRPP_GRR) [36]. The results of the low-rank-based methods can be found in the bottom part of Table 4. The reported rates are averages over 10 random splits and correspond to different numbers of training samples. The first column of the table gives the number of training images per class.
Table 5 reports the classification performance of the competing methods and of the proposed variants on the large-scale MNIST dataset, which contains 60,000 images in total. The results in this table are obtained using one split, with 1000 samples from each class for training and the remaining samples for testing. Table 6 reports the classification performance obtained on the News20 text dataset. The results presented in this table are the mean classification rates over 10 splits, using 20% and 30% of the data samples from each class for training and the remaining samples for testing.
Figure 5 presents the recognition rate (%) of LDA [27], LDE [4], and RSLDA [35] in addition to the two proposed variants of our method. The recognition rate is plotted as a function of the dimension of the projected features. The results are shown for (a) the COIL20 dataset, (b) the Extended Yale B dataset, and (c) the Honda dataset, with 30, 10, and 10 samples from each class used for training, respectively. The reported results were obtained using the nearest-neighbor (NN) classifier.
For the statistical analysis of our proposed method, we used the results of 21 evaluations on six different datasets from the experiments conducted in this paper. We performed the Friedman test [7] and computed the critical distance CD. The test leads to the conclusion that the tested methods do not have the same performance. Figure 4 shows the CD diagram for the nine methods, including our two proposed variants, where the average rank of each method is marked along the axis.
Experiments using synthetic data: In addition to the image datasets, we also conducted some experiments on the synthetic Tetra dataset [28]. This dataset consists of 400 data points belonging to four classes. The original data points of this dataset are in $\mathbb{R}^3$, but in our experiments the dimension was augmented to 100, so that each data sample is represented by 100 features. The three-dimensional dataset is transformed into a high-dimensional dataset in $\mathbb{R}^{100}$ using a random projection matrix.
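The dimension augmentation can be illustrated with a random linear map. The sketch below assumes a Gaussian random matrix, since the distribution of the random projection matrix is not specified here; the function name `lift_dimension` and its arguments are illustrative.

```python
import numpy as np

def lift_dimension(X_low, target_dim=100, seed=0):
    """Map samples stored as columns of X_low (k x N) to target_dim features.

    Assumption: the random matrix has i.i.d. standard normal entries.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((target_dim, X_low.shape[0]))
    return R @ X_low
```

Because the map is linear, the lifted data still lie in a three-dimensional subspace of $\mathbb{R}^{100}$, so the cluster structure of the original points is retained up to a linear distortion.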
This dataset was chosen because it presents the challenge associated with low inter-cluster distances: the distance between the clusters is minimal. The data points of Tetra are visualized in Fig. 6. One can see that the clusters nearly touch each other. Figure 7 illustrates the t-SNE visualization of the projected samples of the Tetra dataset using the original linear discriminant analysis (LDA), RSLDA, and the first variant of our method, SDA_G_1. The figure shows that our method provides very good class separation and leads to the most compact representation among the compared methods. The proposed method thus performs well when applied to datasets with low inter-cluster distances.

Analysis of parameter sensitivity
In this section, we investigate how the parameters of the proposed method affect the classification performance on different datasets. The proposed method has two main parameters to configure, namely $\lambda_1$ and $\lambda_2$. Figure 8 shows the variation of the classification performance for different parameter combinations on the Extended Yale B and Honda datasets. Figures 8a and 8c show the classification rates obtained with the first variant of the proposed method, SDA_G_1, on the Extended Yale B and Honda datasets when using 10 and 20 training samples from each class, respectively. The classification performance of the second variant, SDA_G_2, was studied on the same datasets with the same training settings. The corresponding results are depicted in Figs. 8b and 8d.
For the Extended Yale B dataset, we monitored the classification performance of both proposed variants for different values of $\lambda_1$ and $\lambda_2$, which were varied over the ranges $[10^{-5}, 1]$ and $[10^{-3}, 10]$, respectively. Satisfactory rates were obtained when $\lambda_1$ was chosen from the range $[10^{-3}, 10^{-1}]$ and $\lambda_2$ from the range $[10^{-2}, 10^{-1}]$.
Similarly, we studied the classification performance of the proposed schemes on the Honda dataset. We varied $\lambda_1$ over the range $[10^{-3}, 10^{3}]$ and $\lambda_2$ over the range $[10^{-4}, 10^{3}]$. Satisfactory rates were obtained when $\lambda_1$ was chosen from the range $[10^{-1}, 10]$ and $\lambda_2$ from the range $[10^{-3}, 10^{2}]$. We conclude that the parameters $\lambda_1$ and $\lambda_2$ should lie in the intervals identified above for the proposed method to give satisfactory results. A value of 0.1 for both $\lambda_1$ and $\lambda_2$ seems to be a good choice for the two variants and the two datasets.
Fig. 8 Variation of the classification performance with the parameters $\lambda_1$ and $\lambda_2$ for the variants of the proposed method on the Extended Yale B and Honda datasets, using 10 and 20 samples from each class for training, respectively, and the rest for testing
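A sensitivity study of this kind amounts to a search over a logarithmic grid of the two parameters. The sketch below shows one way to organize it. The names `log_grid`, `grid_search`, and `toy_evaluate` are hypothetical, and `toy_evaluate` is an artificial score that stands in for "train the projection, classify the test samples, return the rate". It is constructed to peak at 0.1 only so that the search has something to find and does not reproduce any reported result.

```python
import itertools
import math

def log_grid(low_exp, high_exp):
    """Powers of ten from 10**low_exp to 10**high_exp, inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]

def grid_search(evaluate, lam1_values, lam2_values):
    """Return (best_score, best_lam1, best_lam2) over all parameter pairs."""
    best = (-math.inf, None, None)
    for lam1, lam2 in itertools.product(lam1_values, lam2_values):
        score = evaluate(lam1, lam2)
        if score > best[0]:
            best = (score, lam1, lam2)
    return best

def toy_evaluate(lam1, lam2):
    # Hypothetical stand-in for a real train/test run; artificial score
    # that decreases with the log-distance of each parameter from 0.1.
    return 1.0 - 0.05 * (abs(math.log10(lam1) + 1) + abs(math.log10(lam2) + 1))

lam1_values = log_grid(-5, 0)  # 1e-5 ... 1, the first range used for Extended Yale B
lam2_values = log_grid(-3, 1)  # 1e-3 ... 10, the second range used for Extended Yale B
best_score, best_lam1, best_lam2 = grid_search(toy_evaluate, lam1_values, lam2_values)
print(best_score, best_lam1, best_lam2)
```

In practice `toy_evaluate` would be replaced by a function that fits the transformation on the training split and returns the classification rate on held-out samples.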

Analysis of results
From our analysis of the experiments conducted, we can make the following observations:
1. The classification results obtained by the proposed method and the competing methods show that our approach outperformed the competing methods in the majority of cases.
2. The first proposed variant, SDA_G_1, slightly outperformed the RSLDA method. This is to be expected, since the first variant mainly provides a fine-tuning of the RSLDA transformation.
3. In general, the second proposed variant, SDA_G_2, is superior to the first one, SDA_G_1. This is explained by the fact that the second variant benefits from the hybrid combination of two different powerful embedding methods as well as from the refinement provided by gradient descent.
4. The proposed method achieved superior performance on several types of image datasets, including faces, objects, and digits. Our approach also demonstrated superior performance on a text dataset.
5. The proposed method showed superior performance and led to very good class separation when applied to datasets with low inter-cluster distances.
6. The parameter values that give the best classification performance span wide ranges. In other words, the best classification performance can usually be reached by searching only a small number of parameter combinations.
7. The competing method ICS_DLSR performed better than our proposed method in one particular case on the COIL20 dataset, when using 20 images from each class for training. On the other hand, the proposed method generally outperformed it for all other training percentages on the same dataset.
8. When the hybrid initialization was used in our algorithm, we adopted a combination of the two best-tuned transformation matrices obtained from the two methods RSLDA and ICS_DLSR as the initial transformation. In the majority of the tested cases, this led to a noticeable improvement in classification performance. Here, the two best-tuned transformation matrices are those computed by the two methods with the parameter combination that gives each method its best performance. It is worth noting that this combination of the two tuned transformation matrices is not necessarily the best option in our framework, and other combinations may lead to better discrimination. Thus, the classification performance obtained with the second variant of our approach (Table 3) could be further improved if other combinations are used for the initialization.
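The last observation notes that other combinations of the two transformations could serve as the initial solution. As a purely illustrative sketch, one simple family of such combinations is a convex mix controlled by a weight. The function name `hybrid_initialization`, the weight `alpha`, and the convex form itself are assumptions made here for illustration and should not be read as the combination used in the experiments.

```python
import numpy as np

def hybrid_initialization(q_rslda, q_ics_dlsr, alpha=0.5):
    """Convex combination of two transformation matrices of identical shape."""
    if q_rslda.shape != q_ics_dlsr.shape:
        raise ValueError("both transformations must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * q_rslda + (1.0 - alpha) * q_ics_dlsr

# Random placeholders standing in for the two tuned transformations.
rng = np.random.default_rng(0)
q_a = rng.standard_normal((100, 4))  # hypothetical RSLDA output
q_b = rng.standard_normal((100, 4))  # hypothetical ICS_DLSR output
q_init = hybrid_initialization(q_a, q_b, alpha=0.3)
print(q_init.shape)  # (100, 4)
```

Under this assumption, `alpha` becomes one more quantity that could be tuned on validation data, alongside the parameters of the criterion.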

Conclusion
In this work, we introduced a novel criterion to obtain a discriminant linear transformation. This transformation efficiently integrates two different mechanisms of discrimination, namely inter-class sparsity and robust discriminant analysis. We deployed an iterative alternating minimization scheme to estimate the linear transformation and the orthogonal matrix associated with the robust LDA. In the alternating optimization, the linear transformation is efficiently updated via the steepest descent gradient technique.
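The structure of such an alternating scheme can be illustrated on a deliberately simplified problem. The sketch below minimizes only a reconstruction term, $\|X - PQ^{T}X\|_F^2$ with $P^{T}P = I$, and omits the discriminant and sparsity terms of the actual criterion, so it is not the proposed algorithm. The symbols `P` (orthogonal matrix) and `Q` (linear transformation), the fixed step size, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def objective(X, P, Q):
    """Reconstruction error ||X - P Q^T X||_F^2."""
    return np.linalg.norm(X - P @ (Q.T @ X), "fro") ** 2

def update_orthogonal(X, Q):
    """Exact minimizer over P subject to P^T P = I (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X @ (X.T @ Q), full_matrices=False)
    return U @ Vt

def gradient_step(X, P, Q, step):
    """One steepest-descent step on Q with P held fixed."""
    residual = X - P @ (Q.T @ X)
    grad = -2.0 * X @ (residual.T @ P)
    return Q - step * grad

def alternate(X, Q0, n_iter=50):
    """Alternate the closed-form P update with a gradient step on Q."""
    Q = Q0.copy()
    # Step 1/L, where L = 2 * sigma_max(X)^2 bounds the curvature in Q.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    history = []
    for _ in range(n_iter):
        P = update_orthogonal(X, Q)
        Q = gradient_step(X, P, Q, step)
        history.append(objective(X, P, Q))
    return P, Q, history

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 200))  # 20 features, 200 synthetic samples
Q0 = rng.standard_normal((20, 5))   # hypothetical initial transformation
P, Q, history = alternate(X, Q0)
print(history[0], history[-1])
```

With this step size, neither update can increase the objective, so the recorded values are non-increasing. In the full criterion, the gradient would also include the contributions of the discriminant and sparsity terms.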
We proposed two initialization variants for the linear transformation. The first variant sets the initial solution to the linear transformation obtained by the robust sparse LDA (RSLDA) method. The second variant initializes the solution to a hybrid combination of the two transformations obtained by the RSLDA and ICS_DLSR methods. The two variants of the proposed method demonstrated superiority over competing methods and led to a more discriminative projection matrix, and hence to better classification performance. The main difference from existing work on discriminant data representation is the joint use of three key points. First, the projection matrix contains two different types of discriminative features, namely inter-class sparsity in addition to robust LDA. Second, the optimization of the proposed global criterion initializes the projection matrix with a hybrid solution. Finally, a gradient descent approach is used in the optimization scheme to tune this hybrid solution for the projection matrix.
The proposed framework is generic in the sense that it allows the combination and tuning of other linear discriminant embedding methods.
Deep learning can provide a powerful data representation or classifier. However, a deep learning model may need a large number of examples to provide a good data representation, whereas this paper is about building a shallow discriminant model that can be trained with a few examples, e.g., a few dozen images. It is worth mentioning that the presented projection method can also work with deep features, in the sense that the image descriptors can be provided by a deep model and the projection by the proposed method. In other words, the proposed model can be used as a projection head of a pre-trained deep neural network.

Declarations
Conflict of Interest Statement The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.