1 Introduction

Multi-label learning deals with problems where each sample is represented by a feature vector and associated with multiple concepts or semantic labels. For example, in image annotation, an image may be annotated with several scenes, and in text categorisation, a document may be tagged with multiple topics. Formally, a data matrix \(X \in \mathbb {R}^{d \times n}\) is composed of n samples in a d-dimensional feature space. Each feature vector \(x_{i} \in X\) is associated with a label set \(Y_{i}=\left\{ y_{i1},y_{i2},\cdots ,y_{ik} \right\} \), where k is the number of labels and \(y_{ij} \in \left\{ 0,1 \right\} \) is a logical value indicating whether the j-th label is relevant to the instance \(x_{i}\). Over the past decade, many strategies have been proposed in the literature to learn from multi-labelled data. Initially, the problem was tackled by learning a binary classification model for each label independently [21]. However, this strategy ignores label correlation. Several methods [3, 17] have since shown that considering label correlation during multi-label learning improves classification performance. However, these methods use logical labels, in which no manifold exists, and apply traditional similarity metrics such as the Euclidean distance, which is mainly built for continuous data [12].

In this study, contrary to most methods, in addition to learning the mapping function from the feature space to the multi-label space, we explore the projection function from the label space back to the feature space to reconstruct the original feature representations. In image annotation, for example, our method can reconstruct the scene image from the projection function and the semantic labels. Doing so first requires exploring the natural structure of the label space in multi-labelled data. Existing datasets naturally contain logical label vectors indicating whether an instance is relevant to a specific label. For example, as shown in Fig. 1, both images are tagged with the label boat with the same weight of 1 (present). However, to describe the labels in both images accurately, we need to identify the importance of each label in each image: the label boat is clearly more important in the image in Fig. 1b than in the image in Fig. 1a. Furthermore, a zero value in the logical label vector can carry different meanings: the label may be irrelevant, unrepresented or missing. Using the same example in Fig. 1, the boat label in the image in Fig. 1a and the sun label in both images are not tagged because of their small contribution (unrepresented). Our method learns a numerical multi-label matrix in a semantic embedding space during optimisation, based on label dependencies. Replacing logical labels with numerical values that capture label importance can therefore improve the multi-label learning process.

Learning these numerical labels is essential to our approach, which is developed on the encoder–decoder deep learning paradigm [15]. Specifically, as an encoder step, the input training data in the feature space are projected into the learned semantic label space (the label manifold). Through a single optimisation problem, this step simultaneously learns the projection function and the semantic labels in Euclidean space. Significantly, we treat the reconstruction of the original feature representations, with the projection matrix as input, as a decoder. This imposes a constraint ensuring that the projection matrix preserves all the information in the original feature matrix: the decoder recovers the original features from the projection matrix and the learned semantic labels. In image annotation, this process is akin to combining puzzle pieces to recreate the picture. With logical labels, where a label either exists or does not, the original visual feature representations cannot be reconstructed. We show that the decoder's role in identifying relevant features improves multi-label classification performance, because the feature coefficients are estimated from actual numerical labels and, more importantly, the weights of the relevant features reduce the reconstruction error. The proposed method is visualised in Fig. 2. We test the proposed approach on various public multi-label datasets and verify that it favourably outperforms state-of-the-art methods in feature selection and data reconstruction.

We formulate the proposed approach as a constrained optimisation problem that projects feature representations into semantic labels under a reconstruction constraint. More precisely, the encoder and decoder are formulated as linear projections to and from the learned semantic labels. This design keeps the computational complexity low, making the approach suitable for large-scale datasets. To the best of our knowledge, this is the first attempt to learn a semantic label representation from the training data that can also be used for data reconstruction in multi-label learning. In summary, our contributions are: (1) a semantic encoder–decoder model that learns the projection matrix from original features to semantic labels and can be used for data reconstruction; (2) an extension of logical labels to numerical labels that describe the relative importance of each label for a specific instance; and (3) a novel Robust Multi-label Feature Learning based Dual Space (RMFDS) method that identifies discriminative features across multiple class labels.

Fig. 1 Two images annotated with labels: sea, sunset, and boat

2 Related work

Multi-label classification aims to predict the set of labels corresponding to each instance. It has been widely applied in several domains, including image recognition, and several methods have recently been proposed to predict image labels [18, 22]. This section briefly reviews work on label correlation, semantic labels, and autoencoders.

Label correlation   Over the past decade, label correlation has repeatedly been shown to improve the performance of multi-label learning methods. Correlation is considered either between pairs of class labels or among all class labels, known as second-order and high-order approaches, respectively [6, 23]. Recently, a method has been proposed to learn common label-specific features using correlation information from labels and instances [14]. Another study proposes feature selection based on label correlations [8]. In these models, however, the common learning strategy deals with logical labels, which only represent whether a label is relevant or irrelevant to an instance. The label matrix in the available multi-labelled datasets contains logical values that lack semantic information. Hence, a few works reveal that transforming the labels from logical into numerical values improves the learning process.

Semantic labels   A numerical value in the label space carries semantic information; for instance, the value may refer to the importance or the weight of an object in an image. The numerical label matrix in Euclidean space is not explicitly available in multi-labelled data. A few works have studied the multi-label manifold by transforming the logical label space into a Euclidean label space. For example, [9] explores the label manifold in multi-label learning and reconstructs the numerical label matrix using the instance smoothness assumption. Another work [4] incorporates feature manifold learning into multi-label feature selection, and [10] selects meaningful features using the constraint Laplacian score in manifold learning. Our proposed method differs from these by learning an encoder–decoder network that reconstructs the input data using the learned projection matrix while predicting the semantic labels.

Autoencoder   Several variants use an autoencoder for multi-label learning. [7] learns the unknown labels from the existing ones using an entropy measure, and the completed label matrix is then used as the input feature layer of an autoencoder architecture; our method, by contrast, reconstructs the original input data from the learned semantic labels in the decoder. Further, [13] proposes a stacked autoencoder for feature encoding together with an extreme learning machine to improve prediction capability; however, the authors did not take label correlation into consideration, and the original logical labels are used during learning. Another study learns low-dimensional manifolds that capture nonlinear features using autoencoders [11], and a denoising autoencoder has been used to cast medical image annotation as multi-label classification [5]. In this paper, we select the discriminative features that are important for detecting the objects' weights during the encoding phase and, simultaneously, significant for reconstructing the original data in the decoding phase.

3 The proposed method

In multi-label learning, as mentioned above, the training set of multi-labelled data can be represented by \(\left\{ x_{i} \in X \mid i=1,\cdots ,n \right\} \), where each instance \(x_{i} \in \mathbb {R}^{d}\) is a d-dimensional feature vector associated with the logical label vector \(Y_{i}=\left\{ y_{i1},y_{i2},\cdots ,y_{ik} \right\} \); here k is the number of possible labels, and the values 0 and \(+1\) indicate that a label is irrelevant or relevant to the instance \(x_{i}\), respectively.

3.1 Label manifold

To overcome the key challenges of logical label vectors, we first propose to learn a new numerical label matrix \(\widetilde{Y} \in \mathbb {R}^{k \times n}\) that contains labels with semantic information. Following the label smoothness assumption [19], which states that if two labels are semantically similar then their feature vectors should be similar, we initially exploit the dependencies among labels to learn \(\widetilde{Y}\) by multiplying the label correlation matrix \(C \in \mathbb {R}^{k \times k}\) with the original label matrix. Because the original label matrix contains logical values, we use the Jaccard index to compute the correlation matrix, which gives

$$\begin{aligned} \widetilde{Y}=CY \end{aligned}$$
(1)

where the element \(\widetilde{Y}_{j,i}=C_{j,1} \times Y_{1,i} + C_{j,2} \times Y_{2,i} + \cdots + C_{j,k} \times Y_{k,i}\) is the initial predictive numerical value of how relevant the j-th label is to the instance \(x_{i}\), based on the prior information of label dependencies. The following simple example illustrates the efficiency of using label correlation to learn semantic numerical labels. The original logical label vectors Y of the images in Fig. 1a and b are shown in Fig. 2. The zero values in the original label matrix carry three different types of information: the grey and red colours in Fig. 2 mark unrepresented and missing labels, respectively, while the white colour means that the images in Fig. 1a and b are simply not labelled as "Grass". The predictive label space \(\widetilde{Y}\) distinguishes between these three types of zero values and provides appropriate numerical values that carry semantic information. For example, owing to the correlations between "Boat" and "Ocean" and between "Sunset" and "Sun", the unrepresented label information for "Boat" and "Sun" in the image in Fig. 1a and for "Sun" in the image in Fig. 1b is learned. Further, the missing "Ocean" label of the image in Fig. 1b is predicted. Interestingly, the numerical values of the "Grass" label in both images are very small because "Grass" is not correlated with the other labels; this matches the information in the original label matrix that no "Grass" object exists in either image, as shown in Fig. 2. This example demonstrates that accurate numerical labels with semantic information can be learned. The completed numerical label matrix \(\widetilde{Y}\) of the training data is then refined through the optimisation of the encoder–decoder framework in the next section.
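To make this initialisation concrete, below is a minimal Python/NumPy sketch (the paper's own implementation is in MATLAB; the exact Jaccard normalisation is our assumption, since the paper names the index without spelling out the formula):

```python
import numpy as np

def jaccard_correlation(Y):
    """Jaccard index between every pair of rows (labels) of the logical
    label matrix Y (k x n): |intersection| / |union| of the instance
    sets tagged with each label."""
    inter = Y @ Y.T                            # k x k co-occurrence counts
    counts = Y.sum(axis=1, keepdims=True)      # number of instances per label
    union = counts + counts.T - inter
    return inter / np.maximum(union, 1e-12)    # guard against empty labels

# Toy logical label matrix: 3 labels x 2 instances.
Y = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
C = jaccard_correlation(Y)
Y_tilde = C @ Y    # Eq. 1: initial numerical label matrix (k x n)
```

Since the Jaccard matrix C is symmetric, \(CY\) simply re-weights each instance's logical labels by the correlations among all k labels.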

Fig. 2 Encoder–decoder architecture of our proposed method RMFDS. A predicted numerical label matrix \(\widetilde{Y}\) is initialised by multiplying the correlation matrix with the original logical label matrix Y. The zero values of the images in red, grey and white boxes represent missing, unrepresented, and irrelevant labels, respectively

3.2 Approach formulation

Suppose the training data \(X \in \mathbb {R}^{d \times n}\) consist of n samples associated with the labels \(Y \in \mathbb {R}^{k \times n}\). The predictive numerical matrix \(\widetilde{Y}\) is initialised using Eq. 1. The intuition behind our idea is that the proposed method can capture the relationship between the feature space and the manifold label space. Inspired by the autoencoder architecture, we develop an effective method that integrates the characteristics of a low-rank coefficient matrix and a semantic numerical label matrix. Specifically, our method is composed of an encoder–decoder architecture: the encoder learns the projection matrix \(W \in \mathbb {R}^{k \times d}\) from the feature space X to the numerical label space \(\widetilde{Y}\), while the decoder projects back to the feature space with \(W^{T} \in \mathbb {R}^{d \times k}\) to reconstruct the input training data, as shown in Fig. 2. The objective function is formulated as

$$\begin{aligned} \min _{W} \left\| X - W^{T}WX \right\| ^{2}_{F} \quad \text {s.t. } WX=\widetilde{Y} \end{aligned}$$
(2)

where \(\left\| . \right\| _{F}\) is the Frobenius norm.

3.2.1 Optimisation algorithm

To optimise the objective function in Eq. 2, we first substitute \(\widetilde{Y}\) for WX in the reconstruction term. The hard constraint \(WX=\widetilde{Y}\) still makes Eq. 2 difficult to solve directly, so we relax it into a soft penalty and reformulate the objective function (2) as

$$\begin{aligned} \min _{W} \left\| X - W^{T}\widetilde{Y} \right\| ^{2}_{F} + \lambda \left\| WX - \widetilde{Y}\right\| ^{2}_{F} \end{aligned}$$
(3)

where \(\lambda \) is a parameter that controls the importance of the second term. The objective function in Eq. 3 is non-convex and contains two unknown variables, W and \(\widetilde{Y}\), so it is difficult to solve directly. We therefore iteratively update one variable while fixing the other. Since the objective function is convex in each variable separately, we compute the partial derivatives of Eq. 3 with respect to W and \(\widetilde{Y}\) and set each to zero.

Algorithm 1 Robust Multi-label Feature Learning based Dual Space (RMFDS)

  • Update W:

    $$\begin{aligned} \begin{aligned}&-\widetilde{Y}\left( X^{T} - \widetilde{Y}^{T}W \right) + \lambda \left( WX-\widetilde{Y} \right) X^{T}=0 \\&\Rightarrow \widetilde{Y} \widetilde{Y}^{T}W + \lambda WXX^{T}=\widetilde{Y}X^{T} + \lambda \widetilde{Y} X^{T} \\&\Rightarrow PW + WQ=R \end{aligned} \end{aligned}$$
    (4)

    where \(P=\widetilde{Y}\widetilde{Y}^{T}\), \(Q=\lambda XX^{T}\), and \(R=\left( \lambda + 1 \right) \widetilde{Y}X^{T}\)

  • Update \(\widetilde{Y}\):

    $$\begin{aligned} \begin{aligned}&-WX+WW^{T}\widetilde{Y} + \lambda \left( -WX + \widetilde{Y} \right) =0 \\&\Rightarrow WW^{T}\widetilde{Y} + \lambda \widetilde{Y} = WX + \lambda WX \\&\Rightarrow A\widetilde{Y}+\widetilde{Y}B=D \end{aligned} \end{aligned}$$
    (5)

    where \(A=WW^{T}\), \(B=\lambda I\), \(D=\left( \lambda + 1 \right) WX\) and \(I \in \mathbb {R}^{k \times k}\) is an identity matrix.

Equations 4 and 5 take the form of the well-known Sylvester equation \(MX+XN=O\): a matrix equation with given matrices M, N, and O whose goal is to find the unknown matrix X. The Sylvester equation can be solved efficiently and yields a unique solution; for further explanation and proofs, the reader can refer to [1]. Using Kronecker products and the vectorisation operator vec, Eqs. 4 and 5 can be written as linear equations, respectively.

$$\begin{aligned} \left( I_{d} \otimes P + Q^{T} \otimes I_{k} \right) vec(W)=vec(R) \end{aligned}$$
(6)

where \(I_{d} \in \mathbb {R}^{d \times d}\) and \(I_{k} \in \mathbb {R}^{k \times k}\) are identity matrices and \(\otimes \) is the Kronecker product.

$$\begin{aligned} \left( I_{n} \otimes A + B^{T} \otimes I_{k} \right) vec(\widetilde{Y})=vec(D) \end{aligned}$$
(7)

where \(I_{n} \in \mathbb {R}^{n \times n}\) is an identity matrix. Fortunately, this equation can be solved in MATLAB with a single call to the built-in sylvester function. The two unknown matrices W and \(\widetilde{Y}\) can now be iteratively updated using the optimisation rules above until convergence. The procedure is described in Algorithm 1.
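For illustration, the following Python/SciPy sketch mirrors the alternating updates of Algorithm 1 (the released code is in MATLAB; all names here are ours). It uses scipy.linalg.solve_sylvester, which solves \(MX+XN=O\) directly, for the W update, and exploits the fact that \(B=\lambda I\) reduces the \(\widetilde{Y}\) update in Eq. 5 to an ordinary linear system:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def rmfds_fit(X, Y_tilde0, lam=1.0, max_iter=100, tol=1e-6):
    """Alternating optimisation of Eq. 3.
    X: d x n feature matrix; Y_tilde0: k x n initial numerical labels (Eq. 1)."""
    Y_tilde = Y_tilde0.copy()
    k = Y_tilde.shape[0]
    prev_obj = np.inf
    for _ in range(max_iter):
        # Update W (Eq. 4): solve P W + W Q = R with a Sylvester solver.
        P = Y_tilde @ Y_tilde.T                 # k x k
        Q = lam * (X @ X.T)                     # d x d
        R = (lam + 1.0) * (Y_tilde @ X.T)       # k x d
        W = solve_sylvester(P, Q, R)
        # Update Y_tilde (Eq. 5): since B = lam * I, the Sylvester equation
        # collapses to the linear system (W W^T + lam I) Y_tilde = (lam+1) W X.
        A = W @ W.T
        D = (lam + 1.0) * (W @ X)
        Y_tilde = np.linalg.solve(A + lam * np.eye(k), D)
        # Objective value of Eq. 3, used as the stopping criterion.
        obj = (np.linalg.norm(X - W.T @ Y_tilde, 'fro') ** 2
               + lam * np.linalg.norm(W @ X - Y_tilde, 'fro') ** 2)
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return W, Y_tilde
```

The W update costs \(\mathcal {O}(d^{3})\) per iteration in the worst case, consistent with the complexity analysis in Sect. 4.6.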

Our proposed method learns the encoder projection matrix W. We can therefore embed a new test sample \(x^{s}_{i}\) into the semantic label space by \(\widetilde{y}_{i}=Wx^{s}_{i}\) and, similarly, reconstruct the original features using the decoder projection matrix \(W^{T}\) by \(x^{s}_{i}=W^{T}\widetilde{y}_{i}\). Hence, W contains the discriminative features for predicting the real semantic labels. To identify these features, we rank each feature m according to the value of \(\left\| W_{:,m} \right\| _{2}\, \left( m=1,\cdots ,d \right) \), i.e. the norm of the corresponding column of W, in descending order and return the top-ranked features.
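A short sketch of how the learned W would be used at test time and for feature ranking (the random W here is a stand-in for a trained projection; feature m enters the encoder only through the m-th column of W):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 6
W = rng.standard_normal((k, d))     # stand-in for the learned k x d projection
x_test = rng.standard_normal(d)     # a new d-dimensional test sample

y_semantic = W @ x_test             # encoder: embed into the semantic label space
x_reconstructed = W.T @ y_semantic  # decoder: recover the original features

# Feature ranking: L2 norm of each column of W, in descending order.
scores = np.linalg.norm(W, axis=0)          # length-d score vector
top_features = np.argsort(-scores)[:50]     # indices of the top-50 features
```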

Table 1 Characteristics of the evaluated datasets
Table 2 Average results of the first 50 features on eight datasets with different missing label proportions

4 Experiments

4.1 Experimental datasets

We open-source our RMFDS code for reproducibility of our experiments. Our proposed method is coded in MATLAB, and the experiments ran on an Intel Core i5-8250U CPU at 1.80 GHz with 8 GB of memory. Experiments are conducted on eight public multi-label datasets, downloaded from the Mulan repository. The details of these datasets are summarised in Table 1. Among them, Scene consists of 2407 images, each associated with six scenes; Emotions contains 593 songs, each related to six emotions; and six Yahoo datasets from the text domain are used: Business, Computers, Entertainment, Health, Reference, and Science [16].

To evaluate the compared algorithms, we use six standard multi-label evaluation metrics, namely Hamming loss, Ranking loss, Coverage, Average precision, Micro-F1, and Macro-F1, as defined in [2]. For the first three, smaller values indicate better performance; for the other three, the reverse is true.
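For reference, these six metrics have standard scikit-learn counterparts; the sketch below is one way to compute them (the 0.5 threshold for binarising scores is our assumption, and [2] remains the authoritative definition):

```python
import numpy as np
from sklearn.metrics import (hamming_loss, label_ranking_loss, coverage_error,
                             label_ranking_average_precision_score, f1_score)

def evaluate(Y_true, Y_scores, threshold=0.5):
    """Y_true: n x k binary ground truth; Y_scores: n x k real-valued
    label scores (e.g. ML-KNN posterior probabilities)."""
    Y_pred = (Y_scores >= threshold).astype(int)
    return {
        "Hamming loss": hamming_loss(Y_true, Y_pred),
        "Ranking loss": label_ranking_loss(Y_true, Y_scores),
        "Coverage": coverage_error(Y_true, Y_scores),
        "Average precision": label_ranking_average_precision_score(Y_true, Y_scores),
        "Micro-F1": f1_score(Y_true, Y_pred, average="micro"),
        "Macro-F1": f1_score(Y_true, Y_pred, average="macro"),
    }
```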

4.2 Comparing methods and experiment settings

Multi-label feature selection methods have attracted the interest of researchers over the last decade. In this study, we compare our proposed method RMFDS against recent state-of-the-art multi-label feature selection methods, namely GLOCAL [23], MCLS [10], and MSSL [4]; MCLS and MSSL consider feature manifold learning. ML-KNN (K=10) [20] is used as the multi-label classifier to evaluate the performance of the identified features, with results reported for numbers of selected features varying from 1 to 100. PCA is applied as a preprocessing step, retaining \(95\%\) of the data variance. To ensure a fair comparison, the parameters of the compared methods are tuned to find their optimum values: for GLOCAL, the regularisation parameters \(\lambda _{3}\) and \(\lambda _{4}\) are tuned in \(\left\{ 0.0001,0.001,\dots , 1 \right\} \), the number of clusters is searched in \(\left\{ 4,8,16,32,64 \right\} \), and the latent dimensionality rank is tuned in \(\left\{ 5,10,\dots , 30 \right\} \); MCLS uses its default settings; and for MSSL, the parameters \(\alpha \) and \(\beta \) are tuned over the grid \(\left\{ 0.001,0.01,\dots ,1000 \right\} \). The parameter settings for RMFDS are described in Sect. 4.4.

Fig. 3 The classification results of the compared methods with different evaluation metrics on the Scene dataset

Fig. 4 The classification results of the compared methods with different evaluation metrics on the Business dataset

Fig. 5 The classification results of the compared methods with different evaluation metrics on the Computers dataset

Fig. 6 The classification results of the compared methods with different evaluation metrics on the Emotions dataset

Fig. 7 The classification results of the compared methods with different evaluation metrics on the Entertainment dataset

Fig. 8 The classification results of the compared methods with different evaluation metrics on the Health dataset

Fig. 9 The classification results of the compared methods with different evaluation metrics on the Reference dataset

Fig. 10 The classification results of the compared methods with different evaluation metrics on the Science dataset

4.3 Results

4.3.1 Classification results

Several experiments demonstrate the classification performance of RMFDS compared to state-of-the-art multi-label feature selection methods. Figures 3, 4, 5, 6, 7, 8, 9 and 10 show the results of the six multi-label evaluation metrics, namely Hamming loss, Ranking loss, Coverage, Average precision, Micro-F1, and Macro-F1, on the eight datasets. The classification results in these figures are based on the top-ranked 100 features (except for the Emotions dataset, which only has 72 features). The results show that our proposed method achieves a significant classification improvement as the number of selected features increases and then remains stable. This observation indicates that studying dimensionality reduction in multi-label learning is meaningful, and it highlights the stability and capability of RMFDS, which achieves good performance on all the datasets with few selected features.

Fig. 11 The convergence and computational time analysis for the RMFDS method on several datasets. a and b show the objective function values, and c shows the computational time

The proposed method is compared with the state of the art on each dataset. In Figs. 3, 4, 5, 6, 7, 8, 9 and 10, RMFDS achieves better results than MCLS, MSSL, and GLOCAL on almost all the evaluated datasets. Specifically, in terms of Hamming loss, Ranking loss, and Coverage, where smaller values mean better performance, RMFDS's features substantially improve the classification results compared to the state of the art. MCLS obtains the worst results, while MSSL and GLOCAL achieve comparable results, as shown in Figs. 6, 7, 8, 9 and 10. In terms of Average precision, Micro-F1, and Macro-F1, where larger values mean better performance, RMFDS generally achieves better results than the compared methods on all datasets. We note that RMFDS performs only slightly better than MSSL and GLOCAL on the Emotions and Entertainment datasets under these three metrics, as shown in Figs. 6 and 7, and that the state-of-the-art methods produce unstable results on the Business, Reference, and Computers datasets under the Micro-F1 metric. In general, our proposed method demonstrates the benefit of using the label manifold in an encoder–decoder architecture to identify discriminative features.

Fig. 12 RMFDS results on the Emotions and Reference datasets. a and b are the precision results w.r.t. different parameters

Furthermore, using the Friedman test, we investigate whether the results produced by RMFDS differ significantly from the state of the art. In particular, we run the Friedman test between RMFDS and each compared method for each evaluation metric on the eight datasets. The p value in all tests is less than 0.05, rejecting the null hypothesis that the proposed method and the compared methods perform equally. Finally, we explore the reconstruction capability of the decoder in RMFDS, which uses the projection matrix W to reconstruct the original data. Table 3 reports the reconstruction errors using the training logical labels (Y), the training predicted labels (\(\widetilde{Y}\)), and the logical testing labels. Across the eight datasets, the percentage reconstruction error of the original training matrix using the logical training labels ranges between \(4\%\) and \(8\%\), and drops dramatically, to between \(0.1\%\) and \(3\%\), when the predicted numerical label matrix is used. This reveals that the decoder plays an important role in selecting the features that can reconstruct the original matrix, and it supports our argument that visual images can be reconstructed from the semantic labels and the coefficient matrix. In addition, the testing data matrix can be reconstructed from the projection matrix and the testing labels with a small error of between \(4\%\) and \(8\%\).

Table 3 Reconstruction of different error values using the RMFDS decoder
Fig. 13 Results of the ablation experiments on multiple datasets for the top 100 features, with a step size of 10

4.3.2 RMFDS results for handling missing labels

The proposed method learns the semantic label matrix during the back-and-forth projections; the RMFDS method should therefore be able to recover missing labels in the original label matrix during optimisation. To investigate this ability, we randomly removed different proportions of the labels, from moderate to extreme levels: 20%, 40%, 60%, and 80%. Motivated by the RMFDS results in Figs. 3, 4, 5, 6, 7, 8, 9 and 10, which show stable classification accuracy above 40 features, we compute the average results of the first 50 features for each evaluation metric and report them in Table 2. The results indicate that our proposed method improves multi-label classification even in the presence of missing labels. Specifically, RMFDS significantly improves accuracy under extreme missing-label proportions (60% and 80%) on four datasets, namely Scene, Emotions, Science and Entertainment (Table 2).
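The removal protocol we assume can be simulated as below (a sketch; the paper does not specify the exact sampling scheme):

```python
import numpy as np

def drop_labels(Y, proportion, seed=0):
    """Flip a given proportion of the positive entries of the k x n
    logical label matrix Y to zero, simulating missing labels."""
    rng = np.random.default_rng(seed)
    Y_missing = Y.copy()
    positives = np.argwhere(Y_missing == 1)
    n_drop = int(proportion * len(positives))
    chosen = positives[rng.choice(len(positives), size=n_drop, replace=False)]
    Y_missing[chosen[:, 0], chosen[:, 1]] = 0
    return Y_missing

# Example: the extreme 80% missing-label setting.
# Y_80 = drop_labels(Y, 0.8)
```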

4.4 Parameter sensitivity

In this section, we study the influence of the proposed method's parameters, \(\lambda \) and MaxIteration, on the classification results. The first, \(\lambda \), controls the relative contribution of the decoder and encoder, while the second defines the number of iterations allowed for convergence. The parameters \(\lambda \) and MaxIteration are tuned by grid search over \(\left\{ 0.2,0.4,\dots ,2 \right\} \) and \(\left\{ 1,20,40,\dots ,100 \right\} \), respectively; a sketch of this search appears below. As shown in Fig. 12a and b, using the average precision metric on two datasets, the learning performance improves as \(\lambda \) increases.
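The tuning amounts to a simple grid search; a sketch, reusing rmfds_fit from the earlier sketch in Sect. 3.2.1 (X_train, Y_tilde0 and the validation scorer are hypothetical placeholders):

```python
import itertools
import numpy as np

lambdas = np.arange(0.2, 2.01, 0.2)    # {0.2, 0.4, ..., 2}
iterations = [1, 20, 40, 60, 80, 100]

best_params, best_score = None, -np.inf
for lam, max_iter in itertools.product(lambdas, iterations):
    W, Y_tilde = rmfds_fit(X_train, Y_tilde0, lam=lam, max_iter=max_iter)
    score = average_precision_on_validation(W)   # hypothetical scorer
    if score > best_score:
        best_params, best_score = (lam, max_iter), score
```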

4.5 Convergence analysis and computational time

We analyse the convergence of RMFDS on the eight datasets. Figure 11a and b shows the objective function values on the training datasets: our method converges rapidly, within around 10 iterations, demonstrating its efficacy and speed. Figure 11c shows the computational time of RMFDS for different numbers of iterations; the time increases linearly with the number of iterations.

4.6 Computational analysis

In this section, we analyse the computational cost of the key operations in Algorithm 1: the initialisation and the while loop.

Initialisation. Computing each entry \(C_{jk}\) involves several summations and multiplications, with computational complexity \(\mathcal {O}(n)\); calculating \(\widetilde{Y}\) has a time complexity of \(\mathcal {O}(dkn)\).

While Loop (MaxIteration). The loop runs at most MaxIteration times. Within each iteration, W and \(\widetilde{Y}\) are updated by solving the Sylvester equations, by vectorising the matrices and applying standard linear algebra operations; the worst case is \(\mathcal {O}(d^{3})\). These updates dominate the overall time complexity of the algorithm, which can therefore be approximated as \(\mathcal {O}(\text {MaxIteration} \times \max (d^{3}, \text {complexity of Eq.}\,7))\).

5 Ablation study

To assess the role of the constraint in our objective function (Eq. 2), we removed the constraint \(WX=\widetilde{Y}\) and repeated the experiments on all datasets with all evaluation metrics. As shown in Fig. 13, the results degrade across the various evaluation metrics for the top 100 features, with a step size of 10.

6 Conclusion

This paper proposes a novel semantic multi-label learning model based on an autoencoder. The proposed method learns a projection matrix that maps between the feature space and the semantic space back and forth. Since the semantic labels are not explicitly available in the training samples, they are predicted during optimisation. We further rank the feature weights in the learned projection matrix for feature selection. The proposed method is simple and computationally fast. Extensive experiments demonstrate that it outperforms the state of the art and that it efficiently reconstructs the original data using the predicted labels.