1 Introduction

Feature extraction plays an important role in pattern recognition. As a powerful supervised feature extraction method, linear discriminant analysis (LDA) [1] has been successfully applied in many problems, such as face recognition [2, 3], text mining [4, 5], image retrieval [6, 7], gait recognition [8], and microarrays [9, 10].

However, classical LDA is a vector-based (one-dimensional, 1D) method. When the input data are naturally in matrix (two-dimensional, 2D) form, such as images, two issues may arise. First, converting 2D data to 1D data produces high-dimensional vectors and hence may lead to the small sample size (SSS) problem [11]. For example, a 32 × 32 face image corresponds to a 1024-dimensional vector. Second, the transformation from 2D data to 1D data destroys the underlying spatial (structural) information, so useful discriminant information may be lost [12, 13]. To handle these problems, many image-as-matrix methods have been developed [14, 15]. In contrast to image-as-vector methods, image-as-matrix methods treat an image as a second-order tensor, and their objective functions are expressed in terms of the image matrix instead of a high-dimensional image vector. The representative image-as-matrix method is two-dimensional LDA (2DLDA) [16]. 2DLDA constructs the within-class and between-class scatter matrices from the original image samples in matrix form rather than converting them to vectors beforehand. Compared to LDA, 2DLDA alleviates the SSS problem when a mild condition is satisfied [17] and preserves the original structure of the input matrix.

Thereafter, many modifications and improvements of 2DLDA were studied. Because of its squared L2-norm formulation, 2DLDA is sensitive to noise and outliers. To improve its robustness, robust replacements of the L2-norm were investigated, including the L1-norm [18,19,20,21], the nuclear norm [22, 23], the Lp-norm [24, 25], and the Schatten Lp-norm with 0 < p < 1 [26]. Other studies focused on extracting discriminative transformations on both sides of the matrix samples. The authors of [27, 28] applied 2DLDA to the matrices sequentially or independently and then combined the left- and right-side transformations to achieve bilateral dimensionality reduction. Li et al. [25] used iterative schemes to extract transformations on both sides. Extensions to other machine learning problems and real applications were also investigated. For example, Wang et al. [29] proposed a convolutional 2DLDA for nonlinear dimensionality reduction, and Xiao et al. [30] studied a two-dimensional quaternion sparse discriminant analysis that meets the requirements of representing RGB and RGB-D images.

Although 2DLDA can ease the SSS problem, it may, like LDA, still face the singularity issue in theory, since it needs to solve a generalized eigenvalue problem. Recently, a vector-based L2-norm linear discriminant analysis criterion based on Bhattacharyya error bound estimation (L2BLDA) [31] was proposed. Compared to LDA, L2BLDA solves a standard eigenvalue decomposition problem rather than a generalized one, which avoids the singularity issue and improves robustness. In fact, minimizing a Bhattacharyya error [32] bound is a reasonable way to establish a classifier [33]. In this paper, inspired by L2BLDA, to cope with the SSS problem and improve the robustness of 2DLDA, we first derive a Bhattacharyya error upper bound for matrix-input classification and then propose a novel two-dimensional linear discriminant analysis, called 2DBLDA, that minimizes this upper bound. The proposed 2DBLDA has the following characteristics:

  • 2DBLDA is formulated directly for the two-dimensional matrix-input problem. The 2DBLDA criterion is derived within the theoretical framework of Bhattacharyya error bound optimality: we prove that it is an upper bound of the Bhattacharyya error and that optimizing this upper bound leads to optimal discriminant directions. Therefore, the rationality of the 2DBLDA optimization problem is guaranteed theoretically.

  • The constant that weights the between-class distance against the within-class distance in 2DBLDA is computed directly from the input data. This constant not only helps the objective of 2DBLDA achieve the minimum error bound but also makes 2DBLDA adaptive to the data without any parameter tuning. By exploiting the weighted between-class distance information, 2DBLDA achieves robustness.

  • Unlike 2DLDA, 2DBLDA is solved efficiently through a standard eigenvalue decomposition problem, which involves no matrix inverse and hence avoids the singularity caused by the SSS problem.

  • To assess the discriminant ability of our method, we report the accuracy on different databases, plot the variation of accuracy with the reduced dimension, and measure the reconstruction performance on face images. The experimental results on image recognition and face reconstruction demonstrate the effectiveness of 2DBLDA.

The paper is organized as follows. Section 2 briefly introduces LDA, L2BLDA and 2DLDA. Section 3 proposes our 2DBLDA and gives the corresponding theoretical analysis. Section 4 compares 2DBLDA with related approaches experimentally. Section 5 discusses the relationship between 2DBLDA and related methods and analyses the experimental results. Finally, concluding remarks are given in Section 6. The proof of the Bhattacharyya error upper bound of 2DBLDA is given in the Appendix.

The notations of this paper are given as follows. We consider a supervised learning problem in the d1 × d2-dimensional matrix space \(\mathbb {R}^{d_{1}\times d_{2}}\). The training dataset is given by T = {(X1,y1),...,(XN,yN)}, where \(\textbf {X}_{l}\in \mathbb {R}^{d_{1}\times d_{2}}\) is the l-th input matrix sample and yl ∈{1,...,c} is the corresponding label, l = 1,...,N. Assume that the i-th class contains Ni samples, i = 1,…,c. Then, we have \(\sum \limits _{i=1}^{c}N_{i}=N\). We further write the samples in the i-th class as {Xis}, where Xis is the s-th sample in the i-th class, i = 1,…,c, s = 1,…,Ni. Let \(\overline {\textbf {X}}=\frac {1}{N}\sum \limits _{l=1}^{N}\textbf {X}_{l}\) be the mean of all matrix samples and \({\overline {\textbf {X}}}_{i}=\frac {1}{N_{i}}\sum \limits _{s=1}^{N_{i}}\textbf {X}_{is}\) be the mean of matrix samples in the i-th class. For a matrix \(\textbf {Q}=(\textbf {q}_{1}, \textbf {q}_{2},\ldots ,\textbf {q}_{n})\in \mathbb {R}^{m\times n}\), its Frobenius norm (F-norm) ||Q||F is defined as \(||\textbf {Q}||_{F}=\sqrt {\sum \limits _{i=1}^{n}||\textbf {q}_{i}||_{2}^{2}}\). The F-norm is a natural generalization of the vector L2-norm on matrices.

2 Related work

2.1 Linear discriminant analysis

Linear discriminant analysis (LDA) finds a projection transformation matrix W such that the ratio of between-class distance to within-class distance is maximized in the projected space. For data in \(\mathbb {R}^{n}\), LDA finds an optimal \(\textbf {W}\in \mathbb {R}^{n\times r}\), r ≤ n, such that the most discriminant information of the data is retained in \(\mathbb {R}^{r}\) by solving the following problem:

$$ \underset{\textbf{W}}{\max}~~\frac{\text{tr}(\textbf{W}^{T}\textbf{S}_{b}\textbf{W})} {\text{tr}(\textbf{W}^{T}\textbf{S}_{w}\textbf{W})}, $$
(1)

where tr(⋅) is the trace operation of a matrix, and the between-class scatter matrix Sb and the within-class scatter matrix Sw are defined by

$$ \textbf{S}_{b}=\frac{1}{N}\sum\limits_{i=1}^{c}N_{i}({\overline{\textbf{x}}}_{i}-{\overline{\textbf{x}}})({\overline{\textbf{x}}}_{i}-{\overline{\textbf{x}}})^{T} $$
(2)

and

$$ \textbf{S}_{w}=\frac{1}{N}\sum\limits_{i=1}^{c}\sum\limits_{s=1}^{N_{i}}(\textbf{x}_{is}-{\overline{\textbf{x}}_{i}})(\textbf{x}_{is}-{\overline{\textbf{x}}_{i}})^{T}, $$
(3)

where \(\overline {\textbf {x}}_{i}\in \mathbb {R}^{n}\) is the mean of the samples in the i-th class, \(\overline {\textbf {x}}\in \mathbb {R}^{n}\) is the mean of the whole data, and \(\textbf {x}_{is}\in \mathbb {R}^{n}\) is the s-th sample of the i-th class. The optimization problem (1) is equivalent to the generalized eigenvalue problem Sbw = λSww, λ≠ 0, whose solution W = (w1,…,wr) consists of the eigenvectors corresponding to the first r largest eigenvalues of \((\textbf {S}_{w})^{-1}\textbf {S}_{b}\), provided that Sw is nonsingular.

2.2 L2-norm linear discriminant analysis criterion via the Bhattacharyya error bound estimation

As an improvement over LDA, the L2-norm linear discriminant analysis criterion based on Bhattacharyya error bound estimation (L2BLDA) [31] is a recently proposed vector-based weighted linear discriminant analysis. In the vector space \(\mathbb {R}^{n}\), by minimizing an upper bound of the Bhattacharyya error, the optimization problem of L2BLDA is formulated as

$$ \begin{array}{ll} \underset{\textbf{W}}{\min}~~&-\frac{1}{N}{\sum}_{i<j}\sqrt{N_{i}N_{j}}||\textbf{W}^{T} (\overline{\textbf{x}}_{i}-\overline{\textbf{x}}_{j})||_{2}^{2}+{\varDelta}\sum\limits_{i=1}^{c} \sum\limits_{s=1}^{N_{i}}||\textbf{W}^{T}(\textbf{x}_{is}-\overline{\textbf{x}}_{i})||_{2}^{2}\\ \text{s.t.\ }& \textbf{W}^{T}\textbf{W}=\textbf{\textbf{I}}, \end{array} $$
(4)

where \(\textbf {W}\in \mathbb {R}^{n\times r}\), r ≤ n, \(P_{i}=\frac {N_{i}}{N}\), \(P_{j}=\frac {N_{j}}{N}\), \(\overline {\textbf {x}}_{i}\in \mathbb {R}^{n}\) is the mean of the samples in the i-th class, \(\textbf {x}_{is}\in \mathbb {R}^{n}\) is the s-th sample of the i-th class, \({\varDelta }=\frac {1}{4}\sum \limits _{i<j}^{c}\sqrt {P_{i}P_{j}}||{\overline {\textbf {x}}_{i}-\overline {\textbf {x}}_{j}}||_{2}^{2}\), and \(\textbf {\textbf {I}}\in \mathbb {R}^{r\times r}\) is the identity matrix.

L2BLDA is solved through the following standard eigenvalue decomposition problem:

$$ \begin{array}{ll} \underset{\textbf{W}}{\min}&~~\text{tr}(\textbf{W}^{T}\textbf{S}\textbf{W})\\ \text{s.t.\ }& \textbf{W}^{T}\textbf{W}=\textbf{I}, \end{array} $$
(5)

where

$$ \textbf{S}=-\frac{1}{N}\sum\limits_{i<j}\sqrt{N_{i}N_{j}}(\overline{\textbf{x}}_{i}- \overline{\textbf{x}}_{j})(\overline{\textbf{x}}_{i}-\overline{\textbf{x}}_{j})^{T}+{\varDelta} \sum\limits_{i=1}^{c}\sum\limits_{s=1}^{N_{i}}(\textbf{x}_{is}-\overline{\textbf{x}}_{i})(\textbf{x}_{is}-\overline{\textbf{x}}_{i})^{T}. $$
(6)

Then, W = (w1,w2,…,wr) is formed by the r orthonormal eigenvectors corresponding to the first r smallest nonzero eigenvalues of S. After obtaining the optimal W, a new sample \(\textbf {x}\in \mathbb {R}^{n}\) is projected into \(\mathbb {R}^{r}\) by WTx.

2.3 Two-dimensional linear discriminant analysis

Different from LDA and L2BLDA, which work on vector samples, two-dimensional linear discriminant analysis (2DLDA) [16, 17] operates on matrix samples. 2DLDA defines the between-class scatter matrix and the within-class scatter matrix directly on the 2D data set T as

$$ \textbf{S}_{b}=\frac{1}{N}\sum\limits_{i=1}^{c}N_{i}({\overline{\textbf{X}}}_{i}-{\overline{\textbf{X}}})({\overline{\textbf{X}}}_{i}-{\overline{\textbf{X}}})^{T} $$
(7)

and

$$ \textbf{S}_{w}=\frac{1}{N}\sum\limits_{i=1}^{c}\sum\limits_{s=1}^{N_{i}}(\textbf{X}_{is}-{\overline{\textbf{X}}}_{i})(\textbf{X}_{is}-{\overline{\textbf{X}}}_{i})^{T}. $$
(8)

Then 2DLDA solves the following optimization problem:

$$ \underset{\textbf{W}}{\max}~~\frac{\text{tr}(\textbf{W}^{T}\textbf{S}_{b}\textbf{W})}{\text{tr} (\textbf{W}^{T}\textbf{S}_{w}\textbf{W})}=\frac{\sum\limits_{i=1}^{c}N_{i}\|\textbf{W}^{T} ({\overline{\textbf{X}}}_{i}-{\overline{\textbf{X}}})\|_{F}^{2}}{\sum\limits_{i=1}^{c} \sum\limits_{s=1}^{N_{i}}\|\textbf{W}^{T}(\textbf{X}_{is}-{\overline{\textbf{X}}_{i}})\|_{F}^{2}}, $$
(9)

where \(\textbf {W}=(\textbf {w}_{1},\ldots ,\textbf {w}_{r})\in \mathbb {R}^{d_{1}\times r}\), r ≤ d1, i = 1,…,c, s = 1,…,Ni. Problem (9) can be solved through the generalized eigenvalue problem Sbw = λSww when Sw is nonsingular, and its solution consists of the r eigenvectors corresponding to the first r largest nonzero eigenvalues. After obtaining the optimal W, a new sample \(\textbf {X}\in \mathbb {R}^{d_{1}\times d_{2}}\) is projected into \(\mathbb {R}^{r\times d_{2}}\) by WTX. Note that 2DLDA still encounters the singularity problem when Sw is not of full rank.
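For illustration, the following is a minimal NumPy sketch of the 2DLDA procedure under the assumption that Sw is nonsingular. The (N, d1, d2) data layout, the function name and the use of scipy.linalg.eigh for the generalized eigenvalue problem are choices made for this sketch, not the original implementation.

```python
import numpy as np
from scipy.linalg import eigh

def two_d_lda(X, y, r):
    """Illustrative 2DLDA. X: (N, d1, d2) array of matrix samples, y: labels.
    Returns W of shape (d1, r), assuming the within-class scatter is nonsingular."""
    y = np.asarray(y)
    N, d1, d2 = X.shape
    X_bar = X.mean(axis=0)                       # global mean matrix
    Sb = np.zeros((d1, d1))
    Sw = np.zeros((d1, d1))
    for c in np.unique(y):
        Xc = X[y == c]
        Xc_bar = Xc.mean(axis=0)                 # class mean matrix
        D = Xc_bar - X_bar
        Sb += len(Xc) * (D @ D.T)                # between-class scatter, cf. (7)
        for Xs in Xc:
            E = Xs - Xc_bar
            Sw += E @ E.T                        # within-class scatter, cf. (8)
    Sb /= N
    Sw /= N
    vals, vecs = eigh(Sb, Sw)                    # generalized problem S_b w = lambda S_w w
    idx = np.argsort(vals)[::-1][:r]             # r largest eigenvalues
    return vecs[:, idx]

# A new d1 x d2 sample X_new is then projected to the r x d2 space by W.T @ X_new.
```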

3 Two-dimensional Bhattacharyya bound linear discriminant analysis

3.1 The derivation of a Bhattacharyya error bound estimation

In this section, we derive a new two-dimensional linear discriminant analysis criterion by minimizing a Bhattacharyya error bound.

From the viewpoint of minimizing the probability of classification error, the Bayes classifier is the best classifier [1], and its error rate, known as the Bayes error, is defined as

$$ \epsilon = 1- \int\underset{i\in\{1,2,\ldots,c\}}{max}\{P_{i}p_{i}(\textbf{X})\}d\textbf{X}, $$
(10)

where X is a sample, Pi is the prior probability of the i-th class, and pi(X) is the probability density function of the i-th class. Computing the Bayes error is usually hard, and minimizing an upper bound of it is therefore often considered an effective alternative [35,36,37]. Among various bounds, the Bhattacharyya error [32] is a close upper bound of the Bayes error, given by

$$ \epsilon_{B}=\sum\limits_{i<j}^{c}\sqrt{P_{i}P_{j}}\int\sqrt{p_{i}(\textbf{X})p_{j}(\textbf{X})}d\textbf{X}.\\ $$
(11)

In the context of two-dimensional supervised dimensionality reduction, if we can derive a relatively close upper bound of 𝜖B, we can obtain a reasonable dimensionality reduction model. In fact, under some basic assumptions, such an upper bound of 𝜖B can be obtained, as shown in the following proposition.

Proposition 1

Assume Pi and pi(X) are the prior probability and the probability density function of the i-th class for the training data set T, respectively, and the data samples in each class are independent and identically normally distributed. Let p1(X),p2(X),…,pc(X) be the Gaussian functions given by \(p_{i}(\textbf {X})=\mathcal {N}(\textbf {X}|{\overline {\textbf {X}}}_{i}, \boldsymbol {{\varSigma }}_{i})\), where \({\overline {\textbf {X}}}_{i}\) and Σi are the class mean and the class covariance matrix, respectively. We further suppose Σi = Σ, i = 1,2,…,c, where Σ is the covariance matrix of the data set T, and \({\overline {\textbf {X}}}_{i}\) and Σ can be estimated accurately from T. Then for arbitrary projection vector \(\textbf {w}\in \mathbb {R}^{d_{1}}\), the Bhattacharyya error bound 𝜖B defined by (11) on the data set \(\widetilde {T}=\{\widetilde {\textbf {X}}_{i}|\widetilde {\textbf {X}}_{i}=\textbf {w}^{T}\textbf {X}_{i}\in \mathbb {R}^{1\times d_{2}}\}\) satisfies the following:

$$ \begin{array}{@{}rcl@{}} \epsilon_{B} &\leq&-\frac{a}{8}\sum\limits_{i<j}^{c}\sqrt{P_{i}P_{j}}{||\textbf{w}^{T}({\overline{\textbf{X}}_{i}-\overline{\textbf{X}}_{j}})||_{2}^{2}}+\frac{a}{8}{\varDelta}{\sum}_{i=1}^{c}{\sum}_{s=1}^{N_{i}}||\textbf{w}^{T}(\textbf{X}_{is}-\overline{\textbf{X}}_{i})||_{2}^{2}\\ &&+\sum\limits_{i<j}^{c}\sqrt{P_{i}P_{j}}, \end{array} $$
(12)

where \({\varDelta }=\frac {1}{4}\sum \limits _{i<j}^{c}\sqrt {P_{i}P_{j}}||{\overline {\textbf {X}}_{i}-\overline {\textbf {X}}_{j}}||_{F}^{2}\), and a > 0 is some constant.

Proof

See the Appendix. □

3.2 The proposed two-dimensional Bhattacharyya bound linear discriminant analysis

Proposition 1 gives a reasonable upper bound of 𝜖B. Having obtained this upper error bound, it is natural to minimize it, that is, to minimize the right-hand side of (12). By doing so, we obtain a novel two-dimensional Bhattacharyya bound linear discriminant analysis (2DBLDA) as follows:

$$ \underset{\textbf{w}^{T}\textbf{w}=1}{\min}~~-\frac{1}{N}\sum\limits_{i<j}\sqrt{N_{i}N_{j}}|| \textbf{w}^{T}(\overline{\textbf{X}}_{i}-\overline{\textbf{X}}_{j})||_{2}^{2}+{\varDelta}\sum\limits_{i=1}^{c} \sum\limits_{s=1}^{N_{i}}||\textbf{w}^{T}(\textbf{X}_{is}-\overline{\textbf{X}}_{i})||_{2}^{2} $$
(13)

where \({\varDelta }=\frac {1}{4}\sum \limits _{i<j}^{c}\sqrt {P_{i}P_{j}}||{\overline {\textbf {X}}_{i}-\overline {\textbf {X}}_{j}}||_{F}^{2}\), \(\textbf {w}\in \mathbb {R}^{d_{1}}\), \(P_{i}=\frac {N_{i}}{N}\).

By applying (13), we can project a d1 × d2 sample X to a 1 × d2 sample \(\widetilde {\textbf {X}}\) by \(\widetilde {\textbf {X}}=\textbf {w}^{T}\textbf {X}\). However, a single projection direction usually does not retain enough discriminant information in the 1 × d2 space, and we may need r ≥ 1 projection vectors w1,w2,…,wr, which constitute a projection matrix \(\textbf {W}=(\textbf {w}_{1}, \textbf {w}_{2},\ldots ,\textbf {w}_{r})\in \mathbb {R}^{d_{1}\times r}\) and project X into an r × d2 space by \(\widetilde {\textbf {X}}=\textbf {W}^{T}\textbf {X}\).

In general, we therefore consider the following 2DBLDA problem:

$$ \begin{array}{ll} \underset{\textbf{W}}{\min}~~&-\frac{1}{N}\sum\limits_{i<j}\sqrt{N_{i}N_{j}}||\textbf{W}^{T}(\overline{\textbf{X}}_{i}-\overline{\textbf{X}}_{j})||_{F}^{2}+{\varDelta}\sum\limits_{i=1}^{c}\sum\limits_{s=1}^{N_{i}}||\textbf{W}^{T}(\textbf{X}_{is}-\overline{\textbf{X}}_{i})||_{F}^{2}\\ \text{s.t.\ }& \textbf{W}^{T}\textbf{W}=\textbf{\textbf{I}}, \end{array} $$
(14)

where \(\textbf {W}\in \mathbb {R}^{d_{1}\times r}\), r ≤ d1. We now give the geometric meaning of 2DBLDA. Minimizing the first term in (14) drives the means of different classes far apart in the projected space, which guarantees between-class separability. The coefficients \(\frac {1}{N}\sqrt {N_{i}N_{j}}\) in the first term weight the distances between pairs of class means. Minimizing the second term in (14) pulls each sample toward its own class mean in the projected space. The weighting constant Δ in front of the second term balances the between-class and within-class importance while also ensuring a minimum error bound, according to the proof of Proposition 1. Note that 2DBLDA is adaptive to different data since Δ is determined by the given data set. To ensure minimum redundancy in the projected space, we also impose the orthonormality constraint WTW = I on the discriminant directions.

It is easily seen that we can solve 2DBLDA through the following standard eigenvalue decomposition problem:

$$ \begin{array}{ll} \underset{\textbf{W}}{\min}&~~\text{tr}(\textbf{W}^{T}\textbf{S}\textbf{W})\\ \text{s.t.\ }& \textbf{W}^{T}\textbf{W}=\textbf{I}, \end{array} $$
(15)

where

$$ \textbf{S}=-\frac{1}{N}\sum\limits_{i<j}\sqrt{N_{i}N_{j}}(\overline{\textbf{X}}_{i}- \overline{\textbf{X}}_{j})(\overline{\textbf{X}}_{i}-\overline{\textbf{X}}_{j})^{T}+ {\varDelta}\sum\limits_{i=1}^{c}\sum\limits_{s=1}^{N_{i}}(\textbf{X}_{is}-\overline{\textbf{X}}_{i}) (\textbf{X}_{is}-\overline{\textbf{X}}_{i})^{T}. $$
(16)

Then, the optimal solution is \(\textbf {W}=\left (\textbf {w}_{1},\textbf {w}_{2},\ldots ,\textbf {w}_{r}\right )\), where w1,w2,…,wr are the r orthonormal eigenvectors corresponding to the first r smallest nonzero eigenvalues of S.
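For concreteness, the following minimal NumPy sketch assembles the matrix S in (16) together with the data-dependent constant Δ and extracts the projection matrix by a standard eigenvalue decomposition. The (N, d1, d2) data layout and the function name are assumptions made for this sketch; it is independent of the authors' released MATLAB code.

```python
import numpy as np
from itertools import combinations

def two_d_blda(X, y, r):
    """Illustrative 2DBLDA. X: (N, d1, d2) array of matrix samples, y: labels.
    Returns W of shape (d1, r) via a standard (not generalized) eigenproblem,
    so no matrix inverse is required."""
    y = np.asarray(y)
    N, d1, d2 = X.shape
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    counts = {c: int(np.sum(y == c)) for c in classes}

    # adaptive weighting constant Delta, computed directly from the data (no tuning)
    Delta = 0.25 * sum(np.sqrt(counts[i] * counts[j]) / N
                       * np.linalg.norm(means[i] - means[j], 'fro') ** 2
                       for i, j in combinations(classes, 2))

    S = np.zeros((d1, d1))
    for i, j in combinations(classes, 2):        # weighted between-class part of (16)
        D = means[i] - means[j]
        S -= np.sqrt(counts[i] * counts[j]) / N * (D @ D.T)
    for c in classes:                            # Delta-weighted within-class part of (16)
        for Xs in X[y == c]:
            E = Xs - means[c]
            S += Delta * (E @ E.T)

    vals, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    idx = np.argsort(vals)[:r]                   # r smallest eigenvalues
    return vecs[:, idx]                          # a careful implementation would skip zero eigenvalues
```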

4 Experiments

In this section, we compare the proposed 2DBLDA with 2DPCA [34], 2DPCA-L1 [12], 2DLDA [16] and L1-2DLDA [18, 19]. The learning parameter δ of L1-2DLDA is selected optimally from the set {0.001,0.005,0.01,0.05,0.1,0.5,1} by grid search.

We experiment on three image databases for image recognition and on one face image database for face reconstruction. In the experiments, a dimensionality reduction method is first applied to the training data to obtain a projection matrix, and the test data are then projected to the lower-dimensional space using this matrix. For image recognition, the nearest neighbour classifier is employed to obtain the classification accuracy. In addition, when the classes are unbalanced, the area under the ROC curve (AUC) and the G-mean are used as performance measures. For face reconstruction, the mean reconstruction error is used for evaluation. All methods are carried out in Matlab 2017b on a PC with a P4 2.3 GHz CPU.
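As an illustration of this evaluation protocol, the sketch below projects matrix samples with a learned W, flattens the projected features, and applies a one-nearest-neighbour rule; the flattened feature layout and the Euclidean distance are assumptions of this sketch rather than details stated in the paper.

```python
import numpy as np

def nn_accuracy(W, X_train, y_train, X_test, y_test):
    """1-NN classification accuracy in the projected space (illustrative)."""
    y_train = np.asarray(y_train)
    # project every d1 x d2 sample to r x d2 and flatten it into a feature vector
    F_train = np.array([(W.T @ X).ravel() for X in X_train])
    F_test = np.array([(W.T @ X).ravel() for X in X_test])
    correct = 0
    for f, label in zip(F_test, y_test):
        dists = np.linalg.norm(F_train - f, axis=1)   # Euclidean distances
        correct += int(y_train[np.argmin(dists)] == label)
    return correct / len(y_test)
```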

4.1 Image recognition

4.1.1 Performance on three image databases

The Yale databaseFootnote 1 is a human face database that contains 165 images of 15 individuals, with 11 images per individual. It is used to evaluate the performance of the methods under changes in facial expression and lighting conditions.

Columbia Object Image Library (Coil100)Footnote 2 is a database of colour images of 100 objects. The objects were placed on a motorized turntable against a black background. The turntable was rotated 360 degrees to vary object pose with respect to a fixed colour camera. Images of the objects were taken at pose intervals of 5 degrees. The database contains 900 images of 100 objects, with each object containing 9 images.

The COVID databaseFootnote 3 has 349 CT images containing clinical findings of COVID-19 from 216 patients and 397 non-COVID CT scans. The images are collected from COVID-19 related papers from medRxivFootnote 4, bioRxivFootnote 5, Lancet, etc. In our experiment, 195 COVID-19 images and 195 non-COVID-19 images were randomly extracted.

We resize each image to 32 × 32 for all three databases. Since the number of samples in some classes is relatively small, and random cross-validation might leave such classes unrepresented, we randomly select 60% of the samples of each class as the training set and use the rest as the test set; this strategy ensures that both the training and test sets contain samples from every class. We first obtain the projection matrices from the training data and then compute the classification accuracy on the projected test data. Since 2DPCA, 2DLDA and 2DBLDA have no parameters, the result of one run is the final result. For L1-2DLDA, which has one parameter, we adopt ten-fold cross validation on the training set to find its optimal value; with this value, L1-2DLDA is run ten times on the test set to eliminate the influence of random initialization, and the average of these ten accuracies is reported. Similarly, since the performance of 2DPCA-L1 is affected by the initial projections, we repeat it ten times and report the mean accuracy along with the standard deviation. The results on these databases are listed in Table 1, with the best accuracies shown in bold. From the table, we see that our 2DBLDA performs comparably to or better than the other methods. 2DPCA-L1 and L1-2DLDA clearly have the highest computational burden. In contrast, 2DBLDA costs the least CPU time, even less than 2DPCA and 2DLDA.

Table 1 Comparison of mean accuracy (%) and CPU time (seconds) for different methods on the original three databases

To further demonstrate the superiority of our 2DBLDA, we artificially pollute the training data by adding a rectangular block occlusion at a random location to each training sample. We set the occlusion area ratio to 10%, 20%, 30% and 40%. For convenience, we denote the resulting data sets as Yaleb0.1, Yaleb0.2, Yaleb0.3 and Yaleb0.4, where the subscript “b” denotes block occlusion and the following number is the occlusion ratio. For the Coil100 and COVID data, we add a rectangular patch of Gaussian noise with mean 0 and variance 0.2 covering 10%, 20%, 30% or 40% of each training image at a random position. We denote these eight data sets as Coilg0.1, Coilg0.2, Coilg0.3, Coilg0.4, COVIDg0.1, COVIDg0.2, COVIDg0.3 and COVIDg0.4, where the subscript “g” denotes Gaussian noise and the following number is the noise ratio. Some noisy samples are shown in Fig. 1.
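A possible way to generate such corrupted training images is sketched below. The exact occlusion shapes, positions and noise parameters used in the experiments may differ, so this is only an assumed reproduction of the described block occlusion and rectangular Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_block_occlusion(img, area_ratio):
    """Black out a random rectangle covering roughly `area_ratio` of the image."""
    d1, d2 = img.shape
    h = max(1, int(round(d1 * np.sqrt(area_ratio))))
    w = max(1, int(round(d2 * np.sqrt(area_ratio))))
    top = rng.integers(0, d1 - h + 1)
    left = rng.integers(0, d2 - w + 1)
    out = img.copy()
    out[top:top + h, left:left + w] = 0.0
    return out

def add_gaussian_patch(img, area_ratio, var=0.2):
    """Add zero-mean Gaussian noise (variance `var`) on a random rectangle."""
    d1, d2 = img.shape
    h = max(1, int(round(d1 * np.sqrt(area_ratio))))
    w = max(1, int(round(d2 * np.sqrt(area_ratio))))
    top = rng.integers(0, d1 - h + 1)
    left = rng.integers(0, d2 - w + 1)
    out = img.copy()
    out[top:top + h, left:left + w] += rng.normal(0.0, np.sqrt(var), size=(h, w))
    return out
```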

Fig. 1 Noise samples from the three databases

The classification results on the noisy datasets are listed in Table 2. From the table, we make the following observations: (i) All methods are affected by noise, and their accuracies are lower than those on the original data; in general, the larger the noise area, the lower the accuracy. (ii) The proposed 2DBLDA has the highest average accuracy on all noisy data. (iii) L1-2DLDA and 2DPCA perform better than 2DPCA-L1 and 2DLDA. (iv) L1-2DLDA achieves its optimal accuracy when δ is relatively small. (v) For CPU time, 2DPCA-L1 and L1-2DLDA are on the same level but are both slower than 2DPCA and 2DLDA, and 2DLDA and 2DBLDA run the fastest since they obtain all the discriminant vectors at once.

Table 2 Comparison of mean accuracy (%) and CPU time (seconds) for different methods on the noisy databases

4.1.2 The influence of the reduced dimension

To observe the discriminant ability of each dimensionality reduction method, we examine how the classification accuracy in the projected space varies with the number of reduced dimensions, as plotted in Figs. 2 and 3. Figure 2 depicts the variation of accuracy with the dimension on the three original databases, and Fig. 3 depicts the corresponding results on the noisy databases.

Fig. 2 Accuracies of all methods on the original three databases

Fig. 3 Accuracies of all methods on three databases with different levels of noise

The results show the following: (i) As the number of reduced dimensions increases, the accuracies of 2DPCA and our 2DBLDA first reach their highest values and then remain relatively steady, while the other methods vary greatly. (ii) On both the original data and the noisy data, the proposed 2DBLDA has the highest accuracy under its optimal reduced dimension. (iii) All methods are strongly influenced by the reduced dimension, so choosing an optimal reduced dimension is necessary. (iv) In general, the optimal reduced dimension of 2DBLDA is not too large compared to those of the other methods.

4.1.3 The influence of the unbalanced classes

In this subsection, we examine the behaviour of our algorithm on unbalanced classes. To construct unbalanced data, different numbers of images are randomly selected from each class to form the training set, and the remaining data are used as the test set. Specifically, for the COVID database, we randomly select 60% of the samples per class from the COVID-19 images and the non-COVID-19 images in a ratio of 1:1.5 as the training set; the resulting training and test sets are therefore unbalanced. To test robustness, as before, we pollute the training images with a black rectangular block covering 10%, 20%, 30% and 40% of each image at a random position. In this setting, we use the AUC and G-mean, which are both designed for unbalanced data, to measure the performance of all methods. The results on the original and noisy databases are shown in Figs. 4 and 5. From these figures, we see that the proposed 2DBLDA has the highest AUC and G-mean on all databases. Although the performance of all algorithms decreases as the noise area grows, when the block percentage increases, 2DBLDA and L1-2DLDA are less affected by noise, while the performance of the other methods decreases dramatically, and the proposed 2DBLDA remains the best. This result is consistent with the formulation of 2DBLDA, whose weighted between-class distance information and adaptive weighting constant between the between-class and within-class distances contribute to its good performance on unbalanced problems. The result also shows that, compared to the other methods, our 2DBLDA is more adaptive and robust to different data.
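For reference, the G-mean used in this unbalanced setting is the geometric mean of sensitivity and specificity; a generic sketch of this standard definition is given below (the AUC can be computed from decision scores with any standard routine). This is not code from the paper.

```python
import numpy as np

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for a binary problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = (y_true == positive)
    neg = ~pos
    sensitivity = float(np.mean(y_pred[pos] == positive)) if pos.any() else 0.0
    specificity = float(np.mean(y_pred[neg] != positive)) if neg.any() else 0.0
    return float(np.sqrt(sensitivity * specificity))
```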

Fig. 4 AUC of all methods on different databases

Fig. 5 G-mean of all methods on different databases

4.2 Face Reconstruction

In this part, the proposed 2DBLDA and the other methods are applied to face reconstruction on the Indian female database. This database contains 242 face images of 22 female individuals, with 11 different images per individual. The original images are resized to 32×32 pixels.

We first describe face image reconstruction. For a given image \(\textbf {X}\in \mathbb {R}^{d_{1}\times d_{2}}\), suppose we have obtained a projection matrix \(\textbf {W}=\left (\textbf {w}_{1},\textbf {w}_{2},\ldots ,\textbf {w}_{r}\right )\in \mathbb {R}^{d_{1}\times r}\), r ≤ d1. Then X is projected into the r × d2-dimensional space by \(\widetilde {\textbf {X}}=\textbf {W}^{T}\textbf {X}\). Since w1,w2,…,wr are orthonormal, the reconstructed image of X is obtained as \(\widehat {\textbf {X}}=\textbf {W}\widetilde {\textbf {X}}=\textbf {W}\textbf {W}^{T}\textbf {X}\). To measure the reconstruction performance, we use the average reconstruction error (ARE), defined as

$$ \bar{e}_{r}=\frac{1}{N}\sum\limits_{i=1}^{N}||\textbf{X}_{i}-\textbf{W}\textbf{W}^{T}\textbf{X}_{i}||_{F}, $$
(17)

where r = 1,2,…,d1.
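A compact sketch of the reconstruction and of the ARE in (17), under the same (N, d1, d2) data layout assumed in the earlier sketches, is as follows.

```python
import numpy as np

def average_reconstruction_error(W, X):
    """ARE of (17): mean Frobenius distance between each image and its
    reconstruction W W^T X, where W has r orthonormal columns."""
    P = W @ W.T                                    # d1 x d1 projector onto span(W)
    errors = [np.linalg.norm(Xi - P @ Xi, 'fro') for Xi in X]
    return float(np.mean(errors))
```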

We first experiment on the original data and compute the ARE for each method. The variation of the ARE with the dimension is shown in Fig. 6a. From the figure, we see that when the dimension is less than 15, our 2DBLDA performs best, especially when the dimension is greater than 5. When the dimension is greater than 15, 2DPCA is comparable to or slightly better than our 2DBLDA, but both methods have almost reached steady performance. This shows that 2DBLDA can achieve good performance at low dimensions. The other three methods obviously perform worse than 2DBLDA and 2DPCA at all dimensions. For r = 15, we show the reconstructed face images of 7 randomly chosen individuals in Fig. 6b. We can visually confirm that 2DBLDA and 2DPCA have the best reconstruction performance.

Fig. 6 Face reconstruction results on different databases

To further evaluate the effectiveness of the proposed 2DBLDA, we add two types of noise to the data. The first is Gaussian noise with mean 0 and variance 0.05 covering 30% of the area of each image. The ARE of each method under different dimensions is plotted in Fig. 6c. On the Gaussian noise data, our 2DBLDA outperforms the other methods at almost all reduced dimensions, and 2DPCA is comparable to 2DBLDA only when the dimension is greater than 27, indicating that the proposed 2DBLDA achieves fairly good performance with only a small number of reduced dimensions. We then add the second type of noise, dummy noise: images of the same size as the original images, generated from the discrete uniform distribution on [0,1]. An additional 100 dummy images are added to the whole database. After the projection matrix is obtained on these polluted data, it is used to reconstruct the face images. The result in Fig. 6e shows that our 2DBLDA has the lowest ARE at all dimensions, and when the dimension is greater than 20, its ARE is rather low. The reconstructed face images for r = 15 shown in Fig. 6f also support this conclusion.

5 Discussion

To further clarify the contribution of our method, we discuss the differences between the proposed 2DBLDA and its two closely related methods, RLp2DLDA and L2BLDA, and give a detailed analysis of the experimental results.

5.1 Relationship between RLp2DLDA, L2BLDA and 2DBLDA

  (i) Difference from RLp2DLDA: The formulation of 2DBLDA differs from that of any existing 2D linear discriminant analysis method, and the 2DBLDA criterion is derived, within the theoretical framework of Bhattacharyya error bound optimality, by minimizing an upper bound of the Bhattacharyya error. Although robust bilateral Lp-norm two-dimensional linear discriminant analysis (RLp2DLDA) is also derived from an upper bound of the Bhattacharyya error, the two methods have different formulations because their error bounds are different. In fact, the bound of 2DBLDA may be tighter than that of RLp2DLDA, which can be seen from two aspects. First, when deriving its bound, RLp2DLDA drops the term \(\sqrt {P_{i}P_{j}}\) and replaces it by 1, which clearly enlarges the upper bound. In contrast, our 2DBLDA keeps this term and fully exploits the weighting information it carries, which leads to one of the good properties of 2DBLDA, namely robustness. Second, RLp2DLDA further enlarges its upper bound by using the Lp-norm (0 < p < 1) rather than the L2-norm. This results in two advantages of 2DBLDA over RLp2DLDA: 2DBLDA obtains a meaningful weighting constant that needs no tuning, and 2DBLDA solves its optimization problem through a standard eigenvalue problem, while RLp2DLDA relies on an iterative technique whose convergence is not proved.

  (ii) Difference from L2BLDA: Compared with the vector-based L2BLDA, the proposed 2DBLDA is a matrix-based dimensionality reduction method. Although 2DBLDA can be viewed as a generalization of L2BLDA, the generalization is not direct from the viewpoint of deriving the upper bound: the derivation of the Bhattacharyya error bound of 2DBLDA is not exactly the same as that of L2BLDA. In addition, 2DBLDA deals with matrix inputs directly without vectorizing them first, which improves computational efficiency, especially when computing the scatter matrices.

5.2 Experimental results summary

  (i) To study the performance of 2DBLDA, we report the variation of accuracy on different databases and under different noise levels. The CPU time of 2DBLDA is also reported in Tables 1 and 2. The experimental results show that 2DBLDA runs fast and improves the robustness of 2DLDA.

  (ii) To compare the behaviour of 2DBLDA and related methods under different reduced dimensions, we plot the accuracy against the reduced dimension in Figs. 2 and 3. The results demonstrate that, compared to the other methods, the proposed 2DBLDA obtains better classification results under its optimal reduced dimension.

  (iii) To see the applicability of 2DBLDA to unbalanced classes, we experiment on the three original image databases and their versions with different levels of noise. From the results in Figs. 4 and 5, we see that the proposed 2DBLDA has the best performance among the compared methods.

  (iv) To observe the behaviour of the proposed method visually, we reconstruct face images with the obtained projection matrix. The original and the polluted Indian female databases are used for face reconstruction. By choosing an appropriate, not necessarily large, reduced dimension, the proposed 2DBLDA obtains good face reconstruction performance.

6 Conclusion

This paper proposed a novel two-dimensional linear discriminant analysis via Bhattacharyya upper bound optimality (2DBLDA). Different from the existing 2DLDA, optimizing the criterion of 2DBLDA was equivalent to optimizing the upper bound of the Bhattacharyya error, leading to maximizing a weighted between-class distance and minimizing the within-class distance, where these two distances were weighted by a meaningful adaptive constant that can be computed directly from the involved data. The 2DBLDA had no parameters to be tuned and could be effectively solved by a standard eigenvalue decomposition problem. Experimental results on image recognition and face image reconstruction demonstrated the superiority of the proposed method. Our MATLAB code can be downloaded from http://www.optimal-group.org/Resources/Code/2DBLDA.html.

However, a drawback of 2DBLDA is that its classification performance degrades when the class distributions of the samples are inconsistent. A TAISL technique could be used to handle this issue [38]. Since sparse learning can give the data better interpretability after dimensionality reduction [20], considering a sparse model is another direction for future study. Finally, applying our algorithm to track fault detection is also worth studying [39, 40].