Introduction

Linear discriminant analysis (LDA) is a method from multivariate statistics that finds a linear projection of high-dimensional observations onto a lower-dimensional space [10], yielding optimal decision boundaries in the resulting subspace. LDA is an efficient way to separate features on the basis of class information, but because it requires a matrix inversion it becomes problematic when the dimensionality is very high compared to the number of available training samples. A common workaround is to discard the eigenvectors corresponding to zero eigenvalues so that the within-class scatter matrix remains non-singular. Sharma et al. [29] proposed an improved regularized LDA in which a perturbation term \(\alpha\) is added to the diagonal elements of the within-class scatter matrix to make it non-singular and invertible. However, the eigenvectors corresponding to zero eigenvalues also carry important class-discriminatory information, as reported in [6, 17, 19, 27]. We therefore aim to utilize both the principal and the nullspace eigenvalues and to extend the beneficial properties of the proposed regularized Fisher method (low intra-class variability, high total-class variability, optimal decision boundaries). This is done by reformulating its objective to learn linearly separable representations with a deep neural network (DNN) for both binary and multi-class problems.

LDA is widely used as a supervised dimensionality reduction method in computer vision and pattern recognition. A recent generalization to non-Euclidean Grassmann manifolds can be found in [33]; it imposes the highest possible variance among classes by maximizing the between-class distances while minimizing the within-class scatter. Recently, deep learning combined with various multivariate statistical methods has achieved great success [12]. Andrew et al. [4] introduced deep canonical correlation analysis (DCCA), which can be viewed as a non-linear extension of CCA. In their evaluations on the MNIST handwritten digits and on simultaneous recordings of articulatory and acoustic data, they argued that DCCA learns representations with significantly higher correlation than those learned by CCA and kernel (non-linear) CCA. Ghassabeh et al. [13] present new adaptive algorithms for online feature extraction using principal component analysis (PCA) and LDA for classification purposes. Al-Waisy et al. [2] merge the advantages of local handcrafted feature descriptors with deep belief networks for face recognition in unconstrained conditions and obtain improved performance.

PCANet, proposed by Chan et al. [5], cascades PCA, binary hashing and block histogram computations, and can be seen as an unsupervised convolutional deep learning approach. Due to computational complexity these multi-stage filter banks are limited to two stages, although they can in principle be extended to any number. The authors also experimented with two variants, RandNet and LDANet, which share the same methodology as PCANet but whose cascaded filters are either selected randomly (RandNet) or learned with LDA (LDANet). Lifkooee et al. [24] combine a regular deep convolutional neural network with a Laplacian of Gaussian (LoG) filter placed right before the fully connected layer, and show that the resulting feature descriptor further improves the performance of deep learning.

Stuhlsatz et al. [31] initially proposed the idea of combining LDA with neural networks. In their approach, a stack of restricted Boltzmann machines is pre-trained and the pre-trained model is fine-tuned with respect to a linear discriminant criterion. LDA has the disadvantage that it overemphasizes large distances at the cost of confusing neighbouring classes; to tackle this problem, they introduced a heuristic weighting scheme for computing the within-class scatter matrix required for LDA optimization. The LDA-based objective function proposed by Dorfer et al. [9] is a non-linear extension of classic LDA, where the objective is derived from the general LDA eigenvalue problem while still allowing the CNN architecture to be trained with stochastic gradient descent and back-propagation.

In this paper, we propose a modified LDA-based objective function that utilizes both the principal and the nullspace eigenvalues of the latent space representation, for both multi-class and binary classification problems. Extensive experimental results on multiple computer vision classification tasks illustrate the advantages of our proposed approach over other popular methods. Below, we describe the proposed method in detail.

Proposed Approach

The approaches mentioned so far are based on multivariate statistics. In our work, we propose to train a CNN architecture in an end-to-end fashion with a new objective function that enables the network to maximize the total variation while minimizing the within-class variation.

Deep learning has become the state of the art for many image-based applications such as classification, object recognition, segmentation, image captioning and natural language processing [14, 26]. A mathematical model of the convolutional neural network (CNN), addressing fundamental questions about its structure, is given by Kuo et al. [22]. There are many variations of deep convolutional neural networks for various vision tasks. The intuition behind our approach is to use the proposed regularized Fisher method as the objective function on top of a powerful feature learning model; the parameters are optimized by back-propagating the error of the proposed objective through the entire network. One of our objectives in this work is to come up with a CNN architecture that can be generically applied to many computer vision classification tasks. For experimental evaluation, we tested the proposed objective function on benchmark databases such as MNIST (handwritten digit recognition), CIFAR-10 (natural image classification) and ISBI (skin cancer detection into melanoma and non-melanoma cases) to show that it is effective for both multi-class and binary classification problems.

Deep Regularized Discriminative Network over simple ConvNet

Deep learning networks differ from simple single-hidden-layer neural networks by their depth. They effectively learn features automatically without human intervention, unlike most traditional machine-learning algorithms. A neural network with P hidden layers is represented as a non-linear function \(f(\Theta )\), where \(\Theta =\{\Theta _{1},\ldots ,\Theta _{P}\}\). In supervised learning with N samples, we have training data \(x=\{x_{1},\ldots,x_{N}\}\) and labels \(y=\{y_{1},\ldots ,y_{N}\}\) with \(y_{i}\in \{1,\ldots ,C\}\), where C is the number of classes. In the last layer, a softmax classifier gives the normalized probability that a sample belongs to a particular class. The output is \(o_{i}=\{o_{i1},\ldots ,o_{iC}\}=f(x_{i}, \Theta )\). The network is optimized using stochastic gradient descent or another optimizer such as Adam, with the goal of finding optimal model parameters \(\Theta\) by minimizing the objective function \(l_{i}(\Theta )\):

$$\begin{aligned} \Theta =\underset{\Theta }{{\text {argmin}}}\frac{1}{N}\sum _{i}^{N}l_{i}(\Theta ), \end{aligned}$$
(1)

where \(l_{i}(\Theta )=l(f(x_{i},\Theta ),y_{i})\). For categorical cross entropy (CCE), the loss function is defined as

$$\begin{aligned} l_{i}(\Theta )=-\sum _{j=1}^{C}y_{i,j}\log (p_{i,j}), \end{aligned}$$
(2)

where \(p_{i,j}\) is the network output probability for class j and \(y_{i,j}\) is 1 if observation \(x_{i}\) belongs to class j (i.e., \(j = y_{i}\)) and 0 otherwise. Figure 1 shows the deep regularized network, whose objective differs from CCE in that it maximizes the eigenvalues of the total scatter matrix while minimizing the eigenvalues of the within-class scatter matrix. A detailed description of the proposed objective function and the related analysis is given in the following subsections.
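For concreteness, the following minimal NumPy sketch evaluates the CCE loss of (2) on a toy batch; the array shapes and probability values are purely illustrative and not taken from our experiments.

```python
# Minimal NumPy illustration of the categorical cross-entropy in Eq. (2).
import numpy as np

def categorical_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    """y_true_onehot, p_pred: arrays of shape (N, C)."""
    p = np.clip(p_pred, eps, 1.0)                 # avoid log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

# Example: 3 samples, C = 4 classes
y_true = np.eye(4)[[0, 2, 3]]                     # one-hot targets
p_pred = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.5, 0.1],
                   [0.1, 0.1, 0.1, 0.7]])
print(categorical_cross_entropy(y_true, p_pred))
```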

Fig. 1

Schematic sketch of the deep regularized discriminative network, which learns linear separability in the latent representation. The objective is to maximize the eigenvalues so that class separability also increases

Proposed Objective Function

Linear discriminant analysis seeks the axes that maximize the between-class scatter matrix \(S_{\text {b}}\) while minimizing the within-class scatter matrix \(S_{\text {w}}\) in a projective subspace \(A \in \mathbb {R}^{l\times d}\). This is a lower-dimensional subspace with \(l=C-1\), where C is the number of classes. The projected observations \(x_{i}A^{T}\) are maximally separated in this space [10]. The Fisher criterion is defined as the ratio of between-class and within-class variances, given by

$$\begin{aligned} J(W)=\frac{|W^{T}S_{\text {b}}W |}{|W^{T}S_{\text {w}}W |}. \end{aligned}$$
(3)

Here, W is the weight matrix. The per-class scatter matrix \(S_{\text {c}}\) and the within-class scatter matrix \(S_{\text {w}}\) are computed as

$$\begin{aligned} S_{\text {c}}= & {} \frac{1}{(N_{\text {c}}-1)}\bar{X_{\text {c}}}^{T}\bar{X_{\text {c}}}, \end{aligned}$$
(4)
$$\begin{aligned} S_{\text {w}}= & {} \frac{1}{C}\sum _{c=1}^{C} S_{\text {c}}. \end{aligned}$$
(5)

The total scatter matrix is computed using

$$\begin{aligned} S_{\text {t}}=\frac{1}{N-1}\bar{X}^{T}\bar{X}, \end{aligned}$$
(6)

where X is the input data matrix (in our case, the output of the CNN model), \(N_{\text {c}}\) is the number of samples in class c, and N is the total number of samples. \(\bar{X_{\text {c}}}=X_{\text {c}}-m_{\text {c}}\), where \(m_{\text {c}}\) is the mean of class c, and \(\bar{X}=X-m\), where m is the overall mean of the samples. The predicted output of the CNN model (\(\texttt {y\_pred}\)) is used as X in the computation of \(S_{\text {w}}\) in (5). To extract discriminative features, we first perform an eigendecomposition of the within-class scatter matrix \(S_{\text {w}}\), given by

$$\begin{aligned} S_{\text {w}}=\Phi \Lambda \Phi ^{T}. \end{aligned}$$
(7)

Here, \(\Phi\) contains the eigenvectors and \(\Lambda\) the eigenvalues of \(S_{\text {w}}\). The eigenvectors are sorted by their eigenvalues in descending order. The matrix \(\Phi\) is then split into \(W_1\) and \(W_2\): \(W_1\) contains the eigenvectors whose eigenvalues are greater than a minimum variance (set to \(10^{-2}\) in our experiments), and \(W_2\) contains the eigenvectors whose eigenvalues fall below this minimum variance. The columns of \(W_1\) are divided by the square root of the corresponding eigenvalues \(\Lambda _{1}\), and the columns of \(W_2\) are divided by the square root of the minimum eigenvalue \(\Lambda _{\text {min}}\). The two matrices are concatenated to form \(\Psi\), as shown in (8), which is multiplied with \(\texttt {y\_pred}\) to form the model output y:

$$\begin{aligned} \Psi= & {} [W_{1}\Lambda _{1}^{-\frac{1}{2}} \quad W_{2}\Lambda _{\text {min}}^{-\frac{1}{2}}], \end{aligned}$$
(8)
$$\begin{aligned} y= & {} \Psi ^{T}{} \texttt {y\_pred}. \end{aligned}$$
(9)
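A minimal NumPy sketch of the steps in (4)–(9) is given below, assuming y_pred is an \(N\times d\) array of network outputs and labels is the corresponding label vector; the scaling of the \(W_2\) block by the smallest retained eigenvalue is one possible reading of the description above, and the minimum variance of \(10^{-2}\) follows the text. Everything else (function name, shapes) is illustrative.

```python
# Sketch of Eqs. (4)-(9): per-class scatter, within-class scatter,
# eigendecomposition of S_w, and the regularized whitening transform Psi.
import numpy as np

def regularized_whitening(y_pred, labels, min_var=1e-2):
    classes = np.unique(labels)
    d = y_pred.shape[1]

    # Within-class scatter, Eqs. (4)-(5)
    S_w = np.zeros((d, d))
    for c in classes:
        Xc = y_pred[labels == c]
        Xc_bar = Xc - Xc.mean(axis=0)              # centre within the class
        S_w += Xc_bar.T @ Xc_bar / (len(Xc) - 1)
    S_w /= len(classes)

    # Eigendecomposition of S_w, Eq. (7); sort eigenvalues in descending order
    evals, evecs = np.linalg.eigh(S_w)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Split into principal (W1) and null-space (W2) directions, Eq. (8)
    keep = evals > min_var
    W1 = evecs[:, keep] / np.sqrt(evals[keep])         # scale by own eigenvalues
    W2 = evecs[:, ~keep] / np.sqrt(evals[keep].min())  # scale by smallest retained eigenvalue (assumption)
    Psi = np.hstack([W1, W2])

    # Project the network output, Eq. (9) (row-vector form)
    return y_pred @ Psi
```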

We then compute the total scatter matrix \(S_{\text {t}}\) of y using (6). After computing this covariance matrix, the projection matrix \(\Omega\) is obtained by eigendecomposition of \(S_{\text {t}}\), selecting the eigenvectors in \(\Phi _{\text {w}y}\) corresponding to the most significant eigenvalues \(\Lambda _{\text {w}y}\). The eigendecomposition of \(S_{\text {t}}\) is given by

$$\begin{aligned} S_{\text {t}}=\Phi _{\text {w}y}\Lambda _{\text {w}y}\Phi _{\text {w}y}^{T}. \end{aligned}$$
(10)

Using the eigenvalues of the \(S_{\text {t}}\) matrix, we formulate the objective as

$$\begin{aligned} \underset{\Theta }{{\text {argmax}}}\,\frac{1}{C-1}\sum _{i=1}^{C-1}\Lambda _{\text {w}y,i}. \end{aligned}$$
(11)

Combining this objective with a deep neural network maximizes the individual eigenvalues of \(S_{\text {t}}\) and minimizes the eigenvalues of \(S_{\text {w}}\). In particular, we expect that maximizing (minimizing) the eigenvalues of \(S_{\text {t}}\) (\(S_{\text {w}}\)) increases (decreases) the separation along the respective eigenvector directions, thus achieving the target of minimizing the within-class variation while maximizing the total variation. A deep neural network trained with categorical cross entropy (CCE) or binary cross entropy does not take this aspect of discriminatory power into account; the main objective of CCE is to maximize the likelihood of the class labels according to the target labels.

Here the objective function is designed to consider only the k eigenvalues that do not exceed a certain threshold for variance maximization:

$$\begin{aligned}&\underset{\Theta }{{\text {argmax}}}\frac{1}{k}\sum _{i=1}^{k}v_{i}\quad \text {with}\; \{v_{1},\ldots ,v_{k}\}\nonumber \\&\quad = \{v_{j}\,\vert \, v_{j}< \min \{v_{1},\ldots ,v_{n-1}\}+\epsilon \}, \end{aligned}$$
(12)

where, for notational simplicity, we write \(\Lambda _{\text {w}y}\) as v, and n is the rank of the covariance matrix, which is one less than the number of samples. This formulation of the objective allows the deep network to be trained with back-propagation in an end-to-end fashion. It is similar to classic LDA but lifts the constraint that arises for binary classification, where the number of classes C is 2 and the dimension of the projection obtained with classic LDA is \(l=C-1=2-1=1\). The proposed objective function can therefore be used for both multi-class and binary classification problems.
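The following NumPy sketch illustrates the computation behind (10)–(12) on the transformed output y; the value of \(\epsilon\) is an illustrative assumption, and in actual training this quantity is built from the differentiable operations of the deep learning framework so that its negative can be back-propagated as the loss.

```python
# Sketch of the objective in Eqs. (10)-(12): eigenvalues of the total scatter
# matrix of the transformed output y, restricted to those within eps of the
# smallest one; the mean of these eigenvalues is the quantity to maximize.
import numpy as np

def eigenvalue_objective(y, eps=1.0):
    y_bar = y - y.mean(axis=0)
    S_t = y_bar.T @ y_bar / (len(y) - 1)      # total scatter of the output, Eq. (6)
    evals = np.linalg.eigvalsh(S_t)           # eigenvalues of S_t, Eq. (10), ascending
    v = evals[evals < evals.min() + eps]      # keep the k smallest eigenvalues, Eq. (12)
    return v.mean()                           # maximized over the network parameters

# In training, the loss would be, e.g., loss = -eigenvalue_objective(y)
```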

Experimental Results

One of the key objectives of our work is to propose a CNN architecture that can be generically applied to many vision tasks. For our experimental evaluation we considered four publicly available databases, namely MNIST (handwritten digit recognition), CIFAR-10 (natural image classification), ISBI 2016 (skin cancer classification) and ISBI 2017 (skin cancer classification). We compare our results with various other relevant approaches for vision classification.

Databases

  • MNIST [23]: The MNIST handwritten digits database consists of 60,000 training examples and 10,000 test examples. The images are size-normalized and centered gray-scale images of size \(28\times 28\). The database is freely available to the public under a Creative Commons Attribution-Share Alike 3.0 license.

  • CIFAR-10 [21]: The CIFAR-10 database, freely available under the MIT license and used for object recognition, is an established computer-vision database consisting of 60,000 colour images of size \(32\times 32\) in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images.

  • ISBI 2016 [16]: The ISIC archive contains a training database of 900 dermoscopic lesion images and a testing database of 379 images in JPEG format, obtained under CC0 licensing. The images have been collected from leading clinical centers internationally and were acquired with the various devices used at each center. They contain both natural (skin hairs, veins) and man-made artifacts, which make them difficult to classify without pre-processing.

  • ISBI 2017 [7]: The International Skin Imaging Collaboration (ISIC) is an international effort to improve melanoma diagnosis. The 2017 challenge database contains more images than the 2016 edition and includes seborrheic keratosis, a benign skin tumor derived from keratinocytes (non-melanocytic), along with benign nevus (melanocytic) and melanoma (melanocytic). The training data consist of 2000 images (374 melanoma, 254 seborrheic keratosis and 1372 benign nevus) and the testing data consist of 600 images (117 melanoma), all obtained under CC0 licensing. This is the largest among the state-of-the-art melanoma databases.

Experimental Setup

The general structure of the CNN model is based on the VGG model with \(3\times 3\) convolutions [30]. We experimented both with and without a BatchNormalization layer after each convolutional layer [18]; this layer helps to increase the convergence speed as well as the performance of the model. ReLU is used as the non-linearity, since it greatly accelerates the convergence of stochastic gradient descent (or any other optimizer) compared to sigmoid/tanh functions [20]. All networks are trained with the Adam optimizer, with the learning rate halved every 200 epochs. The batch size is 1000 for MNIST and CIFAR-10, and 400 for ISBI 2016 and ISBI 2017, since the training data are quite small for the ISBI databases.
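A hedged Keras sketch of this kind of setup is shown below; the layer widths, number of blocks and initial learning rate are illustrative choices and do not reproduce the exact specifications of Table 1.

```python
# Sketch of a VGG-style backbone with 3x3 convolutions, BatchNormalization,
# ReLU, and Adam with the learning rate halved every 200 epochs.
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return x

inputs = tf.keras.Input(shape=(32, 32, 3))   # CIFAR-10-sized input (illustrative)
x = conv_block(inputs, 64)
x = conv_block(x, 64)
x = layers.MaxPooling2D()(x)
x = conv_block(x, 128)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10)(x)                # latent representation fed to the proposed objective
model = models.Model(inputs, outputs)

# Halve the learning rate every 200 epochs
def schedule(epoch, lr):
    return lr * 0.5 if epoch > 0 and epoch % 200 == 0 else lr

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)   # initial LR is an assumption
lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)
```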

Related methods show that mini-batch learning of distribution parameters (in this case covariance matrices) is feasible if the batch size is sufficiently large to be representative of the entire population [32]. Although a large batch size is required for stable estimates, it is limited by data availability, image size and the memory available on the GPU. Table 1 gives detailed CNN model specifications for the CIFAR-10 and MNIST databases. The total number of trainable parameters is 5,752,414 for the CIFAR-10 model and 467,486 for the MNIST model. In all our experiments, the proposed method is validated against existing methods using the same datasets and protocols. The experiments were run on a system with an Intel Core i7 processor, 16 GB RAM, and an NVIDIA GeForce GTX-1050Ti GPU.

Table 1 Our proposed CNN model specifications for CIFAR-10 and MNIST databases

Results and Discussion

MNIST

The MNIST database consists of \(28\times 28\) gray-scale images with labels 0 to 9. It contains 60,000 samples, of which 50,000 are used for training and 10,000 for validation; the test set consists of 10,000 images, following the same protocol as [9]. Since the proposed method requires a large batch size, we used a batch size of 1000 for MNIST. We used the Adam optimizer, with the initial learning rate halved every 200 epochs. For the final classification, we used a linear support vector machine (SVM) classifier.
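A minimal scikit-learn sketch of this final stage is given below, assuming model is the trained feature-extraction network (with the appropriate input shape) and x_train, y_train, x_test, y_test hold the MNIST splits described above; these names are assumptions for illustration.

```python
# Final classification stage: the trained network acts as a feature extractor
# and a linear SVM is fitted on the training features.
from sklearn.svm import LinearSVC

train_features = model.predict(x_train)    # latent representations from the network
test_features = model.predict(x_test)

svm = LinearSVC()
svm.fit(train_features, y_train)
test_error = 1.0 - svm.score(test_features, y_test)
print(f"Test error: {100 * test_error:.2f}%")
```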

Table 2 Comparison of test errors (%) on MNIST database using our proposed approach and other relevant methodologies
Fig. 2

a Evolution of the mean eigenvalues of \(S_{\text {t}}\) with respect to the epoch number; b minimization of the within-class scatter matrix \(S_{\text {w}}\) with respect to the epoch, on the MNIST database

Table 2 compares our proposed approach with various relevant methods on the MNIST database. The results show that the proposed method with the new cost function is second best and comparable with the other state-of-the-art reported performances. It is therefore evident that incorporating the latent space representation into the cost function, by maximizing the between-class and minimizing the within-class eigenvalue representation, efficiently learns the features required for classification. The network is thus trained without a softmax classifier, and the final classification on the test data is performed with a linear SVM.

Figure 2a shows the evolution of the mean eigenvalues of the total scatter matrix over the training epochs. Figure 2b shows the eigenvalues of the within-class scatter matrix over the epochs, which initially increase but later decrease, thus achieving our objective of minimizing the within-class variation while maximizing the total variation among the classes, as shown in Fig. 2a, b.

Table 3 Comparison of test accuracy (%) on CIFAR-10 database using our proposed approach and other relevant methodologies

CIFAR-10

The CIFAR-10 database consists of \(32\times 32\) colour images from 10 different classes, with 50,000 training samples and 10,000 test samples, following the same protocol as [9]. We normalize the pixel values between 0 and 1. Table 1 describes the network structure; as for MNIST, the initial learning rate is halved every 200 epochs. Table 3 compares the proposed approach with various relevant methods on this database. Our methodology achieves the second best accuracy on this natural image classification task.

ISBI 2016 and ISBI 2017

To show the efficacy of the proposed objective function, we conducted experiments on both multi-class (MNIST and CIFAR-10) and binary classification databases (ISBI 2016 and 2017). The ISBI databases consist of dermoscopic lesion images for distinguishing melanoma from non-melanoma cases. The ISBI 2016 database contains 900 training and 379 test images; it is unbalanced, with 727 benign and 173 melanoma images. Similarly, the ISBI 2017 database contains 2000 training and 600 test samples. As stated by Wang et al. [32], mini-batch learning with covariance estimates requires a batch size large enough to represent the entire population. To overcome the batch size problem caused by the limited ISBI training and testing data, the large image size (\(224 \times 224\)) and the limited GPU memory, we first fine-tuned a pretrained ResNet-50 model (25,636,712 parameters) and then extracted features from its last convolutional layer. These features, arranged as a two-dimensional array of samples \(\times\) feature dimensions, were used as inputs to train an MLP (multi-layer perceptron) of fully connected layers. The fully connected layers trained with the proposed objective function can be represented as

$$\begin{aligned} \begin{aligned} \Theta _{\text {MLP}}&={\text {Input}}(900,2048) \rightarrow \\&{\text {Dense}}(2048)-{\text {Sigmoid}}-l2\, {\text {regularizer}}\rightarrow \\&{\text {Dense}}(1024)-{\text {Sigmoid}}-l2\, {\text {regularizer}}\rightarrow \\&{\text {Dense}}(1024)-{\text {Sigmoid}}-l2\, {\text {regularizer}}\rightarrow \\&{\text {Dense}}(100)-{\text {Sigmoid}}-l2\, {\text {regularizer}}. \end{aligned} \end{aligned}$$
(13)
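A hedged Keras sketch of this pipeline is shown below; the ImageNet weights, the global-average pooling of the ResNet-50 output and the l2 regularization coefficient are illustrative assumptions, and the fine-tuning step itself is omitted.

```python
# Sketch of the ISBI pipeline: features from a pretrained ResNet-50, then an
# MLP following Eq. (13) trained with the proposed objective.
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import ResNet50

# Feature extractor: pooled output of the last convolutional block (2048-dim)
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                    input_shape=(224, 224, 3))
# features = backbone.predict(images)    # -> array of shape (n_samples, 2048)

# MLP of Eq. (13): sigmoid activations with l2 regularization on each Dense layer
l2 = regularizers.l2(1e-4)               # coefficient is an assumption
mlp = models.Sequential([
    layers.Dense(2048, activation="sigmoid", input_shape=(2048,), kernel_regularizer=l2),
    layers.Dense(1024, activation="sigmoid", kernel_regularizer=l2),
    layers.Dense(1024, activation="sigmoid", kernel_regularizer=l2),
    layers.Dense(100,  activation="sigmoid", kernel_regularizer=l2),
])
```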

The \({\text {Sigmoid}}\) activation function is often favoured for shallow networks. We also experimented with ReLU and \(\tanh\), but observed no significant improvement. The activation function adds non-linearity to the nodes of the network. For deeper networks, ReLU is usually the best choice since it increases the convergence rate; its disadvantage is that ReLU units can be fragile during training and can die irreversibly [1]. The following performance criteria are used to compare the proposed approach with existing methodologies (a small computation sketch is given after the list):

  • Accuracy: The ratio of correct predictions to the total number of predictions, formulated as

    $$\begin{aligned} ACC= \frac{\text {TP}+{\text {TN}}}{{\text {TP}}+{\text {FP}}+{\text {TN}}+{\text {FN}}}, \end{aligned}$$
    (14)

    where \({\text {TP}}\) is the True Positive, \({\text {TN}}\) the True Negative, \({\text {FP}}\) the False Positive and \({\text {FN}}\) the False Negative count.

  • Sensitivity: The ability of the algorithm to correctly predict the diseased cases (i.e., malignant):

    $$\begin{aligned} SE=\frac{{\text {TP}}}{{\text {TP}}+{\text {FN}}}. \end{aligned}$$
    (15)
  • Specificity: It is the ability of the algorithm to correctly predict the non-diseased cases (i.e., benign):

    $$\begin{aligned} SP= \frac{{\text {TN}}}{{\text {TN}}+{\text {FP}}}. \end{aligned}$$
    (16)
  • AUC: Area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.

  • Average precision: Average precision (AP) is the area under the precision-recall curve. The detailed explanation can be found in [16].
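As referenced above, the following scikit-learn sketch computes these criteria for a binary (melanoma vs. non-melanoma) problem; y_true, y_score (the predicted melanoma probability) and the 0.5 decision threshold are illustrative assumptions.

```python
# Evaluation metrics of Eqs. (14)-(16) plus AUC and average precision.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, confusion_matrix

def evaluate(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":      (tp + tn) / (tp + fp + tn + fn),   # Eq. (14)
        "sensitivity":   tp / (tp + fn),                     # Eq. (15)
        "specificity":   tn / (tn + fp),                     # Eq. (16)
        "auc":           roc_auc_score(y_true, y_score),
        "avg_precision": average_precision_score(y_true, y_score),
    }
```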

Fig. 3

Loss with respect to the number of epochs during training: a loss vs. epochs on the ISBI 2016 database (training data only, as a validation set is unavailable); b loss vs. epochs on the ISBI 2017 database (for both training and validation sets)

The DeepLDA approach uses traditional LDA, where at most the number of classes minus one principal eigenvalues can be obtained, which for this database is \(2-1=1\); there would thus be only one eigenvalue to maximize in order to obtain maximum inter-class separation and minimum within-class separation. In our approach, we use the variance information of the total scatter matrix to find the optimal projection over all training samples, which allows us to select up to \(n-1\) eigenvalues, where n is the total number of training samples. The model loss with respect to the training epochs is shown in Fig. 3a for ISBI 2016 and Fig. 3b for ISBI 2017. The plots show that in both cases the loss decreases steadily as the number of epochs increases and finally converges.

Tables 4 and 5 compare this approach with existing methods on the ISBI 2016 and 2017 databases, respectively. The results obtained on these databases do not exceed the best accuracy reported so far, but they demonstrate a new approach that inherits class separability in the deep neural network by changing the objective function. We implemented the DeepLDA method [9] and evaluated it on the ISBI databases. The tables show that our proposed approach achieves the third best accuracy and AUC on the ISBI 2016 database and the second best on the ISBI 2017 database. The Fisher vector based methods [35] and [36] use 32,768 (even after dimensionality reduction with principal component analysis) and 12,800 feature dimensions, respectively, for final feature matching, which is very high compared to our approach, which uses only 2048 feature dimensions and 899 (number of samples \(-\,1\)) for the final classification. For ISBI 2017 [28], the authors used two pretrained CNN models, ResNet-101 and Inception-v4. Experimentation with such large models requires substantial computational resources, such as large-memory CUDA-compatible GPUs; their training time and complexity are large compared to our approach, which uses only 2048 features and still achieves competitive accuracy. Our method is simple and efficient, requires less computing time and complexity, and can be generically applied to many computer vision classification tasks.

Table 4 Comparison of the proposed approach with the existing state-of-the-art methodologies on ISBI 2016 database
Table 5 Comparison of the proposed approach with the existing state-of-the-art methodologies on ISBI 2017 database

Conclusions

In this paper, we have proposed an objective function that works for both binary and multi-class classification problems. The proposed loss function minimizes the within-class variance and maximizes the total variance. We evaluated our method on popular databases for various applications, namely MNIST (handwritten digit recognition) and CIFAR-10 (natural image classification), and showed that the proposed approach achieves competitive performance compared to other methods. For melanoma detection (classification of skin lesions into melanoma and non-melanoma cases), where the number of images is small, we trained a multi-layer perceptron on extracted features and achieved an accuracy of 84.9% on the ISBI 2016 and 83.3% on the ISBI 2017 database. These experimental results demonstrate the efficacy of our proposed approach on a range of computer vision classification tasks.