Deep Regularized Discriminative Network

The traditional linear discriminant analysis (LDA) approach discards the eigenvalues that are very small or equal to zero, yet the eigenvectors corresponding to zero eigenvalues are quite often important dimensions for discriminant analysis. We propose an objective function that utilizes both the principal and the nullspace eigenvalues and simultaneously inherits the class separability information in its latent space representation. The idea is to build a convolutional neural network (CNN), perform regularized discriminant analysis on top of it, and train the whole model in an end-to-end fashion. Backpropagation with a suitable optimizer updates the parameters so that the CNN minimizes the within-class variance and maximizes the total-class variance, which makes the approach suitable for both multi-class and binary classification problems. Experimental results on four databases for multiple computer vision classification tasks show the efficacy of our proposed approach as compared to other popular methods.


Introduction
Linear discriminant analysis (LDA) is a method from multivariate statistics that attempts to find a linear projection of high-dimensional observations onto a lower-dimensional space [10]. It finds the optimal decision boundaries in the resulting lower-dimensional subspace. LDA is an efficient way to separate features on the basis of class information, but since it requires a matrix inversion, it often becomes problematic when the dimension is very high compared to the number of available training samples. It therefore ignores the eigenvectors corresponding to zero eigenvalues so that the within-class scatter matrix is non-singular. Sharma et al. [29] propose an improved regularized LDA that adds a perturbation term to the diagonal elements of the within-class matrix to make it non-singular and invertible. However, the eigenvectors corresponding to zero eigenvalues also contain important class-discriminatory information, as reported in [6,17,19,27]. Thus, we aim to utilize both the principal and the nullspace eigenvalues and to extend the beneficial properties of the proposed regularized Fisher method (low intra-class variability, high total-class variability, optimal decision boundaries). This is done by reformulating its objective to learn linearly separable representations based on a deep neural network (DNN) for both binary and multi-class problems.
LDA is widely used as a supervised dimensionality reduction method in computer vision and pattern recognition. Its recent generalization to non-Euclidean Grassmann manifolds can be found in [33]. LDA aims to impose the highest possible variance among classes by maximizing the between-class distances whilst minimizing the within-class scattering. Recently, deep learning combined with various multivariate statistics methods has achieved great success [12]. Andrew et al. [4] introduced deep canonical correlation analysis (DCCA), which can be viewed as a nonlinear extension of CCA. In their evaluations, they argued that DCCA learns representations with significantly higher correlation than those learned by CCA and kernel (nonlinear) CCA. They experimented with the MNIST handwritten data and simultaneous recordings of articulatory and acoustic data. Ghassabeh et al. [13] present new adaptive algorithms for online feature extraction using principal component analysis (PCA) and LDA for classification purposes. Al-Waisy et al. [2] merge the advantages of local handcrafted feature descriptors with Deep Belief Networks for face recognition in unconstrained conditions and obtain better performance.
PCANet, proposed by Chan et al. [5], cascades PCA, binary hashing and block histogram computations. It can be seen as an unsupervised convolutional deep learning approach. Due to computational complexity, these multi-stage filter banks are limited to two stages, but they can be extended to any number. The authors also experimented with two modifications of PCANet, RandNet and LDANet. Both share the methodology of PCANet, but their cascaded filters are either selected randomly (RandNet) or learned from LDA (LDANet). Lifkooee et al. [24] combine a regular deep convolutional neural network with the Laplacian of Gaussian (LoG) filter right before the fully connected layer, and show that the proposed feature descriptor together with LoG in the CNN further improves the performance of deep learning.
Stuhlsatz et al. [31] initially proposed the idea of combining LDA with neural networks. In their approach, they pre-train a stack of restricted Boltzmann machines, and this pre-trained model is fine-tuned with respect to a linear discriminant criterion. LDA has the disadvantage that it overemphasizes large distances at the cost of confusing neighbouring classes. To tackle this problem, they introduced a heuristic weighting scheme for computing the within-class scatter matrix required for LDA optimization. The LDA-based objective function proposed by Dorfer et al. [9] is a non-linear extension of classic LDA in which the objective function is obtained from the general LDA eigenvalue problem while still allowing the CNN architecture to be trained with stochastic gradient descent and back-propagation.
In this paper, we propose to modify the LDA-based objective function so that it utilizes both the principal and the nullspace eigenvalues in its latent space representation, for both multi-class and binary classification problems. Extensive experimental results on multiple computer vision classification tasks illustrate the competitiveness of our proposed approach as compared to other popular methods. Below, we describe our proposed method in detail.

Proposed Approach
The approaches mentioned so far are based on the study of multi-variate statistics. In our work, we propose to train a CNN architecture in an end-to-end fashion with a new objective function which would enable the network to inherit the property of maximizing the total variation and minimizing the within class variation.
Deep learning has become the state of the art for many image-based applications such as classification, object recognition, segmentation, image captioning and natural language processing [14,26]. The mathematical model of the convolutional neural network (CNN) is explained by Kuo et al. [22], who address fundamental questions about the structure of convolutional neural networks. There are many variations of deep convolutional neural networks for various vision tasks. The intuition behind our approach is to use the proposed regularized Fisher method as the objective function on top of a powerful feature learning model. The parameters are optimized by backpropagating the error of the proposed objective function through the entire network. One of our objectives in this work is to come up with a CNN architecture that can be generically applied to many computer vision classification tasks. For experimental evaluation, we tested our proposed objective function on various benchmark databases, namely MNIST (handwritten digit recognition), CIFAR-10 (natural image classification) and ISBI (skin cancer detection into melanoma and non-melanoma cases), to show that the objective function is effective for both multi-class and binary classification problems.

Deep Regularized Discriminative Network over simple ConvNet
Deep learning networks differ from simple single-hidden-layer neural networks by their depth. They learn features automatically without human intervention, unlike most traditional machine-learning algorithms. A neural network with $P$ hidden layers is represented as a non-linear function $f(\Theta)$, where $\Theta = \{\Theta_1, \ldots, \Theta_P\}$. In supervised learning with $N$ samples, we have $x = \{x_1, \ldots, x_N\}$ as training data and $y = \{y_1, \ldots, y_N\}$ with $y_i \in \{1, \ldots, C\}$, where $C$ is the number of classes. In the last layer, we have softmax as the classifier, which gives the normalized probability that the data belongs to a particular class. The output is $p_i = f(x_i, \Theta)$. The network is optimized using stochastic gradient descent or any other optimizer like Adam, with the goal of finding optimal model parameters $\Theta$ by minimizing the objective function $l_i(\Theta)$:

$$\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} l_i(\Theta),$$

where $l_i(\Theta) = l(f(x_i, \Theta), y_i)$. For categorical cross entropy (CCE), the loss function is defined as

$$l_i = -\sum_{j=1}^{C} y_{i,j} \log(p_{i,j}),$$

SN Computer Science
where $p_{i,j}$ is the network output probability and $y_{i,j}$ is 1 if observation $x_i$ belongs to class $j$ (i.e., $j = y_i$) and 0 otherwise. Figure 1 shows the deep regularized network, whose objective differs from CCE in that it maximizes the eigenvalues of the total scatter matrix and minimizes the eigenvalues of the within-class scatter matrix. The following subsections give a detailed description of the proposed objective function and the related analysis.
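As a point of reference for the baseline loss, the softmax output and the categorical cross entropy described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the small constant `eps` and the toy logits are assumptions, and class labels are 0-based here rather than running from 1 to C.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean CCE over N samples; labels are integer class indices in [0, C)."""
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]  # p_{i, y_i}: the only term with y_{i,j} = 1
    return float(-np.mean(np.log(picked + eps)))

# Toy example (hypothetical values): 2 samples, 3 classes.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
probs = softmax(logits)
loss = categorical_cross_entropy(probs, labels)
```

Because the one-hot target zeroes every term except the true class, the sum over j reduces to the single probability picked out by the label.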

Proposed Objective Function
Linear discriminant analysis finds the axes that maximize the between-class scatter matrix $S_b$ while minimizing the within-class scatter matrix $S_w$ in the projective subspace $A \in \mathbb{R}^{l \times d}$. The projective subspace is lower dimensional, with $l = C - 1$, where $C$ is the number of classes. The resulting projections $x_i A^T$ are maximally separated in this space [10]. The Fisher criterion is defined as the ratio of between-class and within-class variances, given by

$$J(W) = \frac{W^T S_b W}{W^T S_w W}.$$

Here, $W$ is the weight vector. The within-class scatter matrix is computed as

$$S_w = \frac{1}{C} \sum_{c} \frac{1}{N_c - 1} \bar{X}_c^T \bar{X}_c. \tag{5}$$

The total scatter matrix is computed using

$$S_t = \frac{1}{N - 1} \bar{X}^T \bar{X}, \tag{6}$$

where $X$ is the input data matrix (in our case, the output of the CNN model), $N_c$ is the number of samples in class $c$ and $N$ is the total number of samples. $\bar{X}_c = X_c - m_c$, where $m_c$ is the mean of class $c$, and $\bar{X} = X - m$, where $m$ is the mean of all samples. The values predicted by the CNN model (y_pred) are used as $X$ for the computation of $S_w$, as in (5).
To extract discriminative features, we first perform the eigen decomposition of the within-class scatter matrix $S_w$, given by

$$S_w = \Phi \Lambda \Phi^T.$$

Here, $\Phi$ contains the eigenvectors and $\Lambda$ the eigenvalues of $S_w$. The eigenvectors are then sorted according to the eigenvalues in descending order. Matrix $\Phi$ is split into $W_1$ and $W_2$, where $W_1$ contains the eigenvectors whose eigenvalues are greater than a certain minimum variance. In our experiments, we took the minimum variance as 1e−2. $W_2$ contains the eigenvectors whose eigenvalues are less than the minimum variance. The columns of $W_1$ are divided by the square root of the corresponding eigenvalues, and the columns of $W_2$ are divided by the square root of the minimum eigenvalue. These two matrices are concatenated to form $\Psi$ as shown in (8), which is multiplied with y_pred to form the model output $y$:

$$\Psi = \left[ W_1 \Lambda_1^{-1/2}, \; W_2 \lambda_{\min}^{-1/2} \right], \tag{8}$$

$$y = \text{y\_pred} \, \Psi,$$

where $\Lambda_1$ holds the eigenvalues associated with $W_1$ and $\lambda_{\min}$ denotes the minimum eigenvalue. Then, we compute the total scatter matrix $S_t$ using (6).
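The construction of Ψ from the principal and nullspace eigenvectors of $S_w$ can be illustrated with a small NumPy sketch. Several points are assumptions rather than details taken from the text: $S_w$ is taken as the average of the per-class covariance matrices, the "minimum eigenvalue" used to scale $W_2$ is taken to be the minimum variance threshold (1e−2), and the function names and toy data are hypothetical.

```python
import numpy as np

def within_class_scatter(X, y):
    """Average of the per-class covariance matrices (assumed form of S_w)."""
    classes = np.unique(y)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        Xc_bar = Xc - Xc.mean(axis=0)            # centre each class on its own mean
        Sw += Xc_bar.T @ Xc_bar / (Xc.shape[0] - 1)
    return Sw / len(classes)

def regularized_whitening(Sw, min_var=1e-2):
    """Build Psi from both principal and nullspace eigenvectors of S_w."""
    evals, evecs = np.linalg.eigh(Sw)            # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]              # sort in descending order
    evals, evecs = evals[order], evecs[:, order]
    principal = evals > min_var                  # columns forming W_1; the rest form W_2
    # Assumption: nullspace directions are scaled by the threshold itself.
    scale = np.where(principal, evals, min_var)
    return evecs / np.sqrt(scale)                # column j divided by sqrt(scale[j])

# Toy data (hypothetical): 3 classes, 5 features, one redundant feature.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
features = rng.normal(size=(60, 5)) + 3.0 * labels[:, None]
features[:, 4] = features[:, 0]                  # duplicate column gives a zero eigenvalue
Sw = within_class_scatter(features, labels)
Psi = regularized_whitening(Sw)
whitened = features @ Psi                        # plays the role of y = y_pred * Psi
Sw_whitened = within_class_scatter(whitened, labels)
```

After the transformation, the within-class scatter is diagonal with unit variance along the principal directions, while the nullspace direction is kept (with bounded scaling) instead of being discarded.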
After computing the covariance matrix, the projection matrix $\Omega$ is obtained by eigen decomposition of $S_t$, selecting the eigenvectors in $\Phi_{wy}$ according to the most significant eigenvalues $\Lambda_{wy}$. The eigen decomposition of $S_t$ is given by

$$S_t = \Phi_{wy} \Lambda_{wy} \Phi_{wy}^T.$$

The objective is formulated using the eigenvalues of the $S_t$ matrix.

Fig. 1 Schematic sketch of the deep regularized discriminative network, which learns the linear separability property in the latent representation. Here the objective is to maximize the eigenvalues so that the class separability also increases

The objective of combining this with the deep neural net is to maximize the individual eigenvalues of $S_t$ and to minimize the eigenvalues of $S_w$. In particular, we expect that maximizing (minimizing) the eigenvalues of $S_t$ ($S_w$) leads to maximizing (minimizing) the separation in the respective eigenvector direction. Thus we achieve the target of minimizing the within-class variation and maximizing the total variation. A deep neural network with a categorical cross entropy (CCE) or binary cross entropy loss function does not take this aspect of discriminatory power into account. The main objective of CCE is to maximize the likelihood of the predicted class labels with respect to the target labels.
Here the objective function is designed to consider, for variance maximization, only the $k$ eigenvalues that do not exceed a certain threshold, where, for ease of notation, $\Lambda_{wy}$ is denoted by $v$, and the rank of the covariance matrix is at most one less than the number of samples $n$, i.e., $n - 1$. This formulation of the objective function allows deep networks to be trained with backpropagation in an end-to-end fashion. It is similar to classic LDA, but it lifts the constraint that generally occurs for binary classification, where the number of classes $C$ is 2 and the dimension of the projection with classic LDA is $l = C - 1 = 2 - 1 = 1$. The proposed objective function can therefore be used for both multi-class and binary classification problems.
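To make the eigenvalue-based objective more concrete, the following NumPy sketch averages the k smallest non-zero eigenvalues of $S_t$ that fall under a threshold. The selection rule `min(v) + eps` is an assumption, borrowed from the form used in eigenvalue-based objectives such as DeepLDA [9]; `eps`, `tol`, the function names and the toy data are hypothetical, and a trainable version would need an automatic-differentiation framework rather than plain NumPy.

```python
import numpy as np

def total_scatter(Y):
    """S_t = Y_bar^T Y_bar / (N - 1), with Y centred on the global mean."""
    Y_bar = Y - Y.mean(axis=0)
    return Y_bar.T @ Y_bar / (Y.shape[0] - 1)

def eigenvalue_objective(Y, eps=1.0, tol=1e-8):
    """Mean of the k smallest non-zero eigenvalues of S_t (to be maximized)."""
    v = np.linalg.eigvalsh(total_scatter(Y))   # ascending eigenvalues
    v = v[v > tol]                             # at most rank(S_t) <= n - 1 remain
    selected = v[v <= v.min() + eps]           # assumed threshold: min(v) + eps
    return float(selected.mean()), int(selected.size)

# Toy latent features (hypothetical) with clearly separated variances.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 4)) * np.array([1.0, 1.2, 3.0, 5.0])
value, k = eigenvalue_objective(latent)
loss = -value                                  # an optimizer would minimize this
```

Focusing on the smallest eigenvalues pushes the optimizer to raise the variance along the least separated directions rather than inflating directions that are already well spread.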

Experimental Results
One of the key objectives of our work is to propose a CNN architecture that can be generically applied to many vision tasks. For our experimental evaluation, we considered four publicly available databases, namely MNIST (handwritten digit recognition), CIFAR-10 (natural scene classification), ISBI 2016 (skin cancer classification) and ISBI 2017 (skin cancer classification). We compare our results with various similar approaches available for vision classification.

Experimental Setup
The general structure of the CNN model is based on the VGG model using 3 × 3 convolutions [30]. We experimented with and without a BatchNormalization layer after each convolutional layer [18]. This layer helps to increase the convergence speed and also the performance of the model. ReLU is used as the non-linearity, since it greatly accelerates the convergence of stochastic gradient descent or any other optimizer as compared to the sigmoid/tanh functions [20]. All the networks are trained using the Adam optimizer, and the learning rate is halved after every 200 epochs. The batch size is 1000 for MNIST and CIFAR-10 and 400 for ISBI 2016 and ISBI 2017, as the training data is quite small in the case of the ISBI databases. Related methods show that mini-batch learning on distribution parameters (in this case covariance matrices) is feasible if the batch size is sufficiently large to be representative of the entire population [32]. Even though a large batch size is required to obtain stable estimates, it is limited by the data availability, the image size and the memory available on the GPU. Table 1 shows the detailed CNN model specifications for the CIFAR-10 and MNIST databases. The total number of trainable parameters is 5,752,414 for the CIFAR-10 model and 467,486 for the MNIST model. In all our experiments, the proposed method is compared with the existing ones using the same datasets and protocols. The experiments are implemented on a system with an Intel Core i7 processor, 16 GB RAM and an NVIDIA GeForce GTX-1050Ti GPU card.
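The learning-rate schedule described above (halving after every 200 epochs) can be written as a one-line step decay. The sketch below is framework-independent; the function name and the initial learning rate of 1e-3 are assumptions, since the initial value is not stated here.

```python
def halved_learning_rate(initial_lr, epoch, step=200):
    """Learning rate after halving once per `step` completed epochs."""
    return initial_lr * 0.5 ** (epoch // step)

# Hypothetical initial learning rate; the text does not give the actual value.
initial_lr = 1e-3
schedule = [halved_learning_rate(initial_lr, e) for e in (0, 199, 200, 400)]
```

In practice such a function would typically be passed to the training framework's scheduler hook so that the optimizer picks up the new rate at the start of each epoch.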

MNIST
The MNIST database consists of 28 × 28 grayscale images with labels 0 to 9. It contains 60,000 samples, of which 50,000 are training data and 10,000 are validation data. The test set consists of 10,000 images, following the same protocol as in [9]. Since the proposed method requires a large batch size, we used a batch size of 1000 for MNIST. The optimizer is Adam, and the initial learning rate is halved every 200 epochs. For final classification, we use a linear support vector machine (SVM) classifier. Table 2 compares our proposed approach with various relevant methods on the MNIST database. The results show that our proposed method with the new cost function is second best and comparable with the other reported state-of-the-art performances. It is therefore evident that adding the latent space representation to the cost function, by maximizing the between-class and minimizing the within-class eigen representation, efficiently learns the features required for classification. Thus the training is done in an unsupervised manner and, using the linear SVM, we do the final classification on the testing data. Figure 2a shows the evolution of the mean eigenvalues of the total scatter matrix over the training epochs. Figure 2b shows the eigenvalues of the within-class scatter matrix over the epochs, which initially increase but later decrease. Together, Fig. 2a, b show that we achieve our objective of minimizing the within-class variation and maximizing the total variation among different classes.

CIFAR-10
The CIFAR-10 database consists of 32 × 32 images belonging to 10 different classes. It contains 50,000 training samples and 10,000 testing samples, the same as in [9]. We normalize the pixel values to lie between 0 and 1. Table 1 describes the network structure and, as in the MNIST experiments described above, the initial learning rate is halved every 200 epochs. Table 3 summarizes the comparison of our proposed approach and various relevant methods on this database. It can be seen that our proposed methodology achieves the second best accuracy for this natural image classification task.

Table 2 Comparison of test errors (%) on MNIST database using our proposed approach and other relevant methodologies

ISBI 2016 and ISBI 2017
To show the efficacy of the proposed objective function, we conducted experiments on both multi-class (MNIST and CIFAR-10) and binary classification databases (ISBI 2016 and 2017). The ISBI databases consist of dermoscopic lesion images for distinguishing melanoma skin cancer from non-melanoma cases. The ISBI 2016 database consists of 900 training images and 379 testing images. The database is unbalanced, with 727 benign images and 173 melanoma images. Similarly, the ISBI 2017 database consists of 2000 training samples and 600 testing samples. As stated by Wang et al. [32], mini-batch learning with covariance estimates requires a batch size large enough to represent the entire population. To overcome the batch size problem caused by the limited availability of ISBI training and testing data, the large size of these images (224 × 224) and the limited amount of GPU memory, we first fine-tuned a pretrained ResNet-50 model, which has 25,636,712 parameters, and then extracted the features from the last convolutional layer. We used these 2-dimensional features as inputs to train an MLP (multi-layer perceptron), i.e., fully connected layers, with the proposed objective function. The sigmoid activation function is the most favoured activation function for shallow networks. We also experimented with ReLU and tanh, but they gave no significant improvement. An activation function adds nonlinearity to the existing nodes of the network. For deeper networks, ReLU is the best activation function since it increases the convergence rate. A disadvantage of ReLU is that ReLU units can be fragile during training and can "die" easily [1].
The following performance criteria are used for comparison of the proposed approach with the existing methodologies:
- Accuracy: The ratio of correct predictions to the total number of predictions, formulated as

$$ACC = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.
- Sensitivity: The ability of the algorithm to correctly predict the diseased cases (i.e., malignant):

$$SE = \frac{TP}{TP + FN}. \tag{15}$$

- Specificity: The ability of the algorithm to correctly predict the non-diseased cases (i.e., benign):

$$SP = \frac{TN}{TN + FP}.$$

- AUC: Area under the receiver operating characteristic curve, which plots the true positive rate against the false positive rate.
- Average precision: Average precision (AP) is the area under the precision-recall curve. A detailed explanation can be found in [16].

Method              Accuracy (%)
[15]                88.32
PCANet-2 [5]        78.67
DeepLDA [9]         92.42
Proposed method     90.04

Since the DeepLDA approach uses traditional LDA, it can provide at most the number of classes minus one principal eigenvalues, which for this database is (2 − 1) = 1. Thus, in the end there would be only one eigenvalue to maximize so as to have maximum inter-class separation and minimum within-class separation. In our approach, we use the variance information of the total-class scatter matrix to find the optimal projection among all the training data samples. This enables us to select up to n − 1 eigenvalues, where n is the total number of training samples. The model loss over the epochs is shown in Fig. 3a for the ISBI 2016 database and in Fig. 3b for the ISBI 2017 database. The plots show that in both cases the loss decreases evenly as the number of epochs increases and finally converges. Tables 4 and 5 compare this approach with the existing ones on the ISBI 2016 and 2017 databases, respectively. The results obtained on these databases do not exceed the best accuracy obtained so far, but they show a new way to proceed by inheriting the class separability into the deep neural net as a result of changing the objective function. We implemented DeepLDA
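The accuracy, sensitivity and specificity criteria used in this section can be computed directly from the confusion counts. The sketch below is a minimal pure-Python illustration; the function name, the label convention (1 for melanoma, 0 for benign) and the toy predictions are assumptions and are not taken from the experiments.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity; 1 = diseased, 0 = non-diseased."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # fraction of diseased cases detected
        "specificity": tn / (tn + fp),   # fraction of benign cases recognised
    }

# Toy labels (hypothetical): 3 melanoma and 5 benign cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

On an unbalanced database such as ISBI 2016, reporting sensitivity and specificity alongside accuracy matters, because a classifier that always predicts the majority class can still reach a high accuracy.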

Conclusions
In this paper, we have proposed an objective function that works for both binary and multi-class classification problems. The proposed loss function minimizes the within-class variance and maximizes the total-class variance. We evaluated our method on popular databases for various applications, namely MNIST (handwritten digit recognition) and CIFAR-10 (natural image classification), and have shown that the proposed approach achieves competitive performance on these databases as compared to other methods. For the application of melanoma detection (skin cancer detection into melanoma and non-melanoma cases), since the number of images is small, we trained the network using a multi-layer perceptron and achieved an accuracy of 84.9% on the ISBI 2016 database and 83.3% on the ISBI 2017 database. These experimental results show the efficacy of our proposed approach as compared to other methods for many computer vision classification tasks.

Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.