Robust feature space separation for deep convolutional neural network training

This paper introduces two deep convolutional neural network training techniques that lead to more robust feature subspace separation than traditional training. Assume the dataset has M labels. The first method creates M deep convolutional neural networks, {DCNN_i}_{i=1}^M. Each network DCNN_i is composed of a convolutional neural network (CNN_i) and a fully connected neural network (FCNN_i).
In training, a set of projection matrices {P_i}_{i=1}^M is created and adaptively updated as representations of the feature subspaces {S_i}_{i=1}^M. A rejection value is computed for each training sample based on its projections onto the feature subspaces. Each FCNN_i acts as a binary classifier with a cost function whose main parameter is the rejection value. A threshold value t_i is determined for the i-th network DCNN_i.
A testing strategy utilizing {t_i}_{i=1}^M is also introduced. The second method creates a single DCNN and computes a cost function whose parameters depend on subspace separations, using the geodesic distance on the Grassmannian manifold between a subspace S_i and the sum of all remaining subspaces {S_j}_{j=1,j≠i}^M. The proposed methods are tested using multiple network topologies. It is shown that while the first method works better for smaller networks, the second method performs better for complex architectures.


Introduction
There has been an explosion of deep learning applications since the reintroduction of Convolutional Neural Networks (CNNs) for image classification [1] in 2012 and their application to the ImageNet dataset in successive years [2, 3, 4]. Deep learning has been successfully applied in self-driving cars for detection and classification of traffic-related objects and persons [5], in face recognition for social media platforms [6], in natural language processing [7], and in symbolic mathematics [8]. A deep convolutional neural network architecture was developed in [9] for classification of skin lesions using a large set of training images. Another architecture was developed in [10] for detection of diabetic retinopathy in retinal fundus photographs. Predicting the 3-D structure of a protein using only its amino acid sequence is a challenging task; DeepMind recently announced that its AlphaFold algorithm can predict protein structures with atomic accuracy using deep learning [11, 12].
There is growing interest in explaining the mathematical foundations of CNNs. The work in [13] uses wavelet theory to explain computational invariants in convolutional layers and attempts to predict kernel parameters without the need for training. There are other attempts to establish the mathematical foundations of deep convolutional neural networks [14, 15].
According to the manifold hypothesis, in many real-world problems high-dimensional data approximately lies on lower-dimensional manifolds [15]. It has been shown that the trajectories observed in F video frames of M independently moving rigid bodies come from a union of M 4-dimensional subspaces of ℝ^{2F} [16]. It has been experimentally shown that face images of a person with the same facial expression under different illumination approximately lie in a 9-dimensional subspace [17]. A general framework for clustering data that comes from a union of independent subspaces is given in [18, 19], and a practical algorithm is given in [20]. A detailed treatment of the subspace segmentation problem can be found in [21]. Autoencoder-based deep learning has also been applied to subspace clustering [22, 23].
The popularity of CNNs stems from the fact that they act as automatic feature extractors, with cascaded layers that can generate increasingly complex features from input datasets. As opposed to the hand-crafted nature of some existing feature extractors, such as SIFT [24] for vision data and MFCC [25] for audio data, the final layer of a CNN typically generates feature vectors that are linearly separable by a Fully Connected Neural Network (FCNN). A typical Deep Convolutional Neural Network (DCNN) is trained using a Stochastic Gradient Descent (SGD) based algorithm such as Adam [26]. The work in [27] uses manifold learning to improve the feature representations of a transfer learning model. The work in [28] uses local linear embedding on the output of each convolutional layer, particularly for action recognition. Some other related work can be found in [29, 30].
This research proposes two novel DCNN architectures and associated training methods with the main goal of converting input data into feature vectors on more separable manifolds (or subspaces) at the CNN output. The first method creates multiple DCNNs, each of which adaptively generates a projection matrix for each feature subspace [31, 32]. A rejection value is computed for each label based on its projections onto the feature subspaces. The second method is based on the idea of maximizing the geodesic distance between a feature subspace and the sum of the remaining feature subspaces [33].
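The subspace-distance idea behind the second method can be illustrated in isolation. The sketch below is only an illustration, not the authors' exact cost function: it computes the principal angles between two equal-dimensional subspaces from the SVD of the product of their orthonormal basis matrices and returns the projection distance.

```python
import numpy as np

# Illustrative sketch (not the paper's exact cost function): the principal
# angles between two equal-dimensional subspaces follow from the singular
# values of B1.T @ B2, where B1 and B2 hold orthonormal bases in their
# columns; the projection distance is sqrt(sum(sin^2 theta_k)).
def projection_distance(B1, B2):
    cosines = np.linalg.svd(B1.T @ B2, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)   # guard against rounding above 1
    theta = np.arccos(cosines)             # principal angles
    return float(np.sqrt(np.sum(np.sin(theta) ** 2)))

# Two orthogonal lines in R^2: a single principal angle of pi/2, distance 1.
B1 = np.array([[1.0], [0.0]])
B2 = np.array([[0.0], [1.0]])
print(projection_distance(B1, B2))  # 1.0
```

Identical subspaces give distance 0; maximizing this quantity pushes one class's feature subspace away from the others.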

Paper contributions
- This work develops a classification method using M deep convolutional neural networks in parallel. In training, a set of projection matrices is created and adaptively updated as representations of the feature subspaces. A rejection value is computed for each training sample based on its projections onto the feature subspaces. A threshold value is determined for each network, and a testing strategy utilizing all thresholds is also introduced.
- This work also develops another classification method using a single DCNN that minimizes a cost function whose parameters depend on subspace separations, using the geodesic distance on the Grassmannian manifold between a feature subspace and the sum of all remaining feature subspaces.
- Experiments on real data (datasets of digits, letters, and fashion products) are performed to validate the proposed architectures. The proposed methods are tested using five different deep convolutional network topologies. It is shown that while the first method works better for smaller networks, the second method performs better for complex architectures.
- A new matrix rank estimation technique is introduced.

Layout
Section gives a detailed treatment of the first novel deep convolutional network with multiple CNNs. Section introduces the second novel network with maximum subspace separation based on principal angles between subspaces. The numerical experiments are presented in Section and some future work is motivated in Section .

Feature space separation-first approach
Let x_n^j be the n-th input in class C_j and let f_n^j be the corresponding feature vector at the CNN output. Let d_n^{j,i} be the distance between f_n^j and S_i, which is computed at each iteration as in Equation (1):
d_n^{j,i} = ||(I − P_i) f_n^j||_2.    (1)
The objective of training is to minimize d_n^{j,i} if j = i and maximize it otherwise. A set of threshold values {t_i}_{i=1}^M is generated as a result of the trained DCNNs, and these thresholds are used for testing when an unknown input is provided. Depending on the test topology (as used in Section ), FCNN_i may be omitted.
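The distance computation can be sketched as follows, assuming a subspace S_i is represented by a matrix B whose columns form an orthonormal basis, so that P_i = B Bᵀ is the projection matrix; the names here are illustrative.

```python
import numpy as np

# Minimal sketch (names are illustrative): the distance of a feature vector f
# to a subspace spanned by the orthonormal columns of B is the norm of the
# rejection (I - P) f, where P = B @ B.T is the projection matrix.
def distance_to_subspace(f, B):
    P = B @ B.T                 # projection matrix onto span(B)
    rejection = f - P @ f       # component of f orthogonal to the subspace
    return float(np.linalg.norm(rejection))

# Example: S is the xy-plane in R^3; the point (1, 2, 3) is at distance 3.
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
f = np.array([1.0, 2.0, 3.0])
print(distance_to_subspace(f, B))  # 3.0
```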

Network training
Algorithm 1 summarizes high-level overall training and generation of threshold values. All network parameters are randomly initialized and training is performed using the steps described in the subsequent subsections.

Singular value decomposition for subspace approximation
Since all filter parameters are randomly initialized, the initial subspaces are not expected to be accurate. In order to match a subspace to the i-th feature space, all input data in C_i is passed through CNN_i and a data matrix F_i, whose columns are the feature vectors at the CNN output for C_i, is created. The Singular Value Decomposition (SVD) of F_i is taken and its effective rank r_i is estimated. In this work, a new rank estimation algorithm was developed, as described in Algorithm 2. Some other rank estimation techniques can be found in [34, 35]. Then, the first r_i left singular vectors of F_i form an orthonormal basis for S_i, and the projection matrix P_i is constructed from that basis.
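A minimal sketch of this subspace-matching step is given below. The paper's Algorithm 2 is not reproduced here; a simple energy-threshold heuristic stands in for the proposed rank estimator, and all names are illustrative.

```python
import numpy as np

# Hedged sketch of the subspace-matching step: Algorithm 2's rank rule is not
# reproduced; an energy-threshold heuristic stands in for it.
def match_subspace(F, energy=0.95):
    """F: d x N matrix whose columns are CNN output features for one class.
    Returns an orthonormal basis for the estimated feature subspace."""
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)           # cumulative spectral energy
    r = int(np.searchsorted(cum, energy)) + 1      # effective rank (heuristic)
    return U[:, :r]                                # basis B; projection P = B @ B.T

# Rank-2 synthetic data: 10-dimensional features lying in a 2-D subspace.
rng = np.random.default_rng(0)
B0, _ = np.linalg.qr(rng.standard_normal((10, 2)))
C = np.vstack([np.ones(50), np.linspace(-1.0, 1.0, 50)])
F = B0 @ C
B = match_subspace(F)
print(B.shape)  # (10, 2): the 2-D subspace is recovered
```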

DCNN_i training
Algorithm 3 gives the details of training for each DCNN_i. Note that the rejection is defined as r_n^{j,i} = (I − P_i) f_n^j, and it is fed into FCNN_i.

Computing class separation
Assume the entire input set C_j is fed forward through CNN_i. Let v_{j,i} be a vector whose entries are the norms of the rejection values, i.e., the distances d_n^{j,i} for all n input vectors in C_j. Let min(v_{j,i}), max(v_{j,i}), and mean(v_{j,i}) be the minimum, maximum, and mean values of v_{j,i}. Then, max(v_{i,i}) is compared with min({min(v_{j,i})}_{j=1,j≠i}^M) to assess whether full class separation was achieved. Algorithm 4 summarizes the steps.
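The separation check can be sketched as follows; this is a hedged illustration of the comparison in Algorithm 4, with made-up names and data.

```python
# Hedged illustration of the separation check in Algorithm 4 (names and data
# are made up): class i counts as fully separated when its largest in-class
# rejection norm is below the smallest rejection norm among all other classes.
def fully_separated(i, rejection_norms):
    """rejection_norms[j] lists the rejection norms of class j's inputs
    after passing through CNN_i."""
    max_in = max(rejection_norms[i])
    min_out = min(min(v) for j, v in enumerate(rejection_norms) if j != i)
    return max_in < min_out

norms = [[0.10, 0.20, 0.15],   # class 0: small in-class rejections
         [0.90, 1.10, 0.80],   # class 1
         [0.70, 0.95, 1.30]]   # class 2
print(fully_separated(0, norms))  # True (0.20 < 0.70)
```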

Training algorithm-faster
In order to speed up the training process, it is possible to use only the non-separated inputs in the next training iteration. The class separation values are computed every K iterations and the input sets are updated accordingly. The entire dataset of the corresponding class is still used to compute the data matrix. Algorithm 5 presents the steps.
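The filtering step can be sketched as follows (a hedged sketch of the speed-up in Algorithm 5; names are illustrative):

```python
# Hedged sketch of the speed-up in Algorithm 5 (names are illustrative): every
# K iterations, keep only the inputs whose rejection norm has not yet dropped
# below the smallest out-of-class rejection norm, and train on those.
def unseparated_indices(in_class_norms, min_out_of_class):
    return [n for n, d in enumerate(in_class_norms) if d >= min_out_of_class]

# Inputs 0 and 2 are already separated (rejection < 0.5); keep only input 1.
print(unseparated_indices([0.1, 0.7, 0.2], 0.5))  # [1]
```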

Computing thresholds
The set of threshold values {t_i}_{i=1}^M is computed after the completion of training. Even though multiple approaches can be considered, this work uses three, as listed below. Each threshold is used to determine whether an unknown input belongs to a particular class. Let x be an unknown input that is passed through DCNN_i. If the rejection value of x is less than t_i, then x belongs to the i-th class.
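A hedged sketch of threshold-based classification follows. The threshold rule shown (largest in-class rejection norm from training) is one plausible choice, not necessarily one of the paper's three approaches, and falling back to the minimum rejection value when the number of connections is not exactly one is an assumption.

```python
# Hedged sketch: make_threshold uses the largest in-class rejection norm seen
# in training (an illustrative choice, not necessarily one of the paper's
# three approaches); classify falls back to the minimum rejection value when
# the number of threshold "connections" is not exactly one (an assumption).
def make_threshold(in_class_rejections):
    return max(in_class_rejections)

def classify(rejections, thresholds):
    """rejections[i] is the rejection value of an unknown input under DCNN_i."""
    connected = [i for i, (r, t) in enumerate(zip(rejections, thresholds)) if r < t]
    if len(connected) == 1:
        return connected[0]                       # unique connection wins
    return rejections.index(min(rejections))      # otherwise rank rejections

thresholds = [0.30, 0.25, 0.40]
print(classify([0.10, 0.90, 0.80], thresholds))   # 0 (single connection)
print(classify([0.50, 0.45, 0.90], thresholds))   # 1 (no connection; minimum)
```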

Testing
In order to find the label of an input image x, it is fed forward through all DCNNs; the associated rejection values are computed and compared with the corresponding threshold values. A value of 0 or 1 indicates x's connection to each class. If there is exactly one network connection, x is labeled with that class. If no connection exists, the rejection values are ranked and x is labeled with the class that has the minimum rejection value. Algorithm 6 summarizes the steps to determine the class label of an unknown image.

Feature space separation-second approach
Figure 2 illustrates the approach. We consider a single DCNN and train it using all available data; this is called pre-training. After pre-training, each input class C_i is passed through the CNN and a data matrix F_i, whose columns are the feature vectors corresponding to the inputs in C_i, is constructed. Using the SVD of F_i, a subspace S_i is matched to it. Then, the separation of S_i from the sum of the remaining subspaces is computed and the network parameters are updated to maximize the separation. Algorithm 7 gives the steps for training the DCNN.
There are different measures for the separation of subspaces. Each subspace can be represented as a point on a Grassmannian manifold [36], and various distances, such as the geodesic arc length, chordal distance, or projection distance, can be considered. In this work, the projection distance is used: d(S, U) = (∑_{k=1}^{p} sin² θ_k)^{1/2}, where θ_1, θ_2, …, θ_p are the principal angles between S and U. The principal angles are calculated as follows (using concepts from Algorithm 7).
- Let Q and U_i[1:r_i] be orthonormal basis matrices for U = ∑_{j=1,j≠i}^{M} S_j and S_i, respectively.

Numerical results

Results for first method
The first method is tested on MNIST [37] (handwritten digits, Figure 3) and Fashion-MNIST [38] (fashion products, Figure 4), using five different network topologies (detailed in Table 1), and compared to the results obtained with the traditional DCNN approach. Three of the topologies assess the effect of network size on performance, one adds dropout layers to the fully connected network, and one has only a CNN without an FCNN. Since the last topology does not have a label categories layer, it cannot be tested in the traditional way. The results are shown in Table 2. Our first method performs better for small networks, as observed with Topologies 2 and 3; it performs better with a smaller number of filters. The overfitting problem in traditional DCNN training is addressed by the dropout layer introduced in Topology 4, which greatly improves the traditional performance while having no effect on the new approach. Because an iteratively refined subspace is used to describe each class, the subspace overfits to features that occur more frequently. In Topology 5, the CNN alone achieves an accuracy close to that of the CNN with FCNN in Topology 3; in other words, the CNN is able to generate features that can be separated without an FCNN.

Results for second method
In this part, the EMNIST dataset [39], which includes digits and letters, is used for testing purposes. The EMNIST (Extended MNIST) data that support the findings of this study are available from the Western Sydney University repository [https://rds.westernsydney.edu.au/Institutes/MARCS/BENS/EMNIST/emnist-gzip.zip [39]]. We considered five network topologies to reflect different complexities and the impact of a dropout layer. The network topologies and the performance results are shown in Table 3 and Table 4, respectively. The experimental results show that the angle-based approach performs better for all topologies. It should be noted that, in order to rule out the possibility that the improvements were due to retraining of the FCNN, the same extra training was applied to Network-2 and a performance of 96.02% was obtained. In other words, traditional training (CNN + FCNN) yields 94.24%, traditional training with additional FCNN training yields 96.02%, and the new architecture yields 98.37%.

Conclusions
This paper introduced two methods that aim at enhancing feature subspace separation during the training process. The first method creates multiple deep convolutional neural networks, and the network parameters are optimized based on projections of the data onto subspaces. In the second method, the geodesic distance on the Grassmannian manifold between a particular feature subspace and the sum of all remaining feature subspaces is maximized. As future work, other subspace-based methods, such as representing each feature subspace by orthogonal subspaces and training to preserve orthogonality as new training data arrives, should also improve accuracy. Such network training may be more robust, especially against adversarial effects.

Materials and methods
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.