A dimensionality reduction approach for convolutional neural networks

The focus of this work is on the application of classical Model Order Reduction techniques, such as Active Subspaces and Proper Orthogonal Decomposition, to Deep Neural Networks. We propose a generic methodology to reduce the number of layers in a pre-trained network by combining the aforementioned techniques for dimensionality reduction with input-output mappings, such as Polynomial Chaos Expansion and Feedforward Neural Networks. The motivation behind compressing the architecture of an existing Convolutional Neural Network arises from its usage in embedded systems with specific storage constraints. The conducted numerical tests demonstrate that the resulting reduced networks can achieve a level of accuracy comparable to the original Convolutional Neural Network being examined, while also saving memory allocation. Our primary emphasis lies in the field of image recognition, where we tested our methodology using VGG-16 and ResNet-110 architectures against three different datasets: CIFAR-10, CIFAR-100, and a custom dataset.


Introduction and motivations
Neural networks are a widespread machine learning technique, increasingly employed in many different fields such as computer vision [28], natural language processing [52], robotics [32], and speech recognition [19]. The accuracy of such models is strictly related to the number of layers, neurons, and inputs [17,25,47]; therefore, to tackle ever more complex problems, architectures are forced to go deep. If on the one hand we gain precision, on the other hand the high number of degrees of freedom results in a longer optimization step and, on the practical side, a bigger architecture to manage. The dimension of the network is rarely considered a bottleneck of this methodology, but the diffusion of neural networks in many engineering fields has led to their employment also in embedded systems [40], which typically offer limited hardware. In these contexts the size of the architecture can be an additional constraint, calling for a reduction in the number of degrees of freedom of the network.
Finding the intrinsic dimension of neural networks is a very challenging task and, to the best of the authors' knowledge, not supported by rigorous theoretical proofs. Among the different proposed methods, we mention network pruning and sharing [20,30,29], low-rank matrix and tensor factorization [43,54,33], parameter quantization [23,13], and knowledge distillation [22,37]. In this contribution we propose an extension of the idea explored in [9], where the Active Subspace (AS) property and Polynomial Chaos Expansion (PCE) are exploited to provide a reduced and more robust version of the original network. While that contribution analyzed the capability of AS for reducing a deep architecture, we aim here to provide a generic framework for neural network reduction, investigating mathematical tools other than AS and PCE. Mimicking the procedure presented in [9], the original architecture is initially split into two cascading parts, the pre- and post-model: we assume that the second one brings a negligible contribution to the final outcome, giving us the possibility to approximate that part of the model without introducing a large error. A response surface (or, in more general terms, an input-output mapping) is then built to fit the data, replacing the last layers of the network. This response surface may belong to a high-dimensional space, since its input dimension equals the dimension of the output features of the pre-model. It follows that, to keep the reduction computationally affordable, we also need a dimensionality reduction of the pre-model outputs (which, we remark, are also the input parameters of the response surface). Combining all these ingredients, we obtain a reduced version of the network, containing just a few of the initial layers, but with a precision comparable to the full model. We specify that the numerical experiments we are going to present involve only Convolutional Neural Networks (CNNs), but the generality of the methodology allows in principle for applicability to other models as well.
In this contribution we explore different tools for the dimensionality reduction and for the response surface: in addition to AS and PCE, already tested in the aforementioned reference, we employ Proper Orthogonal Decomposition (POD) and Feedforward Neural Networks (FNN). The first is a well established technique for model order reduction [41,44,42] that, similarly to AS, compresses the data by projecting it onto a low-dimensional space. FNN is instead applied to construct the response surface, as an alternative to PCE. The advantage of FNN over PCE is twofold: (i) the simplified input-output mapping (thanks to the low-dimensional space) allows using a FNN with few layers and neurons, further reducing the already minimal space demand of the PCE method; (ii) on the programming side, the possibility to approximate part of the neural network with another network makes the software integration easier, especially when the hosting system is embedded.
The article is organized as follows. Section 2 provides an algorithmic overview of all the numerical methods involved in the reduction framework (AS in section 2.1.1, POD in section 2.1.2, PCE in section 2.2.1, and FNN in section 2.2.2), while in section 3 we cover in detail the framework to reduce the neural networks. In section 4 we present the results obtained by reducing, with the proposed methodology, a benchmark CNN designed for image recognition. We repeat this analysis using two different datasets during the initial learning step, investigating the dependency of the results on the original problem. Finally, in section 5 we summarize the entire procedure and propose some future perspectives to enhance the framework.

Numerical tools
We introduce in this section all the techniques employed for the reduction of the network, in order to ease the understanding of the framework presented in Section 3.

Dimensionality reduction techniques
This subsection is devoted to an algorithmic overview of the reduction methods tested within this contribution: the Active Subspace (AS) property and the Proper Orthogonal Decomposition (POD). Widely employed in the reduced order modeling community, these techniques are used here to reduce the dimensionality of the intermediate convolutional features, but we postpone the details to the next section. We just specify that, even if in this work we focus on AS and POD, the framework is generic, allowing in principle to replace these two with other dimensionality reduction methods.

Active Subspaces
The Active Subspaces (AS) [7,8] method is a reduction tool used to identify important directions in the parameter space by exploiting the gradients of the function of interest. This information allows applying a rotational transformation to the domain in order to obtain an approximation of the original function in lower dimension. Let µ = [µ_1 … µ_n]^T ∈ R^n represent an n-dimensional variable with probability density function ρ(µ), and let g(µ) : R^n → R be the function of interest. We assume here g is scalar and continuous (for the vector-valued extension see [38,53]). Starting from this, an uncentered covariance matrix C of the gradient of g can be constructed by considering the average of the outer product of the gradient with itself:

C = E[∇_µ g ∇_µ g^T] = ∫ (∇_µ g)(∇_µ g)^T ρ dµ,    (2.1)

where the symbol E[·] denotes the expected value, and ∇_µ g ≡ ∇g(µ). We assume the gradients are computed during the simulation; if not provided, they can be approximated with different techniques such as local linear models, global models, finite differences, or Gaussian processes [1,50], for example. Since C is symmetric, it admits the following eigenvalue decomposition:

C = V Λ V^T,    (2.2)

where V is the n × n orthogonal matrix whose columns {v_1, …, v_n} are the normalized eigenvectors of C, whereas Λ is a diagonal matrix containing the corresponding non-negative eigenvalues λ_i, for i = 1, …, n, arranged in descending order. We can decompose these two matrices as:

V = [V_1  V_2],    Λ = diag(Λ_1, Λ_2),    with V_1 ∈ R^{n × n_AS},    (2.3)

The space spanned by the columns of V_1 is called the active subspace of dimension n_AS < n, whereas the inactive subspace is defined as the range of the remaining eigenvectors in V_2. Once we have defined these spaces, the input µ ∈ R^n can be reduced to a low-dimensional vector µ_1 ∈ R^{n_AS} using V_1 as the projection map. To be more precise, any µ ∈ R^n can be expressed as follows using the decomposition in Eq. (2.3) and the properties of V:

µ = V V^T µ = V_1 V_1^T µ + V_2 V_2^T µ,    (2.4)

where the two new variables µ_1 and µ_2 are the active and inactive variable, respectively:

µ_1 = V_1^T µ ∈ R^{n_AS},    µ_2 = V_2^T µ ∈ R^{n − n_AS}.    (2.5)

For the actual computation of the AS we have used the open source Python package ATHENA [39].
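The actual computations rely on ATHENA, but the core estimate can be sketched in plain NumPy. The function name and the Monte Carlo setup below are illustrative, not taken from the paper or from ATHENA's API:

```python
import numpy as np

def active_subspace(gradients, n_as):
    """Monte Carlo estimate of the active subspace from sampled gradients.

    gradients: (n_samples, n) array, each row is grad g at one sample of mu;
    n_as: target dimension of the active subspace.
    """
    # Uncentered covariance C = E[grad g grad g^T], averaged over samples
    C = gradients.T @ gradients / gradients.shape[0]
    # C is symmetric: eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(C)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order
    V1 = eigvecs[:, :n_as]  # projection map onto the active subspace
    return eigvals, V1
```

The active variable of eq. (2.5) is then obtained as `V1.T @ mu`.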

Proper Orthogonal Decomposition
In this section we describe the Proper Orthogonal Decomposition (POD) approach of reduced order modeling [21], which decreases the number of degrees of freedom of a parametric system.
In particular, we focus on the POD with interpolation (PODI) method [5,6,31,10,11,12]. Let S = [u_1 … u_{n_S}] be the matrix of snapshots, i.e. the full order system outputs u_i ∈ R^N. Once these solutions are collected, we aim to describe them as a linear combination of a few main structures, the POD modes, and thus project them onto the low-dimensional space spanned by these modes. In order to calculate the POD modes, we compute the singular value decomposition (SVD) of the snapshots matrix S:

S = Ψ Σ Φ^T,    (2.6)

where the left-singular vectors, i.e. the columns of the unitary matrix Ψ, are the POD modes, and the diagonal matrix Σ contains the corresponding singular values in decreasing order. Therefore, by selecting the first modes we retain only the most energetic ones, and we can construct a reduced space onto which we project the high-fidelity solutions. Hence we obtain:

S_POD = Ψ_{N_POD}^T S,    (2.7)

where Ψ_{N_POD} is the matrix containing only the first N_POD modes and the columns of S_POD represent the reduced snapshots ũ_i ∈ R^{N_POD}.
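The two steps above, the SVD of eq. (2.6) and the projection of eq. (2.7), amount to a few lines of NumPy; the following is a minimal sketch with an illustrative function name:

```python
import numpy as np

def pod_reduce(S, n_pod):
    """POD of the snapshot matrix S (N x n_s) via the thin SVD of eq. (2.6)."""
    Psi, sigma, _ = np.linalg.svd(S, full_matrices=False)
    Psi_r = Psi[:, :n_pod]      # the n_pod most energetic modes
    S_pod = Psi_r.T @ S         # reduced snapshots in R^{n_pod}, eq. (2.7)
    return Psi_r, sigma, S_pod
```

If S has rank at most n_pod, the reduced snapshots lose no information: `Psi_r @ S_pod` reconstructs S exactly.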

Input-output mapping
Once the outputs of the intermediate layer are dimensionally reduced, we need to correlate them to the final output of the original network, e.g. the class labels in an image classification problem. An input-output mapping is then built starting from the input dataset. The next subsections are dedicated to an algorithmic overview of the two methods explored for approximating the mapping: the Polynomial Chaos Expansion (PCE) [51] and the Feedforward Neural Network (FNN) [14].

Polynomial Chaos Expansion
The Polynomial Chaos Expansion (PCE) theory was initially proposed by Wiener in [49], showing that a real-valued random variable X : R^R → R can be decomposed in the following way:

X = Σ_{j=0}^{∞} c_j φ_j(ξ),    (2.8)

i.e. as an infinite sum of orthogonal polynomials weighted by unknown deterministic coefficients c_j [24]. The vector ξ = (ξ_1, …, ξ_R) represents the multi-dimensional random vector, where each element is associated with an uncertain input parameter, while the φ_j(ξ) are multivariate orthogonal polynomials, which can be decomposed into products of one-dimensional orthogonal polynomials in the different variables.
We can approximate the infinite sum in Eq. (2.8) by truncating it at the (P + 1)-th term, such that:

X ≈ Σ_{j=0}^{P} c_j φ_j(ξ),    (2.9)

with the number of unknown coefficients in this summation given by P + 1 = (p + R)! / (p! R!) [15], where p is the degree of the polynomial we are considering in the R-dimensional space. When the parameters ξ_1, …, ξ_R are independent, φ_j(ξ) can be decomposed into products of one-dimensional functions:

φ_j(ξ) = Π_{k=1}^{R} φ_{j_k}(ξ_k).    (2.10)

In order to determine the PCE, we need to find the polynomial chaos expansion coefficients c_j for j = 0, …, P, and the one-dimensional orthogonal polynomials. Based on the work of Askey and Wilson [4], we can provide the orthogonal polynomials for different distributions. One possible choice is represented by the Gaussian distribution with the related Hermite polynomials. The estimation of the PCE coefficients can be carried out in different ways [46]: following a projection method based on the orthogonality of the polynomials, or following a regression method, which is the one we are going to describe. In order to determine the coefficients c_j, we need to solve a minimization problem:

min_{c_0,…,c_P} Σ_{i=1}^{N_PCE} ( X^{(i)} − Σ_{j=0}^{P} c_j φ_j(ξ^{(i)}) )^2,    (2.11)

where N_PCE indicates the total number of realizations of the input vector we are considering, whereas X represents the real output of the model. In order to solve eq. (2.11) we need to consider the following matrix:

Φ_{ij} = φ_j(ξ^{(i)}),    i = 1, …, N_PCE,    j = 0, …, P.    (2.12)

Thus, the solution of eq. (2.11) is computed by a least-squares optimization:

c = (Φ^T Φ)^{-1} Φ^T X.    (2.13)

If the matrix Φ^T Φ is ill-conditioned, as may happen, the singular value decomposition method should be employed.
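The regression approach of eqs. (2.11)-(2.13) can be sketched for the one-dimensional Gaussian case, where the φ_j are the probabilists' Hermite polynomials He_j. The function names are illustrative; the SVD-based solver handles the ill-conditioned case mentioned above:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander

def pce_fit(xi, X, p):
    """Least-squares PCE coefficients for a 1-D standard-normal input.

    The design matrix is Phi[i, j] = He_j(xi_i), the probabilists' Hermite
    polynomials, orthogonal with respect to the Gaussian density.
    """
    Phi = hermevander(xi, p)                     # (N_PCE, p + 1), eq. (2.12)
    c, *_ = np.linalg.lstsq(Phi, X, rcond=None)  # SVD-based least squares
    return c

def pce_eval(xi, c):
    """Evaluate the truncated expansion of eq. (2.9)."""
    return hermevander(xi, len(c) - 1) @ c
```

A quadratic response such as X = 1·He_0 + 2·He_1 + 3·He_2, with He_2(ξ) = ξ² − 1, is recovered exactly by a degree-2 fit.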

Feedforward Neural Network
A Feedforward Neural Network (FNN), also called multilayer perceptron, is a popular neural network model, usually employed for function regression [14]. As depicted in fig. 1, it mainly consists of an input layer, an output layer, and a certain number of hidden layers, where the processing units composing them are called neurons. Each neuron is defined by a weight vector that characterizes the strength of the connection with the neurons in the next layer. More technically speaking, let x ∈ R^{n_in} be the input vector and M the total number of hidden layers of the FNN. The output vector h ∈ R^{n_out} is obtained through the application of an activation function to the weighted sum of all the inputs arriving at each neuron. The activation function is used to introduce non-linearity in the network; some common choices are the ReLU function, the sigmoid, the logistic function, or radial activation functions [17].
In order to better understand how to arrive at the general formula (2.15), we start by considering a simple FNN with a single output and one hidden layer. In this case the final output is given by:

ĥ(x) = σ( Σ_{i=1}^{n_in} w_i x_i + b ),    (2.14)

where σ is the activation function, W = {w_i}_{i=1}^{n_in} represents the weights of the net, and b the bias. Thus, if we consider M layers, the final output can be seen as a weighted sum of its inputs followed by the activation function, where each input can be rewritten in the same way:

h^m_k = σ( Σ_{i=1}^{n_{m−1}} w^m_{ki} h^{m−1}_i + b^m_k ),    k = 1, …, n_m,    (2.15)

where n_m, m = 1, …, M, represents the number of neurons in layer m, n_in and n_out are the neurons in the input and output layers respectively, and W^m = (w^m_{ki})_{ki}, k = 1, …, n_m, i = 1, …, n_{m−1}, indicates the weight matrix related to layer m. Note that the first number in any weight's subscript matches the index of the neuron in the next layer and the second number matches the index of the neuron in the previous layer. One of the main characteristics of an FNN is the ability to learn from observational data during the so-called training process. In this phase the net acquires knowledge from our dataset by minimizing a loss function L, for instance the squared error:

L = Σ_{j=1}^{n_out} ( h_j − ĥ_j(x; W) )^2,    (2.16)

where h = {h_j}_{j=1}^{n_out} represents the expected output and ĥ = ĥ(x; W) = { ĥ_j(x; W) }_{j=1}^{n_out} is the prediction made by our FNN. In order to solve this minimization problem, the Backpropagation algorithm [36] is employed. The model's parameters are thus optimized by adjusting the network's weights as follows:

W(t + 1) = W(t) − η ∇_W L,    (2.17)

where η is the learning rate, appropriately chosen according to the problem at hand, and t represents the training epoch, i.e. a complete repetition of the parameter update involving the complete training dataset at one time. The computation of the gradients is performed by exploiting the chain rule.
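The recursive application of eq. (2.15) is just a loop over layers; the following minimal NumPy sketch (with illustrative names, using the sigmoid as activation) makes this explicit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fnn_forward(x, weights, biases):
    """Forward pass of eq. (2.15): h^m = sigma(W^m h^{m-1} + b^m)."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h
```

Training would then repeat eq. (2.17), updating each W and b with the gradients obtained by backpropagation; frameworks such as PyTorch automate exactly this step.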

The Reduced Artificial Neural Networks
We provide in this section a rigorous description of the proposed framework (summarized in fig. 2 and algorithm 1), whose final goal is to reduce the dimensionality of a generic artificial neural network (ANN). Indeed, the only assumption we make regarding the original network is that it is composed of L layers.
Network splitting. At the beginning, the original network ANN : R^{n_0} → R^{n_L} is split into two parts, such that the first l layers constitute the pre-model while the last L − l layers form the so-called post-model. Describing the network as a composition of functions, we can formally define the pre- and the post-model as:

ANN^l_pre = f_l ∘ … ∘ f_1,    ANN^l_post = f_L ∘ … ∘ f_{l+1},    (3.1)

where the functions f_j : R^{n_{j−1}} → R^{n_j}, for j = 1, …, L, represent the different layers of the network (e.g. convolutional, fully connected, batch-normalization, ReLU, pooling layers). The original model can then be rewritten as

ANN = ANN^l_post ∘ ANN^l_pre,    (3.2)

for any 1 ≤ l < L. The reduction of the network effectively happens by approximating the post-model, which means that the pre-model is copied unchanged from the original network to the reduced one.
Before proceeding with the algorithm explaining how the post-model is approximated, we specify that the index l, denoting the cut-off layer, is the only parameter of this initial step, and it plays an important role in the final outcome. This index defines how many layers of the original network are kept in the reduced architecture, controlling, in short, how much information of the original network we are discarding. It should therefore be chosen empirically, based on considerations about the network and the dataset at hand, balancing the final accuracy and the compression ratio.
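The splitting of eqs. (3.1)-(3.2) can be sketched in plain Python, viewing the network as a list of layer functions; the helper names below are illustrative:

```python
def compose(layers, x):
    """Apply layers f_1, ..., f_k in order: f_k(... f_1(x))."""
    for f in layers:
        x = f(x)
    return x

def split_network(layers, l):
    """Split a network, seen as a list of layer functions, at cut-off index l."""
    pre = lambda x: compose(layers[:l], x)    # ANN_pre  = f_l o ... o f_1
    post = lambda x: compose(layers[l:], x)   # ANN_post = f_L o ... o f_{l+1}
    return pre, post
```

In PyTorch, the analogous operation slices the children of the model: the first l modules form the pre-model, and composing pre- and post-model reproduces the original network exactly, as in eq. (3.2).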
Algorithm 1 Pseudo-code for the construction of the reduced artificial neural network.
Inputs:
• a dataset with N_train input samples D_0 = {x^0_j}_{j=1}^{N_train},
• an artificial neural network ANN,
• the real outputs {y_j}_{j=1}^{N_train} of the ANN,
• the reduced dimension r,
• the index of the cut-off layer l.

Dimensionality reduction. As introduced previously, we aim to project the output x^l of the pre-model onto a low-dimensional space using the following reduction techniques:
• Active Subspaces: as described in section 2.1.1 and in [9], we consider a scalar function g^l of the pre-model output, obtained by composing the post-model with the loss of the network, in order to extract the most important directions and determine the projection matrix used to reduce the pre-model output.
• Proper Orthogonal Decomposition: as discussed in section 2.1.2, the SVD decomposition eq. (2.6) is exploited to compute the projection matrix Ψ_r and thus the reduced solution z = Ψ_r^T x^l.

Input-output mapping. The last part of the reduced net is dedicated to the classification of the output coming from the reduction layer. To do so, we have employed two different techniques:
• the Polynomial Chaos Expansion, introduced in section 2.2.1. As described in eq. (2.9), the final output of the network y = ANN(x^0) ∈ R^{n_L}, i.e. the true response of the model, can be approximated in the following way:

y ≈ Σ_α c_α φ_α(z),    (3.5)

where the φ_α(z) are the multivariate polynomial functions chosen based on the probability density function ρ associated with z. The estimation of the coefficients c_α is carried out by solving a minimization problem analogous to eq. (2.11):

min_{c_α} Σ_{j=1}^{N_train} ( y_j − Σ_α c_α φ_α(z_j) )^2.    (3.6)

• a Feedforward Neural Network, described in section 2.2.2. In this case, the output of the reduction layer z coincides with the network input and, by using eq. (2.15), the final output ŷ of the reduced net is determined by:

ŷ_k = σ( Σ_i w_{ki} z_i + b_k ),    k = 1, …, n_out,    (3.7)

where n_out corresponds to the number of categories composing the dataset at hand, and σ is the Softplus function:

σ(x) = log(1 + e^x).    (3.8)
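Putting the pieces together, the forward pass of the reduced network chains the copied pre-model, the projection, and the input-output mapping. The following is a minimal sketch with illustrative names, where `P` stands for the projection matrix (V_1 for AS, Ψ_r for POD) and `io_map` for the fitted PCE or FNN:

```python
import numpy as np

def reduced_forward(x0, pre_model, P, io_map):
    """Reduced net: x^l = ANN_pre(x^0);  z = P^T x^l;  y_hat = io_map(z)."""
    xl = pre_model(x0)      # first l layers, copied from the original network
    z = P.T @ xl            # projection onto the reduced space
    return io_map(z)        # PCE or FNN response surface
```
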

Training phase
Once the reduced version of the network at hand is constructed, we need to train it. Following [9], the knowledge distillation technique [22] is used for the training phase of the reduced ANN. A knowledge distillation framework contains a large pre-trained teacher model, our full network, and a small student model, in our case ANN_red. The main goal is therefore to efficiently train the student network under the guide of the teacher network, in order to gain a comparable or even superior performance. Let y be a vector of logits, i.e. the output of the last layer of a deep neural network. The probability p_i that the input belongs to the i-th class is given by the softmax function:

p_i = exp(y_i) / Σ_j exp(y_j).    (3.9)

As described in [22], a temperature factor T needs to be introduced in order to control the importance of each target:

p_i(y, T) = exp(y_i / T) / Σ_j exp(y_j / T),    (3.10)

where if T → ∞ all classes have the same probability, whereas if T → 0 the targets p_i become one-hot labels.
First of all, we need to define the distillation loss, which matches the logits of the teacher model and the student model. As done in [9], response-based knowledge is used to transfer the knowledge from the teacher to the student by mimicking the final prediction of the full net. In this case the distillation loss [18,22] is given by:

L_D(p(y_t, T), p(y_s, T)) = L_KL(p(y_t, T), p(y_s, T)),    (3.11)

where y_t and y_s indicate the logits of the teacher and student networks, respectively, whereas L_KL represents the Kullback-Leibler (KL) divergence loss [26]:

L_KL(p(y_t, T), p(y_s, T)) = T^2 Σ_j p_j(y_t, T) log( p_j(y_t, T) / p_j(y_s, T) ).    (3.12)

The student loss is defined as the cross-entropy loss between the ground truth label and the logits of the student network [18]:

L_S = L_CE(ŷ, p(y_s, 1)),    (3.13)

where ŷ is a ground truth vector, having only the component corresponding to the ground truth label of the training sample set to 1 and the others set to 0, and L_CE represents the cross-entropy loss:

L_CE(ŷ, p(y_s, 1)) = − Σ_i ŷ_i log p_i(y_s, 1).    (3.14)

As can be observed, both losses, eq. (3.11) and eq. (3.13), use the same logits of the student model but with different temperatures: T = τ > 1 in the distillation loss, and T = 1 in the student loss. Finally, the total loss is a weighted sum of the distillation loss and the student loss:

L(x^0, W) = λ L_D + (1 − λ) L_S,    (3.15)

where λ is the regularization parameter, x^0 is an input vector of the training set, and W are the parameters of the student model.
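The temperature-scaled softmax and the two loss terms can be sketched in NumPy as follows. This is an illustrative sketch of eqs. (3.10)-(3.15), not the paper's PyTorch implementation, and the weighting λ L_D + (1 − λ) L_S reflects the reconstruction above:

```python
import numpy as np

def softmax(y, T=1.0):
    e = np.exp((y - y.max()) / T)   # shift by the max for numerical stability
    return e / e.sum()

def distillation_loss(y_t, y_s, T):
    """Temperature-scaled KL divergence of eq. (3.12)."""
    p_t, p_s = softmax(y_t, T), softmax(y_s, T)
    return T**2 * np.sum(p_t * np.log(p_t / p_s))

def total_loss(y_t, y_s, y_true, T, lam):
    """Weighted sum of distillation and student losses, eq. (3.15)."""
    student = -np.sum(y_true * np.log(softmax(y_s, 1.0)))  # cross entropy
    return lam * distillation_loss(y_t, y_s, T) + (1 - lam) * student
```

As a sanity check, the distillation term vanishes when the student reproduces the teacher's logits, and it is non-negative otherwise, as expected for a KL divergence.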

Numerical results
In this section we present a comparison of the results obtained with the different proposed reduction methods, in terms of final accuracy, memory allocation, and speed of the procedure.

VGG-16
As test network we use a Convolutional Neural Network (CNN), a class of ANN commonly applied to the problem of image recognition [35]. Over the last 10 years, several CNN architectures have been presented [2,3] to tackle this problem, e.g. AlexNet, ResNet, Inception, VGGNet. As a starting point to test our methods, we employ one of the VGG network architectures: VGG-16 [45]. As can be seen in fig. 3, the architecture is composed of:
• 13 convolutional blocks: each block is made of a convolutional layer followed by a non-linear layer, i.e. the application of the activation function, in this case ReLU;
• 5 max-pooling layers;
• 3 fully-connected layers.
The total number of layers with tunable parameters is 16, of which 13 are convolutional layers and 3 are fully connected layers. For this reason this ConvNet is named VGG-16.

Dataset
For training and testing our net we use:
• the CIFAR-10 dataset [27], a computer-vision dataset used for object recognition. It consists of 60000 32 × 32 colour images divided into 10 non-overlapping classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck;
• a custom dataset, composed of 3448 32 × 32 colour images organized in 4 classes: 3 non-overlapping classes and a mixed one, characterized by pictures with objects of different categories present at the same time.

Software
In order to implement and construct the reduced version of the convolutional neural network presented in the previous sections, we employed PyTorch [34] as the development environment. We then used the open-source Python library SciPy [48] for scientific computing and the open-source Python package ATHENA [39] for the actual computation of the active subspaces.

Results
We now present the results of the reduced nets constructed starting from VGG-16 and based on CIFAR-10 and our custom dataset. First of all, the original network VGG-16 has been trained on each of the presented datasets for 60 epochs. From table 2 and table 3, it can be seen that at the end of this training VGG-16 reaches a good accuracy: 77.98% for CIFAR-10 and 95.65% for the custom dataset. We report the results obtained with different reduced versions of VGG-16, constructed following the steps of algorithm 1 and using three cut-off layers l: 5, 6, and 7, as done in [9]. We remark that, in the case of dimensionality reduction with the Active Subspaces technique, we employed the Frequent Directions method [16] implemented inside ATHENA to compute the AS. The value chosen for the parameter r, i.e. the dimension of the reduced space, is 50 both for AS and for POD, in analogy with [9], where considerations on the structural analysis of VGG-16 can be found.
When a FNN is used to classify our pictures, we trained it for 500 epochs with our dataset before re-training the whole reduced net. In table 1, we summarize the results obtained by training the reduced net using different FNN architectures, i.e. a different number of hidden layers and of hidden neurons, the latter kept constant across the hidden layers of the net. In particular, we compare the storage needed for the FNN and the accuracy of the reduced net at epoch 0, i.e. after its initialization, and at epoch 10, i.e. after the re-training of the whole reduced net. It can be seen that increasing the number of hidden layers and of hidden neurons does not yield a gain in accuracy. For this reason, based on considerations about the final accuracy and the memory allocation of the FNN (see table 1 for details), we decided to use the following architectures:
• CIFAR-10: FNN with 50 input neurons, 10 output neurons, and one hidden layer with 20 hidden neurons;
• custom dataset: FNN with 50 input neurons, 4 output neurons, and one hidden layer with 10 hidden neurons.
After these steps, the reduced net has been re-trained for 10 epochs on each dataset. The results obtained are summarized in table 2 and table 3, showing a comparison between the different reduced nets in terms of accuracy (before and after the final training), memory storage, and time needed for the initialization and training of the reduced net. As discussed previously, for each reduced net, i.e. AS+PCE, AS+FNN, POD+FNN, the tables report the results achieved using three different cut-off layers: 5, 6, and 7.
Information on memory allocation is important since in our context (in particular for the custom dataset) we need to include a convolutional neural network in an embedded system with particular constraints on the storage. In both table 2 and table 3 it can be seen that the allocation required for the created reduced nets is decreased with respect to that of the original VGG-16. In fact, the checkpoint file stored for the full net occupies 56.14 MB, whereas those of its reduced versions occupy less than 10 MB. It can also be noted that the use of a FNN instead of the PCE saves memory by two orders of magnitude: 10^{-4} against 10^{-2}. Table 2 also shows that, in the POD+FNN case, the net does not require an additional training with the whole dataset, since after the initialization, i.e. at epoch 0, its accuracy is acceptable and for index 7 is already high. The immediate consequence is the saving of the time needed to obtain a performing network, which is in the order of 5 hours. These considerations remain consistent using a different set of data, such as the custom dataset at hand. Table 3 also reports how, after the initialization, POD+FNN already has a greater accuracy than VGG-16 for all choices of l.
For both cases, it can be observed that the proposed reduced CNNs achieve an accuracy comparable to (and in most cases greater than) that of the original VGG-16, but with much smaller storage. Furthermore, increasing the cut-off layer index l leads to higher accuracy, since we retain more of the original features, but on the other hand to a smaller compression ratio. For this reason, as pointed out before, the right choice for l is a trade-off between the level of accuracy and the reduction, and depends also on the field of application.

Conclusions and perspectives
In this paper we proposed a generic framework for the reduction of neural networks, which aims to decrease the number of layers in the net at the expense of a minimal error in the final prediction. The reduction occurs by replacing a finite set of the network layers with a response surface, involving also dimensionality reduction techniques in order to operate on a low-dimensional space. We analyzed different dimensionality reduction methods, and we investigated how the combination of these techniques with different input-output mappings can lead to differences in the final accuracy. The creation of this reduced network has one main goal: the compression of an existing deep neural network architecture so that it can be included in an embedded system with memory and space constraints. The numerical experiments on a convolutional neural network show that the proposed techniques can produce a reduced version (in terms of number of layers and parameters) of an existing network with savings in memory allocation, while keeping the same good level of accuracy of the original CNN. Furthermore, the results show that the methodology combining POD with FNN also leads to a decrease in training time, which makes the proposed framework preferable to the inspiring method proposed in [9]. The main drawback of this technique is the necessity of starting from an already trained network. Indeed, despite the saved space and memory, the learning procedure in many problems remains the real bottleneck. In these cases, our framework could be extended to reduce the architecture dimension during the training, and not only once it is finished, hopefully inducing a remarkable speedup in the optimization step.

Figure 1 :
Figure 1: Schematic structure of a Feedforward Neural Network with 2 hidden layers.

1: ANN^l_pre, ANN^l_post = splitting_net(ANN, l)
2: x^l = ANN^l_pre(x^0)
3: z = reduce(x^l, r)
4: ŷ = input_output_map(z, y)
5: training of the constructed reduced net
Output: reduced net ANN_red

Figure 2 :
Figure 2: Graphical representation of the reduction method proposed for a CNN.

Table 1 :
Results obtained for the reduced net POD+FNN (7) trained on CIFAR10 with different structures for the FNN.

Table 3 :
Results obtained with a custom dataset.