1 Introduction and motivations

Neural networks are a widespread machine learning technique, increasingly employed in fields such as computer vision [1,2,3], natural language processing [4, 5], robotics [6, 7], and speech recognition [8, 9]. The accuracy of such models is closely related to the number of layers, neurons, and inputs [10,11,12]; therefore, to tackle more complex problems, these architectures are forced to grow in depth. While on the one hand this yields increasing precision, on the other hand the high number of degrees of freedom translates into a longer optimization step and, from a practical point of view, into a larger architecture to manage. The dimension of the network is rarely considered a bottleneck of this methodology, but the diffusion of neural networks in many engineering fields has led to their employment also in embedded systems [13,14,15], which typically offer limited hardware. Deep vision algorithms are indeed developed on workstations with high computational resources, posing a challenge when deploying them in industrial applications. The vision devices in which these nets need to be integrated are often characterized by restricted memory and low CPU performance [16,17,18]. In these contexts the size of the architecture can thus become an additional constraint, requiring a reduction in the network's degrees of freedom.

Fig. 1 Graphical representation of the problem and the proposed solution, as described in this contribution

Finding the intrinsic dimension of neural networks is a very challenging task and, to the best of the authors' knowledge, lacks rigorous theoretical proofs. Various methods have been proposed, including network pruning and sharing [19,20,21,22,23], low-rank matrix and tensor factorization [24,25,26,27], parameter quantization [28,29,30], and knowledge distillation [31,32,33,34]. In this contribution (see Fig. 1), we present an extension of the idea explored in [35], where the Active Subspace (AS) property and Polynomial Chaos Expansion (PCE) are exploited to provide a reduced and more robust version of the original network. While that work focused on analyzing the capability of AS to reduce deep architectures, here we aim to provide a generic framework for neural network reduction, investigating mathematical tools other than AS and PCE. Mimicking the procedure presented in [35], the original architecture is initially split into two cascading parts: the pre- and post-model. We assume that the second one brings a negligible contribution to the final outcome, allowing us to approximate that part of the model without introducing a significant error. A response surface (or, in more general terms, an input-output mapping) is indeed built to fit the data, replacing the last layers of the network. This response surface may belong to a high-dimensional space, since its input dimension equals the dimension of the output features of the pre-model. Consequently, in order to keep the reduction computationally affordable, we also need to reduce the dimensionality of the pre-model outputs, which, it should be noted, are also the input parameters of the response surface. By combining all these ingredients, we obtain a reduced version of the network that only includes a few of the initial layers, but achieves a level of accuracy comparable to the full model. It is important to specify that the numerical experiments we are about to present exclusively involve Convolutional Neural Networks (CNNs), but the methodology can potentially be applied to other models as well.

In this contribution, we examine various tools for the dimensionality reduction and for the response surface. In addition to AS and PCE, already tested in the aforementioned reference, we employ Proper Orthogonal Decomposition (POD) and a Feedforward Neural Network (FNN). The former, similarly to AS, is a well-established technique for Model Order Reduction [36,37,38], which compresses the data by projecting it onto a lower-dimensional space. The FNN, on the other hand, is employed to construct the response surface as an alternative to PCE. The advantage of FNN over PCE is twofold: i) the simplified input-output mapping (thanks to the low-dimensional space) allows for an FNN with few layers and neurons, further reducing the already minimal space demanded by the PCE method; ii) from a programming perspective, the possibility to approximate a part of the neural network with another network makes the software integration easier, especially when the hosting system is embedded.

The article is organized as follows. Section 2 provides an algorithmic overview of all the numerical methods involved in the reduction framework. This includes an analysis of AS in Section 2.1.1, POD in Section 2.1.2, PCE in Section 2.2.1, and FNN in Section 2.2.2. In Section 3, we delve into the details of the framework used to reduce the neural networks. Section 4 is dedicated to presenting the results obtained by reducing benchmark CNNs designed for image recognition with the proposed methodology. We conduct this analysis using three different datasets during the initial learning step, investigating the dependency of the results on the original problem. Finally, in Section 5 we summarize the entire procedure and propose some future perspectives to enhance the framework.

2 Numerical tools

In this section we introduce all the techniques employed for the reduction of the network, in order to facilitate the comprehension of the framework discussed in Section 3.

2.1 Dimensionality reduction techniques

This subsection provides an algorithmic overview of the reduction methods examined in this contribution: the Active Subspace (AS) property and the Proper Orthogonal Decomposition (POD). Widely employed in the reduced order modeling community, such techniques are used here to decrease the dimensionality of the intermediate convolutional features; the specific details of their use are discussed in the next section. We only specify that, while this work concentrates on AS and POD, the framework is generic, allowing these two methods to be replaced with other dimensionality reduction techniques.

2.1.1 Active subspaces

The Active Subspaces (AS) method [39, 40] is a reduction tool that identifies important directions in the parameter space by exploiting the gradients of the function of interest. This information allows a rotational transformation of the domain, yielding an approximation of the original function in a lower dimension. Let \(\varvec{\mu } = [\mu _1 \dots \mu _n]^T \in \mathbb {R}^{n}\) represent an n-dimensional variable with an associated probability density function \(\rho (\varvec{\mu })\), and let g be the function of interest, \(g(\varvec{\mu }): {\mathbb {R}}^n \rightarrow {\mathbb {R}}\). We assume here that g is scalar and continuous (for the vector-valued extension see [41, 42]). Starting from this, an uncentered covariance matrix \({\textbf{C}}\) of the gradient of g can be constructed by considering the average of the outer product of the gradient with itself:

$$\begin{aligned} {\textbf{C}}=\mathbb {E}[\nabla g(\varvec{\mu })\nabla g(\varvec{\mu })^T] = \int (\nabla _{\varvec{\mu }} g)(\nabla _{\varvec{\mu }} g)^T \rho \text {d}\varvec{\mu }, \end{aligned}$$
(1)

where the symbol \(\mathbb {E}[\cdot ]\) denotes the expected value, and \(\nabla _{\varvec{\mu }} g \equiv \nabla g(\varvec{\mu })\). We assume the gradients are computed during the simulation; if not provided, they can be approximated with techniques such as local linear models, global models, finite differences, or Gaussian processes [43,44,45]. Since \({\textbf{C}}\) is symmetric, it admits the following eigenvalue decomposition:

$$\begin{aligned} {\textbf{C}}= {\textbf{V}}{\varvec{\Lambda }}{\textbf{V}}^T, \quad {\varvec{\Lambda }}= \textrm{diag}(\lambda _1, \dots ,\lambda _n), ~~ \lambda _1\ge \cdots \ge \lambda _n\ge 0, \end{aligned}$$
(2)

where \({\textbf{V}}\) is the \(n \times n\) orthogonal matrix whose columns \(\{\textbf{v}^1, \dots , \textbf{v}^n \}\) are the normalized eigenvectors of \({\textbf{C}}\), whereas \({\varvec{\Lambda }}\) is a diagonal matrix containing the corresponding non-negative eigenvalues \(\lambda _i\), for \(i=1,\dots , n\), arranged in descending order.

We can decompose these two matrices as:

$$\begin{aligned} {\varvec{\Lambda }}&= \begin{bmatrix} {\varvec{\Lambda }}_1 &{} \\ &{}{\varvec{\Lambda }}_2 \end{bmatrix},\nonumber \\ {\textbf{V}}&= [{\textbf{V}}_1~~ {\textbf{V}}_2], \qquad {\textbf{V}}_1\in {\mathbb {R}}^{n\times n_{\text {AS}}}, ~~{\textbf{V}}_2\in {\mathbb {R}}^{n\times (n-n_{\text {AS}})}. \end{aligned}$$
(3)

The space spanned by \({\textbf{V}}_1\) columns is called the active subspace of dimension \(n_{\textrm{AS}} < n\), whereas the inactive subspace is defined as the range of the remaining eigenvectors in \({{\textbf{V}}}_2\). Once we have defined these spaces, the input \({\varvec{\mu }\in \mathbb {R}^n}\) can be reduced to a low-dimensional vector \(\tilde{\varvec{\mu }}_1\in {\mathbb {R}}^{n_{\text {AS}}}\) using \({\textbf{V}}_1\) as projection map. To be more precise, any \({\varvec{\mu }\in {\mathbb {R}}^n}\) can be expressed in this way using the decomposition in Eq. 3 and the properties of \({\textbf{V}}\):

$$\begin{aligned} \varvec{\mu }= {\textbf{V}}{\textbf{V}}^T\varvec{\mu }= {\textbf{V}}_1{\textbf{V}}_1^T\varvec{\mu }+ {\textbf{V}}_2{\textbf{V}}_2^T\varvec{\mu }= {\textbf{V}}_1\tilde{\varvec{\mu }}_1 + {\textbf{V}}_2\tilde{\varvec{\mu }}_2, \end{aligned}$$
(4)

where the two new variables \(\tilde{\varvec{\mu }}_1\) and \(\tilde{\varvec{\mu }}_2\) are the active and inactive variables, respectively:

$$\begin{aligned} \tilde{\varvec{\mu }}_1 = {\textbf{V}}_1^T\varvec{\mu }\in {\mathbb {R}}^{n_{\text {AS}}}, \qquad ~ \tilde{\varvec{\mu }}_2 = {\textbf{V}}_2^T\varvec{\mu }\in {\mathbb {R}}^{n-n_{\text {AS}}}. \end{aligned}$$
(5)

For the actual computations of the AS, we have used the open-source Python package called ATHENA [46].
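For illustration, the construction of Eqs. 1–5 can be sketched in a few lines of NumPy, assuming the gradients of g have already been collected for a set of samples drawn from \(\rho \) (the function and variable names below are ours and do not reproduce the ATHENA API):

```python
import numpy as np

def active_subspace(gradients, n_as):
    # gradients: (n_samples, n) array, row i = grad g(mu^i) with mu^i drawn from rho
    n_samples = gradients.shape[0]
    # Monte Carlo estimate of the uncentered covariance C = E[grad g grad g^T] (Eq. 1)
    C = gradients.T @ gradients / n_samples
    # eigendecomposition of the symmetric matrix C (Eq. 2); eigh sorts eigenvalues ascending
    eigvals, eigvecs = np.linalg.eigh(C)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder them in descending order
    V1 = eigvecs[:, :n_as]                               # active directions (Eq. 3)
    return eigvals, V1

# projection of a sample mu onto the active subspace (Eq. 5): mu_tilde_1 = V1.T @ mu
```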

2.1.2 Proper orthogonal decomposition

In this section, we discuss the Proper Orthogonal Decomposition (POD) approach of Reduced Order Modeling [36,37,38, 47] for reducing the number of degrees of freedom of a parametric system. Specifically, we focus on the POD with interpolation (PODI) method [48,49,50].

Let \(\textbf{S} = [{\textbf{u}}^1\dots {\textbf{u}}^{n_S}]\) be the matrix of snapshots, i.e. the full order system outputs \({\textbf{u}}^i\in {\mathbb {R}}^N\). Once these solutions are collected, we aim to describe them as a linear combination of a few main structures, the POD modes, and thus project them onto a low dimensional space spanned by these modes. To calculate the POD modes, we need to compute the singular value decomposition (SVD) of the snapshots matrix \(\textbf{S}\):

$$\begin{aligned} \textbf{S} = \varvec{\Psi }\varvec{\Sigma }\varvec{\Theta }^T, \end{aligned}$$
(6)

where the left-singular vectors, i.e. the columns of the unitary matrix \(\varvec{\Psi }\), are the POD modes, and the diagonal matrix \(\varvec{\Sigma }\) contains the corresponding singular values in decreasing order. Therefore, by selecting the first modes we retain only the most energetic ones and we can construct a reduced space onto which we project the high-fidelity solutions. As a result, we obtain:

$$\begin{aligned} \textbf{S}^{\text {POD}}=\varvec{\Psi }^T_{N_{\text {POD}}}\textbf{S}, \end{aligned}$$
(7)

where \(\varvec{\Psi }_{N_{\text {POD}}}\) is the matrix containing only the first \(N_{\text {POD}}\) modes, and the columns of \(\textbf{S}^{\text {POD}}\) represent the reduced snapshots \(\tilde{{\textbf{u}}}^i\in {\mathbb {R}}^{N_{\text {POD}}}\).
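A minimal NumPy sketch of Eqs. 6–7, assuming the snapshots have already been collected column-wise in the matrix S (function and variable names are ours):

```python
import numpy as np

def pod_reduce(S, n_pod):
    # S: (N, n_S) snapshot matrix, one full-order solution per column
    Psi, sigma, _ = np.linalg.svd(S, full_matrices=False)   # S = Psi Sigma Theta^T (Eq. 6)
    Psi_r = Psi[:, :n_pod]            # retain only the n_pod most energetic modes
    S_pod = Psi_r.T @ S               # reduced snapshots, shape (n_pod, n_S) (Eq. 7)
    return Psi_r, S_pod
```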

2.2 Input–output mapping

After reducing the dimensions of the outputs of the intermediate layer, we need to establish a correlation between these outputs and the final output of the original network. For example, in an image identification problem, this would involve determining the classes to which the image belongs. To achieve this, we construct an input–output mapping using the input dataset. The following subsections provide an algorithmic overview of the two methods explored to approximate this mapping: the Polynomial Chaos Expansion (PCE) [51] and the Feedforward Neural Network (FNN) [52].

2.2.1 Polynomial chaos expansion

The Polynomial Chaos Expansion (PCE) theory was initially proposed by Wiener in [53], demonstrating that a real-valued random variable \(X:{\mathbb {R}}^R\rightarrow {\mathbb {R}}\) can be decomposed in the following manner:

$$\begin{aligned} X(\varvec{\xi }) = \sum _{j=0}^{\infty } c_j {\varvec{\phi }}_j(\varvec{\xi }), \end{aligned}$$
(8)

i.e. as an infinite sum of orthogonal polynomials weighted by unknown deterministic coefficients \(c_j\) [54]. The vector \(\varvec{\xi } = (\xi _1, \dots , \xi _R)\) represents the multi-dimensional random vector, where each element is associated with uncertain input parameters, while \({\varvec{\phi }}_j(\varvec{\xi })\) are multivariate orthogonal polynomials, that can be decomposed into products of one-dimensional orthogonal polynomials with different variables.

We can approximate the infinite sum in Eq. 8 by truncating it at the \((P+1)\)-th term, such that:

$$\begin{aligned} X(\varvec{\xi }) \approx \sum _{j=0}^{P} c_j {\varvec{\phi }}_j(\varvec{\xi }). \end{aligned}$$
(9)

The number of unknown coefficients in this summation is given by \(P+1 = \frac{(p+R)!}{p!R!}\) [55], where p is the degree of the polynomial we are considering in the R-dimensional space.
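For example, with \(R = 50\) input parameters (the reduced dimension adopted in the experiments of Section 4) and a modest degree \(p = 2\), the expansion already counts \(P+1 = \frac{(2+50)!}{2!\,50!} = 1326\) coefficients to estimate, which motivates applying the PCE only after the dimensionality reduction of Section 2.1.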

When the parameters \(\xi _1, \dots , \xi _R\) are independent, \( {\varvec{\phi }}_j(\varvec{\xi })\) can be decomposed into products of one-dimensional functions:

$$\begin{aligned} {\varvec{\phi }}_j(\varvec{\xi })&= {\varvec{\phi }}_j(\xi _1, \dots , \xi _R) = \prod _{k=1}^R \phi _k^{d_k}(\xi _k), \qquad j=0,\dots ,P, \nonumber \\ d_k&= 0,\dots ,p \quad \text {s.t.} ~ \sum _{k=1}^{R}d_k\le p. \end{aligned}$$
(10)

To determine the PCE, we need to find out the polynomial chaos expansion coefficients \(c_j\) for \(j = 0, \dots , P\), and the one-dimensional orthogonal polynomials \(\phi _k^{d_k},~ k=1,\dots ,R\), of degree \(d_k\).

Based on the work of Askey and Wilson [56], orthogonal polynomial families can be associated with different input distributions. One possible choice is the Gaussian distribution with the related Hermite polynomials.

The estimation of the coefficients of PCE can then be carried out in different ways [57, 58]. One method involves using a projection technique based on the orthogonality of the polynomials. Another method, which we will describe, is a regression-based approach.

In order to determine the coefficients \(c_j\), we need to solve a minimization problem:

$$\begin{aligned} \textbf{c} = \mathrm {arg\,min}_{\textbf{c}^*\in {\mathbb {R}}^{P+1}} \frac{1}{N_{\text {PCE}}}\sum _{i=1}^{N_{\text {PCE}}} \left( \hat{X}^i - \sum _{j=0}^{P} c^*_j {\varvec{\phi }}_j(\varvec{\xi }^{i})\right) ^2, \end{aligned}$$
(11)

where \(N_{\text {PCE}}\) indicates the total number of realizations of the input vector we are considering, whereas \(\hat{X}^i\) represents the observed model output for the i-th realization. To solve Eq. 11 we consider the following matrix:

$$\begin{aligned} {\varvec{\Phi }}= \begin{pmatrix} {\varvec{\phi }}_0(\varvec{\xi }^{1}) &{} {\varvec{\phi }}_1(\varvec{\xi }^{1}) &{} \cdots &{} {\varvec{\phi }}_{P}(\varvec{\xi }^{1}) \\ {\varvec{\phi }}_0(\varvec{\xi }^{2}) &{} {\varvec{\phi }}_1(\varvec{\xi }^{2}) &{} \cdots &{} {\varvec{\phi }}_{P}(\varvec{\xi }^{2}) \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {\varvec{\phi }}_0(\varvec{\xi }^{N_{\text {PCE}}}) &{} {\varvec{\phi }}_1(\varvec{\xi }^{N_{\text {PCE}}}) &{} \cdots &{} {\varvec{\phi }}_{P}(\varvec{\xi }^{N_{\text {PCE}}}) \end{pmatrix}. \end{aligned}$$
(12)
Fig. 2 Schematic structure of a Feedforward Neural Network with 2 hidden layers

Thus, the solution of Eq. 11 is computed using a least-square optimization approach, as shown in Eq. 13:

$$\begin{aligned} \textbf{c} = ({\varvec{\Phi }}^T{\varvec{\Phi }})^{-1} {\varvec{\Phi }}^T \hat{X}. \end{aligned}$$
(13)

It is important to emphasize that if the matrix \({\varvec{\Phi }}^T{\varvec{\Phi }}\) is ill-conditioned, as may happen in practice, the singular value decomposition should be employed to compute the coefficients.
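As an illustration, the regression approach of Eqs. 11–13 can be sketched as follows for Gaussian inputs and probabilists' Hermite polynomials; the helper functions are ours, and the brute-force multi-index enumeration shown here is only practical for small R:

```python
import numpy as np
from itertools import product
from numpy.polynomial.hermite_e import hermeval

def multi_indices(R, p):
    # all multi-indices (d_1, ..., d_R) with total degree <= p (Eq. 10)
    return [d for d in product(range(p + 1), repeat=R) if sum(d) <= p]

def design_matrix(xi, indices):
    # evaluate the multivariate Hermite basis at the samples xi (shape N_PCE x R),
    # one column per multi-index (Eq. 12)
    N, R = xi.shape
    Phi = np.ones((N, len(indices)))
    for j, d in enumerate(indices):
        for k in range(R):
            if d[k] > 0:
                He = np.zeros(d[k] + 1)
                He[-1] = 1.0                       # coefficients selecting He_{d_k}
                Phi[:, j] *= hermeval(xi[:, k], He)
    return Phi

def fit_pce(xi, y, p):
    # least-squares estimate of the PCE coefficients (Eqs. 11 and 13);
    # lstsq relies on an SVD, which also covers the ill-conditioned case
    indices = multi_indices(xi.shape[1], p)
    Phi = design_matrix(xi, indices)
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return c, indices
```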

2.2.2 Feedforward neural network

A Feedforward Neural Network (FNN), also known as multilayer perceptron, is a popular neural network model commonly used for function regression [52]. As depicted in Fig. 2, it mainly comprises an input layer, an output layer, and a certain number of hidden layers, whose processing units are called neurons. Each neuron is characterized by a weight vector that determines the strength of its connections with the neurons in the subsequent layer.

From a more technical perspective, let \(\tilde{{\textbf{x}}}\in {\mathbb {R}}^{n_{\text {in}}}\) represent the input vector and M denote the total number of hidden layers of the FNN. Each neuron applies an activation function to the weighted sum of the inputs it receives, and the last layer collects these values into the output vector \({\textbf{h}}\in {\mathbb {R}}^{n_{\text {out}}}\). The role of the activation function is to introduce non-linearity into the network. Numerous options are available [10, 60]; common choices include the ReLU function, the sigmoid (logistic) function, and radial activation functions.

To better understand the derivation of the general formula (15), we start by considering a FNN that comprises a single output and one hidden layer. In this scenario, the final output can be expressed as:

$$\begin{aligned} {\textbf{h}}= \sigma \left( \sum _{i=1}^{n_{\text {in}}} w_i \tilde{x}_i +b\right) , \end{aligned}$$
(14)

where \(\sigma \) is the activation function, \(W = \{w_i\}_{i=1}^{n_{\text {in}}}\) represents the weights of the net, and b the bias. Therefore, when considering M hidden layers, the final output can be seen as a weighted sum of its inputs followed by the activation function, where each input can in turn be rewritten using the same approach described in Eq. 14:

$$\begin{aligned} h_j&= \sigma \left( \sum _{i=1}^{n_{M}} w^{(M+1)}_{ji} \tilde{x}^{(M)}_i\right) = \sigma \left( \sum _{i=1}^{n_{M}} w^{(M+1)}_{ji} \left( \sigma \left( \sum _{q=1}^{n_{M-1}} w^{(M)}_{iq} \tilde{x}^{(M-1)}_q\right) \right) \right) = \dots \nonumber \\&= \sigma \left( \sum _{i=1}^{n_{M}} w^{(M+1)}_{ji} \left( \sigma \left( \sum _{q=1}^{n_{M-1}} w^{(M)}_{iq} \left( \sigma \left( \dots \left( \sigma \left( \sum _{k=1}^{n_{\text {in}}} w^{(1)}_{sk} \tilde{x}_k\right) \right) \right) \right) \right) \right) \right) , \qquad j = 1, \dots , n_{\text {out}}, \end{aligned}$$
(15)

where \(n_m\), \(m=1,\dots ,M\), represents the number of neurons in layer m, whereas \(n_{\text {in}}\) and \(n_{\text {out}}\) are the neurons in the input and output layers respectively. \(W^m= (w_{ki}^{(m)})_{ki},~ k=1,\dots ,n_m, ~ i=1,\dots ,n_{m-1}\) indicates then the weight matrix related to layer m. Note that the first number in any weight’s subscript matches the index of the neuron in the next layer and the second number matches the index of the neuron in the previous layer.

Fig. 3 Graphical representation of the reduction method proposed for a CNN

Once the FNN architecture has been chosen, the model has to be trained to perform the desired task. One of the main features of an FNN is indeed its ability to learn from observational data during the so-called training process. In this phase, the net acquires knowledge from the dataset by minimizing the loss function \(\mathcal {L}\):

$$\begin{aligned} \min _W\left\{ \frac{1}{n_{\text {out}}}\sum _{i=1}^{n_{\text {out}}} \mathcal {L}(h_i,\hat{h}_i) \right\} , \end{aligned}$$
(16)

where \({\textbf{h}}= \{h_j \}_{j=1}^{n_{\text {out}}}\) represents the expected output and \(\hat{{\textbf{h}}}= \hat{{\textbf{h}}}(\tilde{{\textbf{x}}}; W)= \{\hat{h}_j(\tilde{{\textbf{x}}}; W) \}_{j=1}^{n_{\text {out}}}\) is the prediction made by the FNN. To solve this minimization problem, the backpropagation algorithm [62] is commonly employed. Consequently, the model's parameters are optimized by adjusting the network's weights according to the following update:

$$\begin{aligned} w_{ki}^{(m),t} = w_{ki}^{(m), t-1} -\epsilon \frac{d\mathcal {L}}{dw_{ki}^{(m)}}, \end{aligned}$$
(17)

where \(\epsilon \) is the learning rate, chosen appropriately for the problem under consideration. The index t denotes the training epoch, i.e. one complete pass of the whole training dataset through the parameter update. The gradients required for the weight update in Eq. 17 are computed using the chain rule.
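As an example, the kind of FNN used later as input-output mapping can be written in a few lines of PyTorch; the sizes below (50 inputs, one hidden layer of 20 neurons, 10 outputs) mirror the CIFAR-10 configuration reported in Section 4, while the loss and optimizer are merely illustrative realizations of Eqs. 16–17:

```python
import torch
import torch.nn as nn

# illustrative sizes: 50 reduced features in, one hidden layer of 20 neurons, 10 classes out
fnn = nn.Sequential(
    nn.Linear(50, 20),
    nn.Softplus(),
    nn.Linear(20, 10),
)

loss_fn = nn.CrossEntropyLoss()                          # one possible choice of L in Eq. 16
optimizer = torch.optim.SGD(fnn.parameters(), lr=1e-3)   # epsilon of Eq. 17

def train(loader, epochs):
    for t in range(epochs):                              # epoch index t of Eq. 17
        for x_tilde, h in loader:                        # reduced features and target labels
            optimizer.zero_grad()
            loss = loss_fn(fnn(x_tilde), h)
            loss.backward()                              # backpropagation via the chain rule
            optimizer.step()                             # weight update of Eq. 17
```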

3 The reduced artificial neural networks

In this section, we provide a rigorous description of the proposed framework, summarized in Figs. 1 and 3. The primary objective of our framework is to reduce, in terms of dimensionality, a generic Artificial Neural Network (ANN). The only assumption we make about the original network is that it consists of L layers.

Network splitting

At the beginning, the original network, denoted as \({\mathcal {ANN}}: {\mathbb {R}}^{n_0} \rightarrow {\mathbb {R}}^{n_L}\), is split into two distinct parts: the first l layers constitute the pre-model, while the last \(L-l\) layers form the so-called post-model. By describing the network as a composition of functions \({\mathcal {ANN}}\equiv f_L \circ f_{L-1} \circ \dots \circ f_1\), we can formally define the pre- and post-model as follows:

$$\begin{aligned} {\mathcal {ANN}}_{\text {pre}}^l&= f_l \circ f_{l-1} \circ \dots \circ f_1,\nonumber \\ {\mathcal {ANN}}_{\text {post}}^l&= f_L \circ f_{L-1} \circ \dots \circ f_{l+1}, \end{aligned}$$
(18)

where each function \(f_j: {\mathbb {R}}^{n_{j-1}} \rightarrow {\mathbb {R}}^{n_j}\), for \(j=1,\dots ,L\), represents one layer of the network, e.g. a convolutional, fully connected, batch-normalization, ReLU, or pooling layer. The original model can then be rewritten as:

$$\begin{aligned} {\mathcal {ANN}}({\textbf{x}}^0) = {\mathcal {ANN}}^l_{\text {post}}({\mathcal {ANN}}^l_{\text {pre}}({\textbf{x}}^0)), \end{aligned}$$
(19)

for any \(1\le l < L\) and \({\textbf{x}}^0 \in {\mathbb {R}}^{n_0}\).

As described in [35], the reduction of the network is achieved by approximating the post-model, meaning that the pre-model is copied unchanged from the original network into the reduced one. Before proceeding with the algorithmic explanation of how the post-model is approximated, we specify that the index l, denoting the cut-off layer, is the only parameter of this initial step, and it plays an important role in the final outcome. This index determines how many layers of the original network are retained in the reduced architecture, controlling, in short, how much information from the original network is discarded. As described in [35], it is chosen empirically based on considerations about the network and the dataset at hand, balancing final accuracy against compression ratio.

Algorithm 1 Pseudo-code for the construction of the reduced Artificial Neural Network

Dimensionality reduction

As mentioned earlier, our goal is to project the output \({\textbf{x}}^{(l)}\) of the pre-model onto a lower-dimensional space using one of the following reduction techniques:

  • Active Subspaces: as described in Section 2.1.1 and in [35], we consider a function \(g_l\) defined by:

    $$\begin{aligned} g_l({\textbf{x}}^{(l)}) = \text {loss} ({\mathcal {ANN}}^l_{\text {post}}({\textbf{x}}^{(l)})), \end{aligned}$$
    (20)

    in order to extract the most important directions and determine the projection matrix used to reduce the pre-model output.

  • Proper Orthogonal Decomposition: as discussed in Section 2.1.2, the SVD decomposition (6) is exploited to compute the projection matrix \(\varvec{\Psi }_r\) and subsequently obtain the reduced solution

    $$\begin{aligned} {\textbf{z}}= \varvec{\Psi }^T_r{\textbf{x}}^{(l)}. \end{aligned}$$
    (21)

It is important to emphasize that, in order to apply these methodologies to the pre-model output, \({\textbf{x}}^{(l)}\) must first be flattened: both approaches are based on flat-view matrix models, requiring the transformation of \({\textbf{x}}^{(l)}\) from a tensorial structure to a two-dimensional one.
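For instance, if the pre-model output is a batch of convolutional features of shape (batch, channels, height, width) and \(\varvec{\Psi }_r\) has been computed as in Section 2.1.2, the flattening and projection could read as follows (a sketch in PyTorch, with variable names of our choosing):

```python
# x_l: pre-model output of shape (batch, C, H, W); Psi_r: (C*H*W, r) projection matrix
x_flat = x_l.reshape(x_l.shape[0], -1)   # flatten each feature tensor into a row vector
z = x_flat @ Psi_r                       # reduced features of shape (batch, r), cf. Eq. 21
```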

Input-Output mapping

The final part of the reduced neural network is dedicated to classifying the output generated by the reduction layer. Two different techniques have been employed for this purpose:

  • the Polynomial Chaos Expansion, as introduced in Section 2.2.1. According to Eq. 9, the final output of the network, denoted as \({\textbf{y}}={\mathcal {ANN}}({\textbf{x}}^0)\in {\mathbb {R}}^{n_L}\), which represents the true response of the model, can be approximated as follows:

    $$\begin{aligned} \hat{{\textbf{y}}}\approx \sum _{|{\varvec{\alpha }}|=0}^{p}{\textbf{c}}_{{\varvec{\alpha }}}{\varvec{\phi }}_{{\varvec{\alpha }}}({\textbf{z}}), \qquad |{\varvec{\alpha }}|=\alpha _1+\dots +\alpha _r, \end{aligned}$$
    (22)

    where \({\varvec{\phi }}_{{\varvec{\alpha }}}({\textbf{z}})\) are the multivariate polynomial functions chosen based on the probability density function \(\rho \) associated with \({\textbf{z}}\). Therefore, the estimation of coefficients \({\textbf{c}}_{\alpha }\) is carried out by solving the minimization problem (11):

    $$\begin{aligned} \min _{c_{\alpha }}\frac{1}{N_{\text {train}}}\sum _{j=1}^{N_{\text {train}}}\left\Vert {\textbf{y}}^j-\sum _{|{\varvec{\alpha }}|=0}^{p}{\textbf{c}}_{{\varvec{\alpha }}}{\varvec{\phi }}_{{\varvec{\alpha }}}({\textbf{z}}^j)\right\Vert ^2. \end{aligned}$$
    (23)
  • a Feedforward Neural Network, as described in Section 2.2.2. In this case, the output \({\textbf{z}}\) of the reduction layer coincides with the network input. By applying Eq. 15, we can determine the final output \(\hat{{\textbf{y}}}\) of the reduced net, which is given by:

    $$\begin{aligned} \hat{y}_j&= \sum _{i=1}^{n_1}w_{ji}^{(2)}z^{(1)}_i = \sum _{i=1}^{n_1}w_{ji}^{(2)} \sigma \left( \sum _{m=1}^{r} w_{im}^{(1)} z_{m}\right) , \qquad j = 1,\dots ,n_{\text {out}}, \end{aligned}$$
    (24)

    where \(n_{\text {out}}\) corresponds to the number of categories that compose the dataset under consideration, and \(\sigma \) is the Softplus function:

    $$\begin{aligned} \text {Softplus}({\textbf{x}}) = \frac{1}{\beta }\log (1+\exp (\beta {\textbf{x}})). \end{aligned}$$
    (25)
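Putting the pieces together, a possible PyTorch realization of the reduced architecture (pre-model, reduction layer, input-output mapping) could look like the following sketch; the class name and constructor arguments are ours, and the projection matrix is assumed to have been computed beforehand with AS or POD:

```python
import torch
import torch.nn as nn

class ReducedANN(nn.Module):
    def __init__(self, layers, cutoff, proj, fnn):
        super().__init__()
        # pre-model: the first l layers of the original network (Eq. 18), copied as they are;
        # this assumes the original model can be expressed as a flat list of layers
        self.premodel = nn.Sequential(*layers[:cutoff])
        # projection matrix (V_1 for AS, Psi_r for POD), stored as a non-trainable buffer
        self.register_buffer("proj", proj)               # shape (n_flat, r)
        self.fnn = fnn                                   # input-output mapping of Eq. 24

    def forward(self, x):
        x = self.premodel(x)
        x = x.reshape(x.shape[0], -1)                    # flatten the convolutional features
        z = x @ self.proj                                # reduction layer (Eq. 21)
        return self.fnn(z)
```

In this sketch the projection matrix is kept fixed during the subsequent re-training, so that only the pre-model and FNN weights are updated; other design choices are of course possible.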

3.1 Training phase

Once the reduced version of the network is constructed, we need to train it. Following [35], for the training phase of the reduced ANN, we employ the technique of knowledge distillation [31]. A knowledge distillation framework involves a large pre-trained teacher model, which is our full network, and a small student model, in our case \({\mathcal {ANN}}^{\text {red}}\). Therefore, the main goal is to efficiently train the student network under the guidance of the teacher network to achieve comparable or even superior performance.

Let \({\textbf{y}}\) be a vector of logits, which refers to the output of the last layer in a deep neural network. The probability \(p_i\) that the input belongs to the i-th class is determined by the softmax function:

$$\begin{aligned} p_i = \frac{\exp (y_i)}{\sum _{j=1}^{n_{\text {class}}} \exp (y_j)}. \end{aligned}$$
(26)

As described in [31], a temperature factor T needs to be introduced in order to control the importance of each target:

$$\begin{aligned} p_i = \frac{\exp (y_i/T)}{\sum _{j=1}^{n_{\text {class}}} \exp (y_j/T)}, \end{aligned}$$
(27)

where, as \(T\rightarrow \infty \), all classes tend to the same probability, whereas, as \(T\rightarrow 0\), the targets \(p_i\) become one-hot labels.

First, we define the distillation loss, which matches the logits of the teacher and the student models, as mentioned in [35]. The knowledge transfer from the teacher to the student is accomplished by mimicking the final prediction of the full net, using response-based knowledge. Therefore, in this case, the distillation loss [31, 32] is given by:

$$\begin{aligned} L_D(p({\textbf{y}}_t, T), p({\textbf{y}}_s, T)) = \mathcal {L}_{\text {KL}}(p({\textbf{y}}_t,T), p({\textbf{y}}_s, T)), \end{aligned}$$
(28)

where \({\textbf{y}}_t\) and \({\textbf{y}}_s\) indicate the logits of the teacher and student networks, respectively, while \(\mathcal {L}_{\text {KL}}\) represents the Kullback-Leibler (KL) divergence loss [63]:

$$\begin{aligned} \mathcal {L}_{\text {KL}}(p({\textbf{y}}_t,T), p({\textbf{y}}_s, T)) = T^2 \sum _j p_j(y_{t,j}, T)\log \frac{p_j(y_{t,j}, T)}{p_j(y_{s,j}, T)}. \end{aligned}$$
(29)

The student loss is then defined as the cross-entropy loss between the ground truth label and the logits of the student network [32]:

$$\begin{aligned} L_S({\textbf{y}}, p({\textbf{y}}_s,T)) = \mathcal {L}_{\text {CE}}(\hat{{\textbf{y}}}, p({\textbf{y}}_s,T)), \end{aligned}$$
(30)

where \(\hat{{\textbf{y}}}\) is the ground-truth one-hot vector, whose only non-zero component (equal to 1) corresponds to the true label of the training sample. \(\mathcal {L}_{\text {CE}}\) denotes the cross-entropy loss, defined as:

$$\begin{aligned} \mathcal {L}_{\text {CE}} (\hat{{\textbf{y}}}, p({\textbf{y}}_s,T))=\sum _i -\hat{y}_i \log (p_i(y_{s,i}, T)). \end{aligned}$$
(31)

As can be observed, both losses, Eqs. 28 and 30, use the same logits of the student model but with different temperatures. In the distillation loss, the temperature T is set to a value greater than 1 (\(T=\tau >1\)), while in the student loss the temperature is set to 1 (\(T=1\)). The overall loss is then calculated as a weighted sum of the distillation loss and the student loss:

$$\begin{aligned} L({\textbf{x}}^0, W)&= \lambda L_D(p({\textbf{y}}_t, T=\tau ), p({\textbf{y}}_s, T=\tau )) \nonumber \\&\quad + (1-\lambda ) L_S(\hat{{\textbf{y}}}, p({\textbf{y}}_s , T=1)), \end{aligned}$$
(32)

where \(\lambda \) is the regularization parameter, \({\textbf{x}}^0\) represents an input vector from the training set, and W denotes the parameters of the student model.
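For reference, a minimal PyTorch sketch of the combined loss of Eq. 32, with the \(T^2\) factor of Eq. 29 made explicit (function and argument names are ours):

```python
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, labels, T, lam):
    # distillation term (Eqs. 28-29): KL divergence between the softened predictions
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    L_D = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
    # student term (Eqs. 30-31): cross entropy with the ground-truth labels at T = 1
    L_S = F.cross_entropy(student_logits, labels)
    # weighted combination of Eq. 32
    return lam * L_D + (1.0 - lam) * L_S
```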

4 Numerical results

In this section, we present a comparison between the results obtained using different reduction methods in terms of final accuracy, memory allocation, and procedure speed.

4.1 Neural network architectures

As test networks we used Convolutional Neural Networks (CNNs), a type of ANN commonly applied to image recognition problems [64, 65]. In the past decade, several CNN architectures have been introduced [11, 61] to address this problem, such as AlexNet, ResNet, Inception, and VGGNet.

As a starting point for testing our methods, we employed one of the VGG network architectures, specifically VGG-16 [66]. As shown in Fig. 4, this architecture consists of the following components:

  • 13 convolutional blocks. Each block includes a convolutional layer followed by a non-linear layer, where ReLU is used as the activation function.

  • 5 max-pooling layers.

  • 3 fully-connected layers.

Fig. 4 Graphical representation of VGG-16 architecture

The network is called VGG-16 because it comprises a total of 16 layers with tunable parameters: 13 convolutional layers and 3 fully connected layers.

For comparison, we also tested our methodology on ResNet [67], in particular on ResNet-110, as done in [35]. As the name suggests, ResNet-110 comprises a total of 110 layers, divided into 3 groups, each containing 18 basic residual blocks. We recall that these blocks consist of two convolutional layers, followed by batch normalization, and a skip/shortcut connection that adds the input to the output of the block.

4.2 Dataset

For training and testing our networks we used:

  • CIFAR-10 dataset [68], a computer-vision dataset used for object recognition. It comprises 60000 color images of size \(32\times 32\), which are divided into 10 non-overlapping classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

  • Custom dataset, composed of 3448 color images of size \(32\times 32\), organized in 4 classes: 3 non-overlapping classes and a mixed one, characterized by pictures with objects of different categories present at the same time.

  • CIFAR-100 dataset [68], another benchmark computer-vision dataset for object recognition. It consists of 60000 color images of size \(32\times 32\), divided into 100 classes, with each class containing 600 images.

4.3 Software and hardware configuration

To implement and construct the reduced version of the convolutional neural networks described in the previous sections, we utilized PyTorch [69] as our development environment. We also employed the open-source Python library SciPy [70] for scientific computing and the open-source Python package ATHENA [46] for the actual computation of the active subspaces.

Regarding the hardware configuration, we ran all experiments involving VGG-16, except for the CIFAR-100 dataset, on the CPU. All other tests were performed using an NVIDIA GPU. This decision was influenced by the availability of hardware resources during the development and testing phases for the selected architectures.

4.4 Results VGG-16

We now present the results of the reduced networks constructed starting from VGG-16 and trained on CIFAR-10, CIFAR-100, and our custom dataset. First of all, the original VGG-16 network was trained on each of the datasets presented above. A training phase of only 60 epochs was needed for CIFAR-10 and the custom case, whereas a longer training of 300 epochs was required for CIFAR-100. From Tables 2 and 3, it can be seen that at the end of these learning processes VGG-16 reaches good accuracy: \(77.98\%\) on CIFAR-10 and \(95.65\%\) on the custom dataset. Table 4 instead provides the accuracy achieved in the CIFAR-100 case, presenting the Top-1 and Top-5 scores, as done in [35]. It can be observed that the increase in the number of classes results in a lower Top-1 value, as well as the need for longer training.

We report the results obtained with different reduced versions of VGG-16, constructed following the steps of Algorithm 1 and using several cut-off layers l, as reported in [35]: 5, 6, and 7 for CIFAR-10 and the custom case, and 7, 8, and 9 for CIFAR-100. We remark that, in the case of dimensionality reduction using the Active Subspaces technique, we employed the Frequent Directions method [71] implemented within ATHENA to compute the AS. We set the parameter r, representing the dimension of the reduced space, to 50 for both AS and POD, in accordance with [35], where considerations on the structural analysis of VGG-16 can be found.

Table 1 Results obtained for the reduced net POD+FNN (7) trained on CIFAR-10 with different structures for the FNN
Table 2 Results obtained with CIFAR-10 dataset
Table 3 Results obtained with a custom dataset
Table 4 Results obtained with CIFAR-100 dataset

When an FNN was employed to classify the images, we trained it for 500 epochs with the dataset at hand before re-training the entire reduced net. In Table 1, we summarize the results obtained by training a reduced net with various FNN architectures, i.e. different numbers of hidden layers and a constant number of neurons within each hidden layer. Specifically, we compare the storage requirements of the FNN with the accuracy of the reduced network POD+FNN under consideration at epoch 0, i.e. after its initialization, and at epoch 10, i.e. after the re-training of the whole reduced net. From the results, it can be observed that increasing the number of hidden layers and hidden neurons does not result in improved accuracy. Based on accuracy and memory allocation considerations (refer to Table 1 for details), we opted for the following architectures:

  • CIFAR-10: FNN with 50 input neurons, 10 output neurons, and one hidden layer with 20 hidden neurons.

  • Custom Dataset: FNN with 50 input neurons, 4 output neurons, and one hidden layer with 10 hidden neurons.

  • CIFAR-100: FNN with 50 input neurons, 100 output neurons, and one hidden layer with 70 hidden neurons.

Table 5 Results obtained for POD+FNN(7) without using a pre-trained original network

After completing these steps, the reduced neural network was re-trained for 10 epochs on CIFAR-10 and the custom dataset, and for 20 epochs on the CIFAR-100 dataset. The outcomes of this training process are summarized in Tables 2, 3, and 4, presenting a comparison among the various reduced neural networks in terms of accuracy (both before and after the final training, or using Top-1 and Top-5 scores), memory storage requirements, and the time needed for the initialization and training of each reduced network. As mentioned earlier, we provide results for each reduced network, namely AS+PCE, AS+FNN, and POD+FNN, using three different cut-off layers: 5, 6, and 7, or 7, 8, and 9, depending on the case.

In our context, which specifically involves working with a custom dataset, understanding memory allocation is crucial, because we aim to include a CNN in an embedded system with specific storage constraints. Tables 2, 3 and 4 demonstrate that the memory allocation required by the created reduced nets is smaller than that of the original VGG-16. For instance, the checkpoint file needed to store the full net occupies approximately 56 MB, whereas that of its reduced versions is less than 10 MB in most cases. It is also important to note that, for CIFAR-100, opting for higher cut-off values results in a larger storage requirement due to the increased pre-model size. This emphasizes the significant role the cut-off index plays in the final model compression. Additionally, it is worth mentioning that replacing PCE with an FNN leads to a substantial memory saving of two orders of magnitude: \(10^{-4}\) as opposed to \(10^{-2}\).

Table 6 Results obtained with CIFAR-10 dataset
Table 7 Results obtained with the custom dataset

Table 2 shows that, in the case of POD+FNN, the net does not require additional training with the entire dataset: after the initialization (epoch 0), the network's accuracy is already acceptable and, for index 7, already high. Additionally, we observe that all the proposed reduced nets require less time to achieve well-performing models. This is reasonable, since the compression in size is directly related to the decrease in the number of CNN parameters. However, while this holds true for CIFAR-10 and the custom dataset, the increased number of classes, and thus complexity, in CIFAR-100 necessitates a longer training time.

Nevertheless, an interesting aspect of this reduction methodology is that a pre-trained starting model is not strictly necessary to obtain an exploitable net, as summarized in Table 5, where we provide the results obtained for the proposed reduced net POD+FNN(7) constructed without starting from a pre-trained VGG-16. With all datasets, POD+FNN achieves a level of accuracy comparable to the previous cases where a pre-trained VGG-16 was employed. For CIFAR-10 and the custom dataset we used the same number of epochs as in the pre-trained case, whereas CIFAR-100 required twice as many epochs to reach the same level of accuracy. The immediate consequence is the saving of the time needed to obtain a performing network, which amounts to approximately 5 hours. These considerations remain valid also for the custom dataset under consideration; Table 3 also shows that, after initialization, POD+FNN already achieves a greater accuracy than VGG-16 for all choices of l.

In all cases, it can be observed that the proposed reduced CNN achieves a similar, if not higher, accuracy compared to the original VGG-16, while occupying significantly less storage. Moreover, increasing the cut-off layer index l results in improved accuracy, since more original features are retained; however, this also leads to a smaller compression ratio. Consequently, as previously mentioned, determining the appropriate value of l requires striking a trade-off between the desired levels of accuracy and reduction, considering also the specific field of application.

Table 8 Results obtained with CIFAR-100 dataset

4.5 Results ResNet-110

After obtaining interesting results with VGGNet, we proceeded to test our reduction methodology on ResNet-110, following the approach described in [35]. Initially, the network was trained on each dataset for 60 epochs, achieving a good level of accuracy, as reported in Tables 6, 7, and 8. Similarly to the VGG-16 case, we provide the Top-1 and Top-5 accuracy scores for CIFAR-100.

Also in this setting, we have performed multiple experiments to determine the FNN architecture. In analogy with the approach outlined in Table 1 for reducing VGG-16, we used the same FNN structures described previously for VGG-16 across the different cases. Furthermore, for ResNet, the chosen reduced dimension r is also set to 50, based on the eigenvalue analysis presented in [35]. Numerous tests confirmed that this choice of r was optimal, as increasing its value did not yield improved results.

Once we finalized the compression and input–output mapping techniques, we proceeded to construct the reduced versions of the original model. Algorithm 1 describes the entire procedure, whose last step corresponds to the training phase, in which we re-trained the proposed networks on the aforementioned datasets: 10 epochs for CIFAR-10 and the custom dataset, and 20 epochs for CIFAR-100. Tables 6, 7 and 8 report the outcomes obtained with the described experimental setup, comparing them in terms of achieved accuracy, memory footprint, and time required for the initialization and learning processes. Similarly to Section 4.4, we report the results for each proposed reduced net using three different cut-off values: 31, 33, 35 for CIFAR-10 and the custom dataset, and 37, 39, 43 for CIFAR-100. By combining the reduction and input–output mapping methods, we constructed the compressed models AS+PCE, AS+FNN, and POD+FNN, whose performances we now analyze.

In terms of memory allocation, it is worth noting that each of the aforementioned reduced nets requires less than 3 MB of space, resulting in a reduction of approximately \(60\%\) in the memory footprint. Furthermore, the introduction of an FNN in the final part of the method leads to a storage decrease of one order of magnitude.

In all cases, we can observe that the reduced networks achieve a level of accuracy comparable to the original ResNet-110. The advantage of constructing lightweight architectures is that they result in faster models in most situations. Specifically, we want to emphasize the POD+FNN net, since it consistently outperforms the other reduced networks in terms of achieved accuracy, storage requirements, and initialization and training times. Regarding the initialization process, we observe that POD requires less time than AS, saving approximately one hour. Furthermore, the training duration is similar to that of AS in the case of CIFAR-100 and the custom dataset, while it is faster for CIFAR-10.

In conclusion, based on the aforementioned considerations, we can deduce that the results obtained with ResNet-110 are generally in line with those previously achieved with VGG-16. The proposed reduced methodology enables the creation of lightweight versions of ResNet-110 that are equally accurate to the original model but have fewer parameters, making them more manageable to train.

5 Conclusions and perspectives

In this paper, we propose a generic framework for compressing neural networks, specifically Convolutional Neural Networks, with the objective of reducing the number of layers in the network while minimizing the error in the final prediction. This reduction is achieved by replacing a finite set of network layers with a response surface, which also involves dimensionality reduction techniques in order to operate on a low-dimensional space. We analyze various dimensionality reduction methods and investigate how their combination with different input-output mappings impacts the final accuracy.

The primary goal of creating this reduced network is to compress existing deep neural network architectures to be included in embedded systems with memory and space constraints. The numerical experiments conducted on two different CNNs, namely VGG-16 and ResNet-110, demonstrate that the proposed techniques can produce a compressed version of an existing network by reducing the number of layers and parameters. This reduction in size results in memory savings while maintaining a comparable level of accuracy to the original CNN. In comparison to VGG-16, the original ResNet-110 requires less storage space, approximately 7 MB, making it already suitable for many applications in vision-embedded systems. However, the use of smaller devices or specific requirements may necessitate a compressed and faster version of the network. Additionally, the results reveal that the combination of POD with FNN generally leads to reduced training time, making the proposed framework superior to the method presented in [35].

A potential drawback of this technique is the requirement to begin with a pre-trained network in order to reduce it. However, our experiments have shown that this starting point is not necessary to reach good accuracy with the proposed reduced architecture. Despite the saved space and memory, the actual bottleneck in many problems lies in the learning procedure. In such cases, our framework could be extended to reduce the architecture dimension during training, rather than only after its completion, potentially resulting in a significant speedup of the optimization step.

In conclusion, the conducted experiments illustrate the consistency of our proposed methodology when applied to different CNNs and datasets. While we cannot claim that this reduction framework can be universally applied to all existing types of ANNs, it has proven effective in compressing CNNs for image recognition tasks.