1 Introduction

Convolutional Neural Networks (CNNs) have delivered state-of-the-art results across a diverse range of machine learning tasks, namely object detection [1, 2], image recognition [3, 4], and web search [5]. However, these outstanding outcomes come at the cost of cumbersome architectures that are memory-inefficient and computationally demanding, particularly for mobile and embedded devices, and that incur significant inference costs. Hence, model compression has attracted considerable attention among researchers as a way to reduce the size of CNN architectures. CNNs are by nature computationally intensive, and both their memory footprint and their floating-point operations (FLOPs) have increased dramatically [6]. This growth stems from trainable parameter counts in the millions combined with the cost of convolution operations.

Beyond the parameter count, runtime memory also plays a crucial role: even for a single image, the activations of a CNN can consume more memory during inference than the stored parameters themselves. More powerful Graphics Processing Units (GPUs) are often presented as the solution to this challenge, but they are not affordable for many deployments. Thus, to mitigate the high resource requirements of CNNs, numerous approaches have been proposed for compressing a diverse range of models without evident accuracy loss. Network pruning is a prevailing compression technique among researchers; it lowers computation costs by reducing the number of feature maps. Early pruning approaches, such as those based on second-order derivatives [7] and optimal brain damage [8], were applied only to fully connected networks. Their obvious drawback is that parameter pruning offers no substantial reduction in computation time, since the eliminated parameters come mostly from fully connected layers.

Considerable research has focused on compressing existing CNNs rather than designing efficient CNN models from scratch. To alleviate the conflict between CNNs and limited resources, the literature offers a variety of approaches for compressing and accelerating a diverse range of models with no noticeable accuracy loss. For instance, quantization-based approaches have been suggested to make CNNs more suitable for resource-constrained devices [9, 10]. These approaches share a common issue of reduced accuracy, although it can be addressed in several ways, for instance by quantizing only the parameters rather than the activations, by increasing the size of the network, or by fine-tuning. The most widely used compression approach among researchers, however, is pruning, which falls into two subclasses: filter pruning [11,12,13] and weight pruning [14,15,16]. Weight pruning eliminates individual parameters inside a filter, which produces unstructured sparsity and impacts the efficiency of the network. Recently, inspiring results have been obtained with Bayesian techniques that employ weight pruning [17,18,19]. This class of pruning is still not particularly effective at reducing computational load and memory footprint, and it additionally requires dedicated Basic Linear Algebra Subprograms (BLAS) libraries. A noteworthy amount of research also exists on filter-level pruning [20,21,22]. This class of pruning requires no additional hardware support and yields a structured model. Filter pruning has the advantage of eliminating redundant filters and reducing model size without harming the structure of the model.

Alternatively, channel pruning from a Bayesian perspective, together with reduced bit precision for the weights, is associated with higher accuracy, since Bayesian approaches pursue the optimal model structure. In this study, we develop a Bayesian filter-level pruning scheme. Bayesian methods are applied to CNNs to evaluate uncertainty in their predictions and to regularize their training. Under this scheme, the network captures uncertainty by representing its parameters as probability distributions, which provides a regularization effect that helps avoid overfitting.

Further, we propose a network compression process based on Bayes by Backprop [23], which approximates the intractable true posterior probability distribution \(p(w\mid \mathcal{D})\) with a variational probability distribution \({q}_{\theta }(w\mid \mathcal{D})\). This variational distribution takes the form of a Gaussian, denoted \(\mathcal{N}\left(\theta \mid \mu ,{\sigma }^{2}\right)\) with \(\mu \in {\mathbb{R}}^{d}\) and \(\sigma \in {\mathbb{R}}^{d}\), where d is the number of parameters defining the probability distribution; the variance \({\sigma }^{2}\) determines the form of the Gaussian variational posterior. To prune the network, we use the L1 and capped L1 norms to control the trade-off between regularization and filter selection. In almost every layer, filters with a small L1-norm are selected and set to zero. This arrangement facilitates channel-level pruning in the subsequent step, and overall performance is barely affected by the parameter regularization. Pruning redundant channels may degrade performance momentarily, but this can be recovered by fine-tuning the pruned network. The result is a slimmer network with a compact structure in terms of model size, runtime memory, and computational cost compared to other techniques. The proposed pruning approach is iterative, as demonstrated in Fig. 1, and a high-level sketch of the loop is given below. We evaluate our proposed method on prevailing CNN architectures using several standard datasets.
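A minimal pseudocode sketch of the iterative pipeline of Fig. 1 is shown below. The helper callbacks (training with the sparsity regularizer, filter zeroing, fine-tuning) are placeholders for the procedures detailed in Sect. 3.6, not an exact implementation.

```python
def compress(model, loader, train_fn, prune_fn, finetune_fn, iterations=4, prune_ratio=0.5):
    """Iterative pipeline of Fig. 1: train with sparsity regularization, prune, fine-tune."""
    for _ in range(iterations):
        train_fn(model, loader)          # training with L1 / capped-L1 penalty (Sect. 3.6.2)
        prune_fn(model, prune_ratio)     # zero filters with the smallest L1-norm (Sect. 3.6.1)
        finetune_fn(model, loader)       # recover accuracy lost to pruning
    return model
```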

Fig. 1

Structure of iterative algorithm

2 Related Work

Using CNNs on embedded devices is a modern trend with a strong influence on mobile computing applications. However, deploying CNNs in numerous applications has brought a striking increase in computational cost and model size, as architectural advances have made CNNs ever wider and deeper. The steadily growing parameter count greatly limits the applicability of these models on embedded devices. At the same time, substantial redundancy in the parameterization is a widely recognized property of such networks [24]. This redundancy and over-parameterization inflate both memory requirements and computational costs. For instance, VGG-16 [25] requires almost 30 billion floating-point operations (FLOPs), 138 million parameters, and about 500 MB of storage. This poses a substantial problem and restricts many CNN applications.

The Bayesian counterparts of neural networks are termed Bayesian neural networks (BNNs), in which the values of the network parameters are represented by probability distributions. BNN models have several advantages over non-Bayesian neural networks: they allow prior knowledge to be incorporated, offer robustness to overfitting, and lend themselves to simple continual learning [26]. Research on Bayesian networks remains relatively limited [27]. The authors of [28] presented a variational Bayes (VB) deterministic approximation for the moments of neural network activations, together with a straightforward empirical Bayes hyper-parameter update, and achieved robust and efficient results by combining these two ingredients. In another work [29], ReLU nonlinearities are decomposed into the product of an identity and a Heaviside step function, and a further path is introduced for separating neural network expectations from variances. Their methodology assigns distinct latent binary variables to the activations, which makes the network likelihood behave as a chain of linear operations. This formulation is more robust than Monte Carlo sampling approaches [30] because it allows a sampling-free computation of the evidence lower bound.

Recently, considerable advancement in devices with limited power resources has created outstanding opportunities for researchers to address the challenges of deploying deep learning systems on mobile devices with limited resources such as memory [31]. Meeting these objectives requires reducing computational cost and memory requirements, which broadens the applicability of deep learning models to embedded systems, real-time applications, and mobile devices. Although several approaches for compressing CNNs have been introduced recently [32, 33], pruning has emerged as a popular solution; it eliminates redundant weights from the initial network. Pruning techniques were conceived as early as the 1980s–1990s and can still be applied to deep learning networks [32]. The pioneers of channel pruning [7, 8] demonstrated that redundant weights can be eliminated from a trained network with negligible loss in accuracy. Subsequently, [14, 34] showed that weights with small magnitude carry little information and can be pruned. These approaches, however, are unstructured and retain the format of the weight matrix, which restrains any acceleration effect unless the Compressed Sparse Column (CSC) format is adopted.

The work in [35] showed that the retraining phase can be replaced by a random re-initialization. It also replaced fully connected layers with sparsely connected layers, using an initial topology based on an Erdős–Rényi random graph. During training, a fraction of the smallest weights is iteratively discarded and replaced with new random weights, and the initial topology is used to find a sparse architecture before the training step. The disadvantage lies in the random iterative re-initialization, since all of these steps are quite expensive. This approach also causes irregular memory access and poor cache locality, which severely limits practical acceleration [36].

Subsequently, [37] observed that for deeper architectures, pruned networks initialized from scratch fail to perform well; their solution was to reset the weights to values obtained at early epochs of training. In [38], sparsity is imposed on the model parameters, but sparse libraries are required to realize the intended gains, and the approach yields a poor compression rate in total runtime memory (TRM) and FLOPs, even though it provides a good compression rate in weight storage. In [39], the authors argued that filter-importance criteria rely on particular assumptions that do not always hold, and proposed meta-attribute-based filter pruning (MFP). They extended the existing magnitude-based pruning criterion and introduced a new criterion that considers the geometric distance between filters. Models pruned by these methodologies still suffer from redundancy, because the methods do not anticipate filter redundancy during pruning. To remove redundant feature maps, [33] proposed an approach based on the correlation between the feature maps generated by corresponding filters. This technique eliminates redundant feature maps, reducing model size and computational cost while saving many FLOPs.

Likewise, [40] presented Distinguishing Layer Pruning based on RFC (DLRFC), a novel filter criterion that employs network interpretability to construct a filter peak feedback set and then estimates redundancy from the uniformity of each filter's feedback toward the class. In this approach, filters in different layers are pruned discriminately, which avoids directly comparing filters across individual layers as conventional filter criteria do. In contrast, we show how Bayes by Backprop can be applied to different CNN models without trimming the network in half to match non-Bayesian CNN models; in this work we use two convolutional operations, one for the mean and one for the variance, which doubles the model size compared to a non-Bayesian network. We also inspect the aleatoric and epistemic uncertainties and make the network more deterministic. Further, we apply the L1-norm together with the capped L1-norm to train the model parameters, prune the unimportant filters, and then fine-tune the model to recover the accuracy lost during the pruning process.

We evaluate the proposed approach on several datasets and CNN models. For VGG16 on CIFAR-100, it prunes 59.6% of the parameters with a 46.4% reduction in FLOPs and a 0.17% accuracy loss. For VGG16 on CIFAR-10, we achieve 44.6% parameter pruning with no loss of accuracy. Further details of the proposed approach are given in the following sections.

3 Proposed Methodology

3.1 Variational Inference

Consider a function \(y=f\left(x\right)\) that maps inputs \(\left\{{x}_{1},\dots , {x}_{N}\right\}\) to their corresponding outputs \(\left\{{y}_{1},\dots , {y}_{N}\right\}\). In Bayesian inference, a prior distribution \(p(f)\) is placed over the space of functions; this distribution expresses our prior beliefs about which functions are likely to have generated our data. A likelihood \(p\left(Y|f,X\right)\) describes the process by which a function gives rise to observations. Bayes' rule is then used to obtain the posterior distribution given our dataset, \(p\left(f|X,Y\right)\).

By integrating over all possible functions \(f\), the output for a new input point \({x}^{*}\) can be predicted as

$$p\left({y}^{*}\mid {x}^{*},X,Y\right)=\int p\left({y}^{*}\mid {f}^{*}\right)p\left({f}^{*}\mid {x}^{*},X,Y\right)d{f}^{*}$$
(1)

Equation 1 is intractable because of the integral over functions. It can be approximated by introducing a finite set of random variables \(w\) and conditioning the model on them, so that the model depends on these variables alone, which act as sufficient statistics. The distribution for a new input point \({x}^{*}\) can then be written as follows

$$p\left({y}^{*}\mid {x}^{*},X,Y\right)=\iint p\left({y}^{*}\mid {f}^{*}\right)p\left({f}^{*}\mid {x}^{*},w\right)p\left(w|X,Y\right)d{f}^{*}dw$$
(2)

Nonetheless, the distribution \(p\left(w|X,Y\right)\) remains intractable. A computable variational distribution \(q(w)\) is therefore used to approximate it. The approximating distribution should be close to the posterior distribution of the original model, so we minimize the Kullback–Leibler (KL) divergence, intuitively a dissimilarity measure between two distributions, \(KL(q(w)\parallel p(w\mid X,Y))\). The approximate predictive distribution then becomes

$$q\left({y}^{*}\mid {x}^{*}\right)=\iint p\left({y}^{*}\mid {f}^{*}\right)p\left({f}^{*}\mid {x}^{*},w\right)q(w)d{f}^{*}dw$$
(3)

Minimizing this KL divergence is equivalent to maximizing the log-evidence lower bound

$$\mathcal{L}_{VI}:=\int q\left(w\right)p\left(F\mid X,w\right){\text{log}}\,p\left(Y\mid F\right)dF\,dw-KL(q(w)\Vert p(w))$$

with respect to the variational parameters that define \(q(w)\); this is the essence of variational inference. Maximizing this bound encourages the variational distribution to explain the data well through the expected log-likelihood term, while the Kullback–Leibler (KL) term between \(q(w)\) and the prior keeps it close to the prior distribution.
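For completeness, the standard identity that links the log evidence, the lower bound \(\mathcal{L}_{VI}\), and the KL divergence to the true posterior can be stated as a clarifying step:

$$\log p(Y\mid X)=\underbrace{\int q(w)\,\log \frac{p(Y,w\mid X)}{q(w)}\,dw}_{\mathcal{L}_{VI}}+KL\big(q(w)\,\Vert\, p(w\mid X,Y)\big)$$

Since the left-hand side does not depend on \(q(w)\), maximizing \(\mathcal{L}_{VI}\) is the same as minimizing the KL divergence between \(q(w)\) and the true posterior.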

3.2 Bayes by Backprop Utilization

To approximate the intractable true posterior probability distribution, we use variational inference, specifically Bayes by Backprop, to build a CNN with probability distributions over the weights of every filter. A fully Bayesian perspective on a CNN is not achieved by placing probability distributions over the weights of the convolutional layers alone; probability distributions are also required over the weights of the fully connected layers [41].

Bayes by Backprop, introduced in [23], is a practical solution to the intractability challenge: it lets a Bayesian neural network learn a posterior distribution over weights \(w\sim {q}_{\theta }(w\mid \mathcal{D})\), from which weights are sampled during backpropagation. The distribution regularizes the weights by minimizing a compression cost, also termed the variational free energy. Since the true posterior is intractable, an approximate distribution \({q}_{\theta }(w\mid \mathcal{D})\) is fitted to mimic the true posterior \(p(w\mid \mathcal{D})\), with the discrepancy measured by the KL-divergence [42]. The optimal parameters \({\theta }^{\text{opt}}\) are therefore determined as follows:

$$\begin{array}{rl}{\theta }^{\text{opt }}=&\,\underset{\theta }{{\text{arg}}min}KL\left[{q}_{\theta }(w\mid \mathcal{D})\parallel p(w\mid \mathcal{D})\right]\\ =&\,\underset{\theta }{{\text{arg}}min}KL\left[{q}_{\theta }(w\mid \mathcal{D})\parallel p(w)\right]\\ &-{\mathbb{E}}_{q(w\mid \theta )}[logp(D\mid w)]+logp(D)\end{array}$$
(4)

where the KL-divergence is defined as

$${\text{KL}}\left[{q}_{\theta }(w\mid \mathcal{D})\parallel p(w)\right]=\int {q}_{\theta }(w\mid \mathcal{D}){\text{log}}\frac{{q}_{\theta }(w\mid \mathcal{D})}{p(w)}dw.$$
(5)

The outcome of the above derivation is an optimization problem with a cost function termed the "variational free energy" [43, 44]. This cost function consists of two terms: the former, \({\text{KL}}\left[{q}_{\theta }(w\mid \mathcal{D})\parallel p(w)\right]\), depends on the prior \(p(w)\) and is termed the complexity cost, while the latter, \({\mathbb{E}}_{q(w\mid \theta )}[{\text{log}}p(\mathcal{D}\mid w)]\), depends on the data through \(p(\mathcal{D}\mid w)\) and is termed the likelihood cost. In the optimization we can drop \({\text{log}}p(\mathcal{D})\) since it is constant. Furthermore, the KL-divergence cannot be computed exactly due to its intractable nature, so a stochastic variational scheme is used instead [23]. Weights \(w\) are sampled from the variational distribution \({q}_{\theta }(w\mid \mathcal{D})\), since drawing samples suitable for numerical methods from the variational posterior \({q}_{\theta }(w\mid \mathcal{D})\) is much easier than drawing them from the true posterior \(p(w\mid \mathcal{D})\). Hence, a tractable cost function is formulated in Eq. 6, which is minimized and optimized with respect to \(\theta \) throughout the training session.

$$\mathcal{F}(\mathcal{D},\theta )\approx \sum_{i=1}^{n} {\text{log}}{q}_{\theta }\left({w}^{(i)}\mid \mathcal{D}\right)-{\text{log}}p\left({w}^{(i)}\right)-{\text{log}}p\left(\mathcal{D}\mid {w}^{(i)}\right)$$
(6)

In the above equation, n is the number of Monte Carlo samples drawn, and \({w}^{(i)}\) denotes the i-th sample drawn from \({q}_{\theta }(w\mid \mathcal{D})\); a sketch of this estimator is given below.
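A minimal PyTorch-style sketch of how Eq. 6 can be estimated for a single Gaussian weight tensor is shown below. The parameter names (mu, rho), the softplus parameterization of the standard deviation, and the scalar prior standard deviation prior_sigma are illustrative assumptions, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def bayes_by_backprop_loss(mu, rho, prior_sigma, nll_fn, n_samples=1):
    """Monte Carlo estimate of Eq. 6: sum_i log q(w_i|D) - log p(w_i) - log p(D|w_i)."""
    loss = 0.0
    sigma = F.softplus(rho)                        # positive standard deviation
    q = Normal(mu, sigma)                          # variational posterior q_theta(w|D)
    p = Normal(torch.zeros_like(mu), prior_sigma)  # zero-mean Gaussian prior p(w)
    for _ in range(n_samples):
        w = q.rsample()                            # reparameterized sample keeps gradients
        log_q = q.log_prob(w).sum()                # log q_theta(w|D)
        log_prior = p.log_prob(w).sum()            # log p(w)
        nll = nll_fn(w)                            # -log p(D|w), e.g. cross-entropy on a batch
        loss = loss + (log_q - log_prior + nll)
    return loss / n_samples
```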

3.3 Local Reparameterization Trick

In this work, the local reparameterization trick [45] is utilized for CNNs. We sample the layer activation b instead of the weights w, which yields a computational acceleration. The variational posterior probability distribution is \(q_{\theta } \left( {w_{ijhw}\mid {\mathcal{D}}} \right) = {\mathcal{N}}\left( {\mu_{ijhw} ,\alpha_{ijhw} \mu_{ijhw}^{2} } \right)\), where for any layer i and j index the input and output channels, and h and w denote the height and width of the kernel, respectively. This form of the variational posterior within a convolutional layer permits the application of the local reparameterization trick, and the convolutional layer activation b can be written as follows:

$${b}_{j}={A}_{i}*{\mu }_{i}+{\epsilon }_{j}\odot \sqrt{{A}_{i}^{2}*\left({\alpha }_{i}\odot {\mu }_{i}^{2}\right)}$$
(7)

where \({\epsilon }_{j}\sim \mathcal{N}(0,1)\), \({A}_{i}\) is the receptive field, \(\odot \) denotes component-wise multiplication, and \(*\) denotes the convolutional operation.

We employ an estimator for which \(Cov\left[{L}_{i},{L}_{j}\right]=0\), so that the variance of the stochastic gradients scales as \(1/M\). The new estimator is made computationally efficient by sampling intermediate variables rather than sampling \(\epsilon \) directly, acting instead through \(f(\epsilon )\), the function by which \(\epsilon \) influences \({L}_{D}^{SGVB}(\phi )\). In this way a source of global noise is translated into local noise \((\epsilon \to f\left(\epsilon \right))\), and applying the local reparameterization yields an effective gradient estimator. For instance, let the input X be drawn uniformly from (−1, +1) and the output Y be drawn from a normal distribution with mean X and standard deviation \(\delta \), with \({\left(Y-X\right)}^{2}\) as the mean squared loss. Backpropagating through the random normal distribution is problematic because it requires propagating through a stochastic node; we therefore reparameterize by drawing standard normal noise, multiplying it by the standard deviation, and adding X to obtain the output, as sketched below.
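A minimal PyTorch sketch of this regression example follows; the learnable log standard deviation and the sample size are illustrative assumptions introduced only to show that gradients flow through the reparameterized sample.

```python
import torch

# Input X ~ Uniform(-1, 1); output Y ~ N(X, delta); loss (Y - X)^2.
x = torch.empty(1024, 1).uniform_(-1.0, 1.0)
log_delta = torch.zeros(1, requires_grad=True)   # learnable std (illustrative)

# Sampling Y directly from a Normal(x, delta) distribution would block gradients
# at the stochastic node. Reparameterize instead with parameter-free noise:
eps = torch.randn_like(x)                        # eps ~ N(0, 1)
y = x + torch.exp(log_delta) * eps               # Y = X + delta * eps

loss = ((y - x) ** 2).mean()                     # mean squared loss (Y - X)^2
loss.backward()                                  # gradient reaches log_delta through y
print(log_delta.grad)
```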

3.4 Utilizing Sequential Convolutional Operations

The crucial point of a CNN with probability distributions over its weights, rather than single point estimates, is that updating the variational posterior probability distribution \({q}_{\theta }(w\mid \mathcal{D})\) by backpropagation requires more than one convolutional operation, whereas filters with single point estimates need only one. This work applies the local reparameterization trick and samples the output b. Since b is a function of the mean \({\mu }_{ijhw}\) and the variance \({\alpha }_{ijhw}{\mu }_{ijhw}^{2}\), the two quantities defining the Gaussian probability distribution can be computed separately, as follows (a code sketch is given after the list):

  • The output b is treated as the output of a CNN updated by frequentist inference; the Adam optimizer [46] is used to obtain a single point estimate, which is interpreted as the mean \({\mu }_{ijhw}\).

  • The second convolutional operation introduces the variance \({\alpha }_{ijhw}{\mu }_{ijhw}^{2}\). Since the variance contains the mean \({\mu }_{ijhw}\), only the \({\alpha }_{ijhw}\) part needs to be learned in this second operation [18]. This formulation ensures that one parameter is updated per convolutional operation.
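A minimal sketch of a convolutional layer that uses two convolution operations together with the local reparameterization of Eq. 7 is shown below. The module name and the log-alpha parameterization are illustrative assumptions rather than the exact implementation of this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoOpBayesConv2d(nn.Module):
    """Convolution with a Gaussian variational posterior N(mu, alpha * mu^2) per weight."""

    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.padding = padding
        self.mu = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.1)
        self.log_alpha = nn.Parameter(torch.full_like(self.mu, -5.0))  # alpha = exp(log_alpha)

    def forward(self, x):
        # First convolution: the activation mean, A_i * mu_i.
        act_mean = F.conv2d(x, self.mu, padding=self.padding)
        # Second convolution: the activation variance, A_i^2 * (alpha ⊙ mu^2)  (Eq. 7).
        act_var = F.conv2d(x ** 2, torch.exp(self.log_alpha) * self.mu ** 2,
                           padding=self.padding)
        # Local reparameterization: sample the activation rather than the weights.
        eps = torch.randn_like(act_mean)
        return act_mean + eps * torch.sqrt(act_var + 1e-8)
```

As a usage note, a layer such as `TwoOpBayesConv2d(3, 16, 3, padding=1)` could stand in for a standard `nn.Conv2d(3, 16, 3, padding=1)` in a backbone, at the cost of the doubled parameter count discussed later.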

3.5 Predictive Uncertainty for Convolutional Neural Networks (CNNs)

To estimate the uncertainties, we model epistemic uncertainty by placing prior distributions over the weights of the model and capturing how much the weights vary given the data, and we model aleatoric uncertainty by placing a distribution over the model output.

The predictive distribution \({p}_{\mathcal{D}}\left({y}^{*}\mid {x}^{*}\right)\) is the main quantity of interest in classification tasks, where \({y}^{*}\) is the predicted class and \({x}^{*}\) is an unseen data example. For a Bayesian neural network, the predictive distribution can be written as:

$${p}_{\mathcal{D}}\left({y}^{*}\mid {x}^{*}\right)=\int {p}_{w}\left({y}^{*}\mid {x}^{*}\right){p}_{\mathcal{D}}(w)dw$$
(8)

Since most classification tasks are finite and discrete, the predictive distribution is assumed to be categorical, which gives

$$\begin{array}{c}{p}_{\mathcal{D}}\left({y}^{*}\mid {x}^{*}\right)=\int \text{Cat}\left({y}^{*}\mid {f}_{w}\left({x}^{*}\right)\right)\mathcal{N}\left(w\mid \mu ,{\sigma }^{2}\right)dw\\ =\int \prod_{c=1}^{C} f{\left({x}_{c}^{*}\mid w\right)}^{{y}_{c}^{*}}\frac{1}{\sqrt{2\pi {\sigma }^{2}}}{e}^{-\frac{(w-\mu {)}^{2}}{2{\sigma }^{2}}}dw\end{array}$$
(9)

In Eq. 9, C represents the total number of classes and \(\sum_{c}f\left({x}_{c}^{*}\mid w\right)=1\). An unbiased estimator of the expectation can be created by sampling from \({q}_{\theta }(w\mid \mathcal{D})\); this is necessary because there is no conjugacy between the Gaussian and categorical distributions. We can then write:

$$\begin{array}{c}{\mathbb{E}}_{q}\left[{p}_{\mathcal{D}}\left({y}^{*}\mid {x}^{*}\right)\right]=\int {q}_{\theta }(w\mid D){p}_{w}(y\mid x)dw\\ \approx \frac{1}{T}\sum_{t=1}^{T} {p}_{{w}_{t}}\left({y}^{*}\mid {x}^{*}\right)\end{array}$$
(10)

In the above equation, T is a pre-defined number of samples. This estimator allows us to measure uncertainty through the variance, termed the "predictive variance", denoted \({Var}_{q}\) and defined as follows:

$${Var}_{q}\left(p\left({y}^{*}\mid {x}^{*}\right)\right)={\mathbb{E}}_{q}\left({yy}^{T}\right)-{\mathbb{E}}_{q}\left[y\right]{\mathbb{E}}_{q}{\left[y\right]}^{T}$$
(11)

The aleatoric and epistemic uncertainties can then be extracted from Eq. 11 as follows:

$$ {\text{Var}}_{q} \left( {p\left( {y^{*} {\mid }x^{*} } \right)} \right) = \underbrace {{\frac{1}{T}\mathop \sum \limits_{t = 1}^{T} \;{\text{diag}}\left( {\hat{p}_{t} } \right) - \hat{p}_{t} \hat{p}_{t}^{T} }}_{{\text{aleatoric }}} + \underbrace {{\frac{1}{T}\mathop \sum \limits_{t = 1}^{T} \left( {\hat{p}_{t} - \overline{p}} \right)\left( {\hat{p}_{t} - \overline{p}} \right)^{T} }}_{{{\text{epistemic}}}} $$
(12)
$$\text{where }\overline{p}=\frac{1}{T}\sum_{t=1}^{T} {\widehat{p}}_{t}\text{ and }{\widehat{p}}_{t}={\text{Softmax}}\left({f}_{{w}_{t}}\left({x}^{*}\right)\right)$$

To assess how much a model can be improved, it is necessary to distinguish between the two types of uncertainty. Aleatoric (or statistical) uncertainty reflects the variability of the data, which might be incomplete or noisy, while epistemic uncertainty arises from the model, which might be incomplete or inaccurate. By separating these two causes of uncertainty, a modeler can detect whether the quality of the data is poor (high aleatoric uncertainty) or the quality of the model is poor (high epistemic uncertainty).
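A minimal sketch of Eq. 12 is shown below, assuming probs is a tensor of T softmax outputs for one input (shape [T, C]); it is illustrative rather than the exact evaluation code used in this work.

```python
import torch

def predictive_uncertainty(probs):
    """probs: [T, C] softmax outputs from T stochastic forward passes (Eq. 12)."""
    p_bar = probs.mean(dim=0)                                   # average prediction p_bar
    # Aleatoric term: (1/T) * sum_t [ diag(p_t) - p_t p_t^T ]
    aleatoric = (torch.diag_embed(probs) -
                 probs.unsqueeze(2) * probs.unsqueeze(1)).mean(dim=0)
    # Epistemic term: (1/T) * sum_t (p_t - p_bar)(p_t - p_bar)^T
    diff = probs - p_bar
    epistemic = (diff.unsqueeze(2) * diff.unsqueeze(1)).mean(dim=0)
    return aleatoric, epistemic   # each is a C x C matrix; the trace gives a scalar summary
```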

3.6 Model Pruning

3.6.1 Filter Pruning

The CNN is parametrized as \(\left\{{W}^{(i)}\in {\mathbb{R}}^{{M}_{i}\times {N}_{i}\times K\times K},1\le i\le L\right\}\), where \({W}^{(i)}\) is the matrix of connection weights in the i-th layer and L denotes the total number of layers. \({M}_{i}\) is the number of input channels of the i-th convolutional layer, \({N}_{i}\) is the number of output channels, and \(K\times K\) is the kernel height and width. The input feature map size is \({w}_{i}\times {h}_{i}\) and the output feature map size is \({w}_{i+1}\times {h}_{i+1}\). The i-th layer thus comprises \({M}_{i}\times {N}_{i}\) kernels, and for a \(k\times k\) kernel the convolutional layer performs \({M}_{i}\times {N}_{i}\times {k}^{2}\times {w}_{i+1}\times {h}_{i+1}\) operations. As shown in Fig. 2, when a filter is pruned its corresponding feature map is also eliminated, which saves \({M}_{i}\times {k}^{2}\times {w}_{i+1}\times {h}_{i+1}\) operations in the i-th layer. The kernels of the subsequent convolutional layer that would have been applied to the eliminated feature maps of the preceding layer are also removed, saving a further \({N}_{i}\times {k}^{2}\times {w}_{i+2}\times {h}_{i+2}\) operations in the (i + 1)-th layer. Filters are pruned according to their importance valuations at the end of each epoch; a sketch of this ranking step follows Fig. 2. In Fig. 2 the filters are shown as blue and orange horizontal bars, their importance is measured by their L1-norm, and those with smaller values are chosen to be pruned.

Fig. 2

Filter pruning process
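A minimal sketch of the end-of-epoch ranking step, selecting the filters with the smallest L1-norm in each convolutional layer, is shown below; the per-layer pruning ratio is an illustrative assumption.

```python
import torch
import torch.nn as nn

def filters_to_prune(model, ratio=0.5):
    """Return, per Conv2d layer, the indices of filters with the smallest L1-norm."""
    selected = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # L1-norm of each filter: sum |w| over input channels and the k x k kernel.
            l1 = module.weight.detach().abs().sum(dim=(1, 2, 3))
            k = int(ratio * l1.numel())
            selected[name] = torch.argsort(l1)[:k]   # smallest-norm filters first
    return selected
```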

Moreover, N denotes the number of randomly selected images given as input to the model; a larger N consumes more system memory. The value of N is the same for each considered dataset. The lowest final accuracy scores are obtained with N of 16, 32, 50, and 64, whereas the highest final accuracy scores are obtained with N of 256 and 512. The cases N = 100, 128, 150, and 200 yield average final accuracy. Consequently, a higher value of N gives better final accuracy, although excessively large values can again harm it, as shown in Fig. 3 (Fig. 3 considers only the CIFAR-10 dataset with the VGG16 model). For instance, Fig. 3 shows that the loss in accuracy is higher when the N value is \({2}^{16}\), while at \({2}^{14}\) accuracy is gained, and the accuracy changes accordingly as N varies. Thus, the value of N strongly affects model accuracy.

Fig. 3

Impact of N on VGG16 accuracy with CIFAR-10 dataset

3.6.2 Relevance of Weights of Filters

Combining the lasso, i.e., L1-regularization, with linear classification [47], we obtain Eq. 13:

$$V=\underset{W}{min} \sum_{\left({x}_{i},{y}_{i}\right)} l\left({x}_{i},{y}_{i},W\right)+\lambda |W{|}_{1}$$
(13)

where \(l\left(\bullet \right)\) denotes the square loss, \(l\left({x}_{i},{y}_{i},W\right)={\left({W}^{T}{x}_{i}-{y}_{i}\right)}^{2}\), although other loss functions are possible for feature selection. In this work, to accurately determine the significance of convolutional kernels, we fuse the L1-norm and the capped L1-norm. The capped L1-norm has an advantage over the L1-norm in that it applies no further penalty once a feature has been selected, and therefore does not keep shrinking the corresponding weights.

$${q}_{\epsilon }\left({w}_{i}\right)=min\left(\left|{w}_{i}\right|,\epsilon \right)$$
(14)

In Eq. 14, the capped L1-norm is given as an element-wise function, where \({w}_{i}\) denotes the filter weights and \(\epsilon \) is a constant. It is a more faithful approximation of feature selection than the L1-norm, since it penalizes a feature only when it is used, without otherwise scaling the magnitude of the weights. When \(\epsilon \) is sufficiently small, for instance \(\epsilon \le \underset{i}{{\text{min}}}\left|{w}_{i}\right|\), the exact number of selected features can be computed as \({q}_{\epsilon }\left(w\right)/\epsilon \). Simply put, penalizing \({q}_{\epsilon }\left(w\right)\) is a very close proxy for penalizing the number of selected features. However, the capped L1-norm is not convex, which makes it troublesome to optimize. To ease this complexity, we fuse it with an ordinary norm such as the L1-norm or L2-norm, which also controls the trade-off between filter selection and regularization through the parameters \(\mu , \lambda \ge 0\),

$$V=\underset{W}{min} \sum_{\left({x}_{i},{y}_{i}\right)} l\left({x}_{i},{y}_{i},W\right)+\lambda |W{|}_{1}+ \mu {q}_{\epsilon }\left(w\right)$$
(15)

Here W denotes the training weights and \({x}_{i},{y}_{i}\) are the training inputs and outputs. The first term of Eq. 15 is the ordinary training loss, while \({q}_{\epsilon }\left(w\right)\), the capped L1-norm applied to each filter in each layer, is a non-structured regularizer. Equation 15 thus contains two penalty terms, the ordinary L1-norm and the capped L1-norm: the standard L1-norm reduces overfitting, while the capped L1-norm selects filters. In its current form, however, the capped L1-norm selects features rather than filters, so the equation must be modified to penalize feature extraction directly. To model the total number of features extracted by a set of filters, we define a binary matrix \(F\in {\left\{\mathrm{0,1}\right\}}^{d\times T}\), whose entry \({F}_{ft}=1\) if filter \({h}_{t}\) utilizes feature \(f\). The total weight assigned to feature \(f\) across the filters is then:

$$W=\sum_{t=1}^{T} \left|{F}_{ft}{\beta }_{t}\right|$$
(16)

where \(\beta \) is the sparse linear weight vector. By modifying \({q}_{\epsilon }\left(w\right)\) to penalize the actual weight allocated to each feature, we obtain the final optimization in Eq. 17:

$$L=\underset{W}{min} \sum_{\left({x}_{i},{y}_{i}\right)} l\left({x}_{i},{y}_{i},W\right)+\lambda |W{|}_{1}+\mu \sum_{f=1}^{d} {q}_{\varepsilon }\left(\sum_{t=1}^{T} \left|{F}_{ft}{\beta }_{t}\right|\right)$$
(17)

Furthermore, if \(\epsilon \) is already sufficiently small \(\left(\epsilon \le \underset{{\text{f}}}{min} \left|\sum_{{\text{t}}=1}^{{\text{T}}} {{\text{F}}}_{{\text{ft}}}{\beta }_{{\text{t}}}\right|\right)\), then \(\mu = 1/\epsilon \) can be chosen so that the feature penalty corresponds exactly to the number of utilized features. The capped L1-norm in Eq. 17 estimates the relevance of every filter. In general, the convolutional output of a filter with a small L1-norm tends to be small compared to the other activation values, and therefore has a negligible numerical impact on the final prediction of deep CNN-based models. We prune filters across several layers of the CNN by adjusting a global pruning rate, for instance pruning 70% of the entire CNN, and letting the network determine how many filters to prune in each layer. This avoids the hurdle of tuning the pruning rate layer by layer. The L1-norm and capped L1-norm values serve as the criterion for choosing which filters to prune. Pruning may cause some loss in accuracy, but only when the pruning percentage is high enough, and this can be rectified by fine-tuning; our experiments suggest that fine-tuning the pruned network can even achieve better accuracy than the original unpruned network, while also saving training time. Algorithm 1 shows the steps of the introduced method, and a sketch of the combined regularizer follows Algorithm 1.

Algorithm 1

Pruning algorithm illustration
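A minimal sketch of the combined penalty of Eq. 15 added to the training loss is shown below; the coefficients lam and mu and the cap eps are illustrative hyper-parameters, not the exact values used in this work.

```python
import torch
import torch.nn as nn

def l1_capped_l1_penalty(model, lam=1e-4, mu=1e-4, eps=1e-2):
    """lambda * |W|_1 + mu * sum_i min(|w_i|, eps) over convolutional weights (Eq. 15)."""
    l1_term, capped_term = 0.0, 0.0
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            w = module.weight
            l1_term = l1_term + w.abs().sum()                                # ordinary L1-norm
            capped_term = capped_term + torch.clamp(w.abs(), max=eps).sum()  # capped L1 (Eq. 14)
    return lam * l1_term + mu * capped_term

# During training, the penalty is simply added to the task loss, e.g.:
#   loss = criterion(model(x), y) + l1_capped_l1_penalty(model)
```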

The proposed technique works smoothly on different architectures, namely AlexNet and VGGNet. In this work, a Bayesian convolutional network learns two quantities per weight, the mean and the variance, whereas a point estimate learns a single weight; the parameters of a Bayesian network are therefore doubled compared to a point-estimate architecture of the same design. We take the weights of all layers, apply the L1-norm and capped L1-norm to them, remove the weights whose values are zero or fall below a defined threshold, and thereby prune the model. We then fine-tune the model to compensate for the accuracy loss incurred by pruning. Since Bayesian CNNs have twice the parameters, we did not trim the network to half to make it comparable to non-Bayesian models. A sketch of rebuilding a slimmer convolutional layer from the kept filters is shown below.
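The following fragment sketches how a slimmer layer can be constructed from the filters that survive the threshold. It is illustrative only, written for a plain (non-Bayesian) Conv2d, and it omits the corresponding adjustment of the next layer's input channels.

```python
import torch
import torch.nn as nn

def slim_conv(conv, keep_idx):
    """Build a smaller Conv2d containing only the kept output filters."""
    new_conv = nn.Conv2d(conv.in_channels, len(keep_idx),
                         kernel_size=conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep_idx])       # copy surviving filters
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep_idx])
    return new_conv
```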

4 Experimental Framework

4.1 Experiment Settings

We evaluated our proposed framework on some of the most widely used networks and datasets: the LeNet-5 architecture on MNIST, the AlexNet architecture on ImageNet, and the VGG-16 architecture on CIFAR-10 and CIFAR-100. During training, data augmentation is applied: images are randomly cropped to 32 × 32 with a padding of four, and a horizontal flip is performed. The mini-batch size is set to 100 for training and 1000 for the test images. During training and fine-tuning, the initial learning rate is set to 0.1 and divided by 10 at 50% and 75% of the total 150 epochs. While training with the L1-norm and capped L1-norm, we also adjust the hyper-parameter λ, which manages the trade-off between the empirical loss and sparsity; it is picked through a grid search over 10−3, 10−4, and 10−5 on the CIFAR-10 test set, and we use λ = 10−4 and λ = 10−5 for VGGNet. All other settings are kept the same as in standard training, and a sketch of the schedule is given below. The experiments are run on an NVIDIA GTX TITAN Xp GPU with the Python framework PyTorch.
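A minimal sketch of the learning-rate schedule described above is given below; the choice of SGD, the momentum value, and the `train_one_epoch` helper are illustrative assumptions, since the text specifies only the schedule itself.

```python
import torch

# `model`, `train_loader`, and `train_one_epoch` are placeholders for the actual setup.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Divide the learning rate by 10 at 50% and 75% of the 150 training epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[75, 112], gamma=0.1)

for epoch in range(150):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
```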

4.2 Learning the Objective Function

Bayes by Backprop is utilized to learn the objective function. This framework regularizes the weights w by minimizing a compression cost, another term for the variational free energy. The intractability issue explained in the previous section leads to the tractable cost function:

$$\mathcal{F}(\mathcal{D},\theta )\approx \sum_{i=1}^{n} {\text{log}}{q}_{\theta }\left({w}^{(i)}\mid \mathcal{D}\right)-{\text{log}}p\left({w}^{(i)}\right)-{\text{log}}p\left(\mathcal{D}\mid {w}^{(i)}\right)$$
(18)

In Eq. 18, n is the number of draws. The first term of Eq. 18 involves the variational posterior \({q}_{\theta }\left({w}^{(i)}\mid \mathcal{D}\right)=\prod_{i} \mathcal{N}\left({w}_{i}\mid \mu ,{\sigma }^{2}\right)\), which takes the form of a Gaussian distribution with mean \(\mu \) and variance \({\sigma }^{2}\). Its logarithm, the log variational posterior, is given by:

$$ \log \left( {q_{\theta } \left( {w^{\left( i \right)} {\mid \mathcal{D}}} \right)} \right) = \mathop \sum \limits_{i} \;\log {\mathcal{N}}\left( {w_{i} {\mid }\mu ,\sigma^{2} } \right) $$
(19)

The second term of Eq. 18 involves the prior over the weights, stated as a product of individual Gaussians:

$$p\left({w}^{(i)}\right)=\prod \mathcal{N}\left({w}_{i}\mid 0,{\sigma }_{p}^{2}\right)$$
(20)

whose logarithm, the log prior, is:

$${\text{log}}\left(p\left({w}^{(i)}\right)\right)=\sum_{i} {\text{log}}\mathcal{N}\left({w}_{i}\mid 0,{\sigma }_{p}^{2}\right)$$
(21)

The last term of Eq. 18, \({\text{log}}p\left(\mathcal{D}\mid {w}^{(i)}\right)\), is the log-likelihood. A small sketch of the first two terms is given below.
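A minimal sketch of evaluating the log terms of Eqs. 19 and 21 for one sampled weight tensor, assuming the same (mu, sigma, prior sigma_p) parameterization as above; it is illustrative rather than the exact training code.

```python
import torch
from torch.distributions import Normal

def log_terms(w_sample, mu, sigma, sigma_p):
    """Return log q_theta(w|D) (Eq. 19) and log p(w) (Eq. 21) for one weight sample."""
    log_q = Normal(mu, sigma).log_prob(w_sample).sum()                          # Eq. 19
    log_prior = Normal(torch.zeros_like(mu), sigma_p).log_prob(w_sample).sum()  # Eq. 21
    return log_q, log_prior
```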

4.3 Model Pruning Results

For the compression experiments, we take all the weights of every layer of the network, apply the L1-norm and capped L1-norm, and prune every weight whose value is zero or below a determined threshold.

For Bayesian-AlexNet on the ImageNet dataset, we set \(\uplambda =5\times {10}^{-6}\) and \(\uplambda ={10}^{-4}\) and trained the model for 100 iterations. Our proposed method pruned 65.9% of the parameters while achieving a model accuracy of 51.33% and a 39.6% reduction in FLOPs. For the LeNet-5 architecture on the MNIST dataset, where 5 refers to the number of layers, the network consists of an input layer, two convolutional layers, and two fully connected (FC) layers, with 431 K parameters in total. The initial learning rate is set between 0.001 and 0.1 over the full number of iterations. Our proposed method pruned 75.9% of the parameters without losing much accuracy in the process.

For the VGG-16 model, we ran experiments on the CIFAR-10 and CIFAR-100 datasets, noting that the VGGNet convolutional layers differ in robustness and information concentration [12]. For CIFAR-10, when pruning channels trained with the L1-norm and capped L1-norm, a pruning threshold must be computed over the filters. In our method, the accuracy of the model improves slightly even after pruning 50% of the parameters. Parameter pruning reached 75.8% after the fourth iteration, with a 51.3% reduction in FLOPs and a 0.01% improvement in accuracy; the results are tabulated in Table 1a. The pruning process is carried out by creating a slimmer model into which the corresponding weights are copied from the model trained with the L1-norm and capped L1-norm.

Table 1 Pruning filters for our proposed Bayesian model compared with non-Bayesian models

For the CIFAR-100 dataset on VGG-16, we used similar settings. As shown in Table 1b, the proposed method saved 35.6% of the FLOPs right after the second iteration. Here we choose a lower pruning ratio than for CIFAR-10, since CIFAR-100 contains more classes and requires more information to classify the images. Note that, in general, the original accuracy on VGG can be maintained without sampling by simply fine-tuning with a small learning rate, as done in [50]; this still induces (less) sparsity, but it does not lead to good compression, as the bit precision remains very high because the marginal variances of the weights are not increased appropriately. We compared the filter-pruning outcomes of VGG-16 on CIFAR-10 and CIFAR-100 against the baseline accuracy of the Bayesian CNN model; Fig. 4 displays the results for each pruning ratio.

Fig. 4

Pruning results of VGG-16 on CIFAR-10(L) and on CIFAR-10(W) regarding pruning ratio

Furthermore, Table 1 compares frequentist networks with our proposed Bayesian CNN. In our work, a Bayesian convolutional network learns two quantities per weight, the mean and the variance, whereas a point estimate has only a single weight to learn. This makes the overall number of parameters of a Bayesian network twice that of a non-Bayesian architecture.

For every parameter of a frequentist-inference network, Bayesian CNNs carry two parameters \((\mu ,\sigma )\). We did not trim the parameters in half to make the parameter count comparable to non-Bayesian models; even with double the number of parameters, the proposed model achieves considerable accuracy.

Thus, our proposed method combines the L1-norm and capped L1-norm, providing control over the trade-off between filter selection and regularization. Using the threshold, weights whose values fall below it are set to zero and only the non-zero weights are kept. We also examined the filter importance in every convolutional layer of the network, and the results imply that in many layers most of the filters have only a minor effect on the overall performance of the architecture.

4.4 Uncertainty Estimation

We used two small datasets, CIFAR-10 and MNIST, to compare the aleatoric and epistemic uncertainties for the Bayesian LeNet-5 trained with variational inference. On CIFAR-10 the aleatoric uncertainty is about twenty times greater than on MNIST; this is because aleatoric uncertainty captures irreducible variability and depends on the predicted values. The epistemic uncertainty on CIFAR-10 is approximately fifteen times greater than on MNIST, likely because epistemic uncertainty shrinks roughly in proportion to the validation accuracy. Table 2 compares both uncertainties for the CIFAR-10 and MNIST datasets.

Table 2 MNIST and CIFAR-10 uncertainties comparison based on the proposed method by [50]

5 Conclusion

In this work, we proposed Bayesian CNNs with Bayes by Backprop as the variational inference approach. We estimated the model's aleatoric and epistemic uncertainties, and we used a capped L1-norm combined with a regular L1-norm to observe filter importance, which measures the effectiveness of the filter weights. We showed that both uncertainties can be computed for the proposed method and how the epistemic uncertainty can be reduced with more training data. Finally, we applied channel pruning to the Bayesian CNN, which performs well and is comparable to a frequentist method.

The combined use of the L1-norm and capped L1-norm provides control over the trade-off between filter selection and regularization. We also emphasized the filter importance in each convolutional layer, with results implying that in many layers most of the filters have an inconsiderable impact on the overall performance of the architecture. Extensive experiments illustrate the benefit of our proposed approach in comparison with non-Bayesian approaches: in particular, for the VGG-16 model on the CIFAR-10 dataset, our approach prunes 75.8% of the parameters and yields a 51.3% FLOPs reduction with a slight accuracy improvement.

The use of a normal distribution as the prior for uncertainty estimation was similarly explored in [51], which showed that a standard normal prior drives the function posterior to generalize in unanticipated ways on inputs outside the training distribution; adding noise to the normal prior can therefore be fruitful for better uncertainty estimation. We did not encounter such behavior in our experiments, but it can be explored in future work. The current work can also be extended to Super Resolution (SR) models, i.e., the recovery of a High-Resolution (HR) image from a given Low-Resolution (LR) image, and to Generative Adversarial Networks [52]. Another possible improvement for future work is to use trimmed versions of the models, for instance halving the parameters of the Bayesian CNNs (which carry two parameters instead of one, unlike non-Bayesian networks) to build a custom network and improve the overall accuracy.