1 Introduction

The Internet of Things (IoT), which aims to integrate the physical world by collecting and sharing information [1, 2], has been widely used in various areas, including smart cities [3], smart transportation [1], smart homes [4], and smart agriculture [5]. Moreover, the extensive IoT applications generate a large amount of data, which creates a strong incentive to utilize data-driven deep neural networks (DNNs) to extract accurate information from it [6]. For example, large volumes of biomedical data such as medical images can be automatically recognized by convolutional neural networks (CNNs) to monitor human health [7]. CNNs have also been widely used to process image data on IoT devices such as wireless sensor cameras [6, 8] and smartphones [7]. The authors in [3] propose several CNN-based applications in typical IoT scenarios, such as recognizing garbage images for waste management and monitoring parking spaces for smart parking lot management. Statistical evidence in [3] also shows that the CNN is considered one of the most extensively applied deep learning models for various IoT applications.

However, although it is generally believed that neural networks need to be complex enough to represent real-world target objects [9, 10], deep convolutional neural networks, which usually incur huge overhead and complexity in both storage and computation, are very difficult to apply directly to resource-limited IoT devices [2]. It is therefore essential to reduce the model size of a CNN because of its large computational overhead. To address this issue, previous work primarily focuses on reducing the computational overhead and storage cost of DNNs by carefully designing the network architecture, e.g., VGG [11], GoogLeNet [12], ResNet [13], and MobileNet [14], with regard to complex CNNs for processing images. Besides, network pruning is often adopted to compress the deep neural network itself by removing unimportant inter-layer connections [15, 16], neurons, or entire channels in CNNs [17–19]. An intuitive overview is depicted in Fig. 1.

Fig. 1

A high-level view of how pruning compresses deep neural networks. Traditional pruning schemes can be divided into two classes: unstructured pruning, which simply removes inter-layer connections between neurons, and structured pruning, which removes neurons together with their connected weights, or channels together with their corresponding kernels in the case of convolutional neural networks

Pruning at the scale of kernels in convolutional layers, called filter-level pruning or channel pruning, has been extensively studied and has achieved exciting results, with a huge reduction in computation and negligible loss in accuracy [17–20]. However, these pruning schemes generally follow the basic three-stage procedure (as shown in Fig. 2), i.e., training a redundant network from scratch, pruning it, and re-training it for accuracy recovery [15], which is cumbersome and time-consuming, especially for resource-limited IoT devices, leading to a huge gap between theoretical performance and practical applications. Therefore, improving the traditional pruning-based DNN compression process remains a critical concern before it can be applied efficiently in practical IoT scenarios.

Fig. 2

A comparison of the conventional pruning process (upper) and our proposed one (lower). We divide the original training and pruning procedure into two phases, that is, short-term structure learning and long-term weight learning

To cope with the aforementioned issue, we put forward a more concise and lightweight deep learning scheme that reveals an efficient and compact CNN structure in a more efficient manner. The overall process of our proposed scheme is depicted in Fig. 2. Specifically, the proposed strategy can be divided into two phases, i.e., structure learning and weight learning, where the latter functions in the same way as the conventional training process. During structure learning, we focus on evaluating the significance of each channel and unveiling a compact yet effective structure. To achieve this objective, we propose to evaluate the channels' significance with the Taylor criterion introduced in [17] and to redistribute the remaining channels, which stems from the weight-redistribution proposed in [21, 22]. The Taylor-expansion criterion aims to discover those channels whose removal has more impact on the final loss. However, such a criterion is calculated only on the basis of a single mini-batch. In order to obtain an evaluation of all channels over the entire dataset, we propose a long-term assessment variable called feature saliency, which is computed as a moving average over each batch's evaluation criterion. Simultaneously, considering the common finding that layers are not equally important in a deep neural network [18], we prefer to allocate more channels to sensitive layers, namely pruning fewer parameters in these layers. To achieve this goal, we extend the original weight-redistribution algorithm [21] to convolutional kernels and call it channel redistribution. Generally, inspired by the basic weight-redistribution steps suggested in [21], we first calculate the saliency of different layers, then temporarily remove a certain proportion of channels in each layer, and finally redistribute those removed channels according to the layer-wise saliency. We summarize the novel channel-redistribution algorithm in Fig. 3. After obtaining the compact structure, we remove the surplus channels with their corresponding kernels and train the preserved weights.

Fig. 3

An overview of channel redistribution. The basic procedure includes evaluating the layer-level saliency, temporarily removing a certain proportion of channels, and redistributing them. Note that a sparsification process is conducted at the beginning of channel redistribution

It is noteworthy that the model trained in the final stage (i.e., weight learning) is much smaller than the original one in terms of both computational cost and number of parameters, implying that the training process is relatively fast. In other words, the time-consuming training of a large neural network in traditional pruning methods can be avoided in our proposed scheme. The process of learning the compact structure also solves the problem of how to design an efficient DNN, namely determining the appropriate neural network structure for resource-stringent IoT devices. On the other hand, there is also research on mitigating the time-consuming training of DNNs for IoT applications by introducing distributed architectures [2, 6, 23, 24]. A typical distributed learning process for IoT consists of training a redundant deep neural network on cloud computing servers and then pushing it to edge nodes. Our scheme can further bridge the gap between the redundant neural network at the cloud and the compact ones at edge nodes, as depicted in Fig. 4. Considering a scenario where a large DNN model has already been trained at a cloud server, we can retrain and prune it to obtain compact networks that are better suited to specific IoT tasks. Compared with directly training a compact neural network from scratch, our proposed scheme transfers the knowledge of the original neural network and is able to achieve better performance, e.g., faster convergence and higher efficiency.

Fig. 4

A high-level view of how our proposed scheme converts a large neural network at the cloud level into a lightweight neural network at edge nodes

Our major contribution can be summarized as follows.

  • Inspired by previous work on pruning [17, 21], we propose a novel training strategy for learning compact and efficient neural networks. The proposed scheme achieves comparatively good performance with significantly reduced model size and computational complexity and negligible accuracy loss. Compared with traditional pruning-based DNN compression methods, our scheme is more concise and realizes end-to-end DNN compression. Moreover, our scheme overcomes the dilemma of designing neural network structures through adaptive structure learning.

  • We incorporate our lightweight scheme into common IoT applications and establish a novel paradigm for applying DNNs to IoT scenarios with resource constraints yet heavy tasks. The proposed paradigm is also capable of migrating large deep neural networks to edge computing nodes through compression and re-training, which facilitates efficient adaptation to specific edge tasks.

  • We conduct extensive experiments on various standard benchmark datasets, including CIFAR-10 [25] and ILSVRC-12 [9], and compare with the well-recognized advanced CNN architectures, including VGG [11], ResNet [13], and MobileNet [14]. Simulation results verify the effectiveness of our scheme.

The remainder of this paper is organized as follows: Section 2 reviews the necessary background on deep neural networks and formulates the DNN-based IoT application scenario. Section 3 gives the details of our proposed pruning scheme in terms of mathematical formulation and algorithms, while Section 4 presents the detailed experimental results. Finally, Section 5 summarizes the paper and offers future directions.

2 Background of CNN pruning

2.1 DNN-powered IoT

As mentioned before, the large amount of data produced by IoT devices promotes the application of data-driven deep neural networks to automatically extract useful representations from raw data [2, 6]. Among many deep learning methods, the CNN has been extensively used to process two-dimensional data and is further applied to IoT devices, such as smart wireless cameras [6, 8], and applications [3, 6–8, 26]. Typically, a CNN, composed of convolutional layers, pooling layers, and fully connected layers (as shown in Fig. 5), has a large number of parameters and a huge computational overhead that limits its extensive application to resource-constrained IoT devices. Therefore, reducing the complexity of CNNs has become an imperative research topic, and pruning is one popular means.

Fig. 5

A typical architecture of CNN that consists of several convolutional layers, pooling layers, and fully-connected layers

2.2 Related works on pruning

Unstructured pruning Early works generally focus on pruning deep neural networks by removing redundant weights according to their magnitude [15, 16]. However, in order to obtain the significance of individual weights, they have to train a redundant neural network in advance. In addition, the weights to be pruned are determined by rigidly setting a global magnitude threshold for the whole deep neural network. Later work [27] proposes to improve the traditional pruning process by selectively learning the weights with greater impact on the loss while discarding the others by cutting off their gradient flow. Moreover, both [22] and [21] propose a smoother way, namely redistributing the remaining weights, in order to obtain a proper compact structure instead of setting a threshold. They both suggest allocating more weights to the sensitive layers, although the detailed approaches differ. Our scheme extends their work to another kind of redistribution at the scale of convolutional kernels, namely channel-wise redistribution for structured pruning.

Structured pruning Because weight pruning does not significantly reduce the computational load, researchers began to pay attention to larger-scale pruning, i.e., filter pruning or channel pruning. Specifically, the works [28] and [19] introduce an extra "Group LASSO" loss to compel some kernels, or the corresponding weights in batch-normalization layers [29], to zero and prune them at the end of training. In addition, the work [30] introduces a discrimination-aware loss to keep channels that contribute to the discriminative power of neural networks. Some other methods propose to prune channels by optimizing a reconstruction-error formulation [20, 31], reducing the similarity between features [32, 33], or directly evaluating channels' significance [17, 18]. Our algorithm is based on the evaluation of channel saliency as well. Furthermore, some recent pruning methods introduce advanced machine-learning-based approaches, such as meta-learning [34] and generative adversarial learning [35], which also achieve remarkable results.

Pruning with new paradigms Nearly none of the aforementioned methods deviates from the three basic steps of pruning, that is, training an over-parameterized neural network, pruning it, and fine-tuning it. Based on the argument in [36] that a compact DNN model trained from scratch can reach competitive performance compared with its redundant counterpart, the traditional pruning strategy may be too time-consuming and outdated, and therefore not suitable for the cloud-to-edge distributed computing architecture for IoT applications. Recent work such as [37] introduces a novel pruning strategy that temporarily removes unimportant kernels but keeps them updated during training, namely soft pruning. Moreover, the paper [38] proposes to prune the model from scratch on the basis of random initialization. The model in [38] finds a compact structure by introducing a group-LASSO loss on the batch-normalization layers, in the same way as network slimming [19]. However, our scheme differs in that we are inspired by the works in [17] and [21] and design a completely different structure-learning algorithm by evaluating channels' importance and redistributing channels accordingly. In addition, all the parameters of the neural network are updated simultaneously during the structure-learning process, in contrast to [38], since some prior weights from training the large neural network are still effective for training the compact counterpart, which is better than random initialization. Furthermore, some other pruning methods [39] propose to learn an efficient structure by automatic search, which functions in a similar way to Network Architecture Search (NAS) [40]. In fact, many NAS schemes [41–43] aim to find a proper structure with excellent performance on specific datasets. However, NAS-based schemes require much more computing resources and data to search for connections between neurons or convolutional channels from scratch, while pruning-oriented schemes, based on existing models, aim to reduce the complexity by searching over a smaller space with less resource overhead, and are therefore more suitable for deployment on IoT terminals.

2.3 Potential applications in IoT scenarios

In order to reduce both training and inference costs of DNNs, previous works consider the cloud-and-edge computing architecture for data-heavy IoT applications and propose a distributed computing paradigm [2, 6, 23, 24]. As illustrated in Fig. 4, one may regard our proposed paradigm as a supplement to this architecture, in which we improve the conventional process of copying the parameters from the cloud to the edge by introducing an efficient re-training scheme with structure learning and weight optimization, thereby making the model adaptable to personalized IoT applications while reducing the redundant parameters and computational overhead.

3 Methods

3.1 Notations

We first formally define the notation used throughout the paper. Suppose we have a deep neural network with L convolutional layers; \(\mathbf {w}^{l}_{k}\) and \(\mathbf {z}^{l}_{k}\) represent the convolutional kernel and the corresponding output channel of the l-th convolutional layer, respectively. The subscript k∈[1,⋯,Cl] is the channel index, where Cl denotes the total number of output channels in the corresponding layer. We further use Hl and Wl to denote the height and width of the channels in the l-th layer, respectively. Pruning the k-th channel in layer l means removing the corresponding kernel \(\mathbf {w}^{l}_{k}\). Moreover, we define fl to represent the long-term evaluation of channels, i.e., the feature saliency, \(\mathbf {f}^{l}\in \mathbb {R}^{C^{l}}\). All notations are summarized in Table 1.

Table 1 Notations and their definitions

At the beginning of training, each layer retains the same proportion of channels, which is controlled by the pruning rate p. These preserved channels are adaptively redistributed at the end of each training epoch and, for brevity, are called activated channels. We further define [al]i to represent the number of activated channels in the l-th layer, where the subscript i refers to the training epoch. The initialized values of [al]i are

$$ [a^{l}]_{0} = pC^{l}~~~\forall~l, 1\le l\le L $$
(1)

3.2 Criterion of channel significance

In order to evaluate the channels' saliency, we adopt a Taylor-expansion-based criterion [17]. Considering a mini-batch B={X={x1,x2,...,xm},Y={y1,y2,...,ym}}, the final loss on the batch B can be defined as J(B,W), where W represents the network parameters. Suppose the kernel \(\mathbf {w}^{l}_{k}\) with its activation \(\mathbf {z}^{l}_{k}\) is removed; the corresponding impact on the cost function J can be expressed as

$$ \left|\Delta J(\mathbf{z}^{l}_{k})\right| = \left|J(B,\mathbf{z}^{l}_{k}) - J(B, \mathbf{z}^{l}_{k}\to 0)\right| $$
(2)

We use the Taylor series to expand the cost function at point \(\mathbf {z}^{l}_{k}=0\)

$$ J(B, \mathbf{z}^{l}_{k}\to 0) = J(B, \mathbf{z}^{l}_{k}) - \frac{\partial J}{\partial \mathbf{z}^{l}_{k}}\mathbf{z}^{l}_{k} + o\left(\left(\mathbf{z}^{l}_{k}\right)^{2}\right) $$
(3)

Ignoring the higher-order remainder and substituting (3) into (2), we have

$$ \begin{aligned} \Theta_{k}^{l} \triangleq \left|\Delta J(\mathbf{z}^{l}_{k})\right| &= \left|J(B,\mathbf{z}^{l}_{k}) - J(B, \mathbf{z}^{l}_{k}) + \frac{\partial J}{\partial \mathbf{z}^{l}_{k}}\mathbf{z}^{l}_{k}\right| \\ &= \left|\frac{\partial J}{\partial \mathbf{z}^{l}_{k}}\mathbf{z}^{l}_{k}\right| \end{aligned} $$
(4)

The criterion in (4) can be regarded as a measure of a feature map's significance when the channel output is a single entry. For a channel with a multi-variate output, the term \(\Theta ^{l}_{k}\) can be rewritten as

$$ \Theta_{k}^{l} = \left|\frac{1}{M}\sum_{m=1}^{M}\frac{\partial J}{\partial z^{l}_{k, m}} z^{l}_{k, m}\right| $$
(5)

where M is the total number of the channel's entries. The computation of \(\Theta _{k}^{l}\) requires the activation and the gradient, which are obtained from the forward and backward propagation, respectively. Furthermore, we apply an extra re-scaling step with max-normalization, that is

$$ \hat{\Theta}_{k}^{l} = \frac{\Theta_{k}^{l}}{\max\limits_{j}\left\{\Theta_{j}^{l}\right\}} $$
(6)

Such a normalization step is essential since we need to ensure that the evaluation values of each layer are on the same scale. Its function is similar to batch-normalization [29], which ensures that the statistics of the layer-wise evaluation values follow the same distribution. Equation (6) indicates that the maximum criterion value of each layer is normalized to 1, resulting in a comparable scale for the feature saliency fl, which is defined as a long-term estimate for individual channels

$$ \left[f^{l}_{k}\right]_{\text{new}}= \epsilon\left[f^{l}_{k}\right]_{\text{old}} + \hat{\Theta}_{k}^{l},f^{l}_{k}\in\mathbf{f}^{l},1\le k\le C^{l} $$
(7)

where the hyper-parameter ε is a smoothing factor set to 0.98 for all experiments in this paper. The feature saliency helps determine which channels are retained when the structure is fixed. Note that the values of fl are updated with each mini-batch in a training epoch; we omit the epoch index i for simplicity of presentation.
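A minimal PyTorch sketch of Eqs. (5)–(7) is given below. The tensor names `activation` (z^l for one mini-batch, shape [N, C, H, W]) and `grad` (∂J/∂z^l, e.g., gathered with forward/backward hooks), the averaging over all N·H·W entries of a channel, and the function name are assumptions made for illustration:

```python
import torch

def update_feature_saliency(activation, grad, feature_saliency, eps=0.98):
    # Eq. (5): Theta_k = | (1/M) * sum_m dJ/dz_{k,m} * z_{k,m} |,
    # averaging over all M = N*H*W entries of each channel k.
    theta = (grad * activation).mean(dim=(0, 2, 3)).abs()        # shape [C]
    # Eq. (6): max-normalization so every layer's criterion peaks at 1.
    theta_hat = theta / theta.max().clamp(min=1e-12)
    # Eq. (7): long-term accumulation with smoothing factor eps = 0.98.
    return eps * feature_saliency + theta_hat

# Toy usage with random tensors standing in for a real forward/backward pass.
activation = torch.randn(8, 16, 32, 32)
grad = torch.randn(8, 16, 32, 32)
f = update_feature_saliency(activation, grad, torch.zeros(16))
```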

3.3 Channel redistribution

The proposed channel-redistribution process occurs at the end of each training epoch, which is indicated by the subscript i. Since the aforementioned feature saliency is evaluated per channel, it is necessary to calculate the significance of each layer, which contains several channels, in order to obtain an efficient structure. Let [ξl]i denote the corresponding layer's significance at epoch i

$$ [\xi^{l}]_{i} = \frac{\sum_{k} \hat{f}^{l}_{k}}{[a^{l}]_{i-1}},~~\hat{f}^{l}_{k}\in\hat{\mathbf{f}}^{l} $$
(8)

where \(\hat {\mathbf {f}}^{l}\subset \mathbf {f}^{l}\) is the subset containing the largest feature-saliency values of the corresponding layer, and the total number of elements in \(\hat {\mathbf {f}}^{l}\) is [al]i−1. Next, we normalize all [ξl]i so that their sum is 1.

$$ [\hat{\xi}^{l}]_{i} = \frac{[\xi^{l}]_{i}}{\sum_{j}[\xi^{j}]_{i}} $$
(9)

Looking again at the channel-redistribution process shown in Fig. 3, after obtaining each layer's significance evaluation, the following step is to temporarily remove a fixed proportion of channels to release some reallocation space, and then to redistribute channels according to the calculated layer significances, that is, to update the number of activated channels from [al]i−1 to [al]i in each layer. Given that the updated value may exceed the maximum number of channels in the corresponding layer, the value of [al]i is limited to Cl, the total number of channels in the original structure, as in the following formula

$$ [a^{l}]_{i} = \min\left\{(1-s)[a^{l}]_{i-1} + [\hat{\xi}^{l}]_{i}\left(s\sum_{l=1}^{L} [a^{l}]_{i-1}\right), C^{l}\right\} $$
(10)

where the sparsity s is a predefined hyper-parameter that indicates how many channels are reallocated each time. Different from the original work [21], which adjusts the value of s throughout training, the sparsity s is fixed to 0.5 in our experiments. Moreover, we allocate the extra channels evenly among the other layers if necessary. All the relevant details are shown in Algorithm 1.
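The redistribution step of Eqs. (8)–(10) can be sketched as follows. The list-based bookkeeping, the omission of the even reallocation of any capped overflow, and the helper name `redistribute_channels` are assumptions made for illustration only:

```python
import torch

def redistribute_channels(saliency, active, total, s=0.5):
    """saliency: list of feature-saliency vectors f^l; active: [a^l]_{i-1};
    total: original channel counts C^l; s: sparsity (fixed to 0.5 here)."""
    # Eq. (8): layer significance = mean of the a^l largest feature saliencies.
    xi = [torch.topk(f, k=a).values.sum().item() / a
          for f, a in zip(saliency, active)]
    # Eq. (9): normalize the layer significances so that they sum to one.
    xi_hat = [x / sum(xi) for x in xi]
    # Eq. (10): keep a (1-s) share of each layer and redistribute the released
    # budget s * sum(a^l) by significance, capping each layer at C^l.
    budget = s * sum(active)
    return [min(round((1 - s) * a + x * budget), c)
            for a, x, c in zip(active, xi_hat, total)]

# Toy usage with three layers and random saliencies.
sal = [torch.rand(16), torch.rand(32), torch.rand(64)]
print(redistribute_channels(sal, [8, 16, 32], [16, 32, 64]))
```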

Furthermore, after uncovering a suitable compact structure, we need to remove the insignificant channels with their convolutional kernels and train the remaining weights to recover the representative capability, as depicted in Fig. 2. In the pruning step, the remaining channels are determined according to their feature saliency fl as well as the number of activated channels al in the corresponding layer. Overall, we summarize the full pruning scheme in Algorithm 2. Since channel pruning is only applied to convolutional layers, we omit the batch-normalization [29], activation, pooling, and fully connected layers for simplicity.
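As a small illustration of this pruning step, the a^l most salient channels of a layer can be selected and the layer's kernel tensor sliced as shown below; the index handling for the next layer's input channels and for batch-normalization parameters is omitted, and all names are illustrative:

```python
import torch

def select_kept_channels(feature_saliency, a):
    # Keep the a channels with the largest feature saliency (sorted indices).
    return torch.topk(feature_saliency, k=a).indices.sort().values

f = torch.rand(16)                 # feature saliency f^l of a 16-channel layer
keep = select_kept_channels(f, 8)  # a^l = 8 activated channels
w = torch.randn(16, 3, 3, 3)       # convolutional weight [C_out, C_in, K, K]
w_pruned = w[keep]                 # pruned weight of shape [8, 3, 3, 3]
```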

3.4 Discussion

As most of the heavy computation is concentrated in the convolutional layers, we only need to pay attention to the computational overhead, or savings, in these layers. Suppose the output channel size of the l-th layer is Hl×Wl and the final number of activated channels is Al; accordingly, (Cl−Al) kernels in the corresponding layer will be removed. Therefore, the dimension of the remaining channels in the l-th layer is Hl×Wl×Al, and the computation in terms of FLOPs (floating-point operations) in this layer decreases from K2×Cl−1×Hl×Wl×Cl to K2×Al−1×Hl×Wl×Al, where K indicates the kernel size. Compared to the raw FLOPs of individual layers, a reduced ratio of \(\left (1-\frac {A^{l-1}A^{l}}{C^{l-1}C^{l}}\right)\) is obtained, leading to a large decrease in the computational cost of the CNN.
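The per-layer saving can be verified with a short calculation; the layer shapes below are illustrative values only:

```python
# FLOPs (multiply-accumulates) of one convolutional layer:
# K^2 * C_in * H * W * C_out, as in the expression above.
def conv_flops(k, c_in, c_out, h, w):
    return k * k * c_in * h * w * c_out

orig = conv_flops(3, 128, 128, 32, 32)   # unpruned: C^{l-1} = C^l = 128
pruned = conv_flops(3, 64, 64, 32, 32)   # pruned:   A^{l-1} = A^l = 64
print(1 - pruned / orig)                 # 1 - (64*64)/(128*128) = 0.75
```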

4 Results and discussion

4.1 Experimental setting

We evaluate our scheme on representative benchmark datasets, including CIFAR-10 [25] and ILSVRC-12 [9], and compare it with advanced DNN architectures, including VGG [11], ResNet [13], and MobileNet [14]. CIFAR-10 contains 50,000 training images and 10,000 testing images, which are categorized into 10 classes. We follow the common data augmentation suggested by [13] with shifting and mirroring. The networks are trained from scratch using stochastic gradient descent (SGD) with an initial learning rate of 0.1. The learning rate is decayed by a factor of 10 at every one third of the total number of iterations. The weight decay and momentum are 10−4 and 0.9, respectively. ILSVRC-12 contains 1.3 million training images and 50,000 validation images without a test set. When evaluating on ILSVRC-12, we also follow the training settings and data-augmentation strategy suggested by [13] and adopt PyTorch [44] as the fundamental framework of our experiments. Note that advanced CNN architectures such as VGG and MobileNet are designed for large datasets like ImageNet; re-training and pruning them to fit a small dataset like CIFAR-10 can be viewed as a suitable verification platform for the distributed cloud-and-edge training process of IoT applications.
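For concreteness, a hedged sketch of this CIFAR-10 training configuration is shown below; the total epoch count and the use of torchvision's VGG-16 as a stand-in for the paper's CIFAR-adapted VGG are assumptions, not values reported in the paper:

```python
import torch
import torchvision
from torchvision import transforms

total_epochs = 180   # illustrative value; not stated in the paper
model = torchvision.models.vgg16(num_classes=10)

# SGD with initial learning rate 0.1, momentum 0.9, and weight decay 1e-4,
# decayed by a factor of 10 at every one third of the schedule.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[total_epochs // 3, 2 * total_epochs // 3], gamma=0.1)

# Standard CIFAR-10 augmentation with shifting (random crop) and mirroring.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```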

To verify the effectiveness of our proposed scheme, we compare its performance with that of various state-of-the-art pruning approaches, including PFEC [18], NS [19], CP [31], ThiNet [20], SFP [37], CFP [32], DCP [30], FPGM [45], COP [33], GAL [35], PFS [38], and ASS [39]. Moreover, we present the performance of our scheme in terms of both theoretical and practical acceleration with respect to various pruning rates to show its robustness and efficiency. Overall, our proposed method achieves comparable and satisfactory results with a more concise pruning procedure, which can be effectively incorporated into the common distributed training paradigm for anticipated IoT applications.

4.2 Experiments on CIFAR-10

Pruning VGG. Although VGG is not designed for a small dataset like CIFAR-10, previous works have studied its performance at extremely high pruning rates. We first train an original, non-pruned 16-layer VGG as the baseline (no pruning) and then run several experiments with different pruning rates from scratch. We compare the testing accuracy with that of previous state-of-the-art approaches and summarize the corresponding results in Table 2.

Table 2 Results of pruning VGG on CIFAR-10

As shown in Table 2, our proposed scheme achieves comparable results to the aforementioned state-of-the-art methods at different reductions in FLOPs and parameters. For instance, a compact model with a 49.3% drop in FLOPs achieves superior accuracy compared to the baseline. In the case where the FLOPs and the number of parameters are reduced by 72.6% and 94.1%, respectively, the pruned VGG based on our scheme still maintains an applicable accuracy of 93.27% for this dataset.

Pruning ResNet. Since compact ResNet architectures with fewer channels in each layer are constructed in [13] for recognizing images from CIFAR-10, we adopt the recommended 32-layer and 56-layer ResNets as baselines (no pruning). Specifically, because the number of input/output channels within a residual block must be consistent to ensure the short-cut connection, we only prune the first layer's output channels in each block.

It can be observed from Table 3 that our proposed strategy achieves competitive results. For example, the compact ResNet-32 with 49.0% reduction in FLOPs and 60.1% reduction in parameters still retains an accuracy of 92.50% (i.e., 93.20% − 0.70% = 92.50%). In addition, further experiments on pruning ResNet-56 verify the effectiveness of our algorithm. For example, in the case where the FLOPs and parameter reductions are 49.6% and 58.0%, respectively, the accuracy of the compact model established by our scheme decreases by only 0.17%.

Table 3 Results of pruning ResNet on CIFAR-10

Pruning MobileNet. We design a MobileNet-like neural network with fewer layers for simplicity. Its original structure contains ten blocks, with each block including a depth-wise convolutional layer and a point-wise convolutional layer [14]. Since the number of output channels of a depth-wise convolutional layer changes as soon as the channel number of its preceding point-wise layer changes, we only need to focus on pruning channels in the point-wise convolutional layers. The pruning results are shown in Table 4. Overall, our algorithm still achieves good performance even for such a computationally efficient architecture. For example, when the FLOPs and parameter compression ratios increase to 61.3% and 92.9%, respectively, the accuracy loss is only 0.27%.

Table 4 Results of pruning MobileNet on CIFAR-10

4.3 Experiments on ImageNet

We adopt the widely studied ResNet-50 architecture, as in previous pruning approaches. Different from the general ResNet architecture, ResNet-50 contains a special structure called a "bottleneck" [13], which includes three convolutional layers per residual block with only the middle layer being spatially expressive. Similar to pruning ResNet on CIFAR-10, we focus on pruning the channels of the first two layers in a bottleneck, so that we do not need to worry about the identity mapping when copying the parameters to a compact model. We summarize the experimental results on ILSVRC-12 in Table 5, where we report the performance of both the advanced approaches and ours. It can be observed that the pruned model based on our scheme reaches a comparable accuracy along with a significant reduction in both FLOPs and parameters.

Table 5 Results of pruning ResNet-50 on ILSVRC-12

It is worth noting that our method is indeed not as effective as some state-of-the-art algorithms. However, these advanced algorithms add extra training strategies or extend the training time, whereas our algorithm is simple and efficient and can thus be deployed in a wide range of IoT scenarios.

4.4 Trade-off between performance and compression rate

In practical IoT scenarios, it is necessary to balance performance and compression rate according to different computing requirements and energy-consumption restrictions. On the other hand, showing the performance at various compression rates also illustrates the robustness and efficiency of a pruning algorithm. Thus, in this section, we explore the performance of our scheme under different pruning rates. For all experiments with different network architectures, we use the same hyper-parameter settings. We summarize the results in Figs. 6, 7, and 8, which correspond to ResNet, VGG, and MobileNet, respectively.

Fig. 6

The pruned results with respect to various pruning rates, which are obtained by pruning ResNet-32 on CIFAR-10

Fig. 7

The results are based on pruning VGG on CIFAR-10 with various pruned proportions in both FLOPs and parameters

Fig. 8

Pruning MobileNet on CIFAR-10 with the compression-rate varying from 0.0 to 0.9

As can be observed in Fig. 6, the ResNet architecture is sensitive to pruning. When the proportion of reduced FLOPs increases to 0.6, the accuracy drops by nearly 1.0%. In contrast, when pruning VGG and MobileNet, our proposed scheme is more robust across various reductions in FLOPs as well as pruned parameters. As depicted in Figs. 7 and 8 for pruning VGG and MobileNet, respectively, our proposed strategy obtains efficient neural network structures with even higher testing accuracies than their baselines at low compression rates. Such interesting results also indicate that compact models may outperform redundant models to some extent, which implies that a prerequisite of efficient training is to unveil a superior neural network with a suitable structure.

4.5 The uncovered compact structures

In this section, we examine the sub-network architectures revealed by our proposed method. Note that a practical problem of deploying DNNs is how to design appropriate lightweight structures adapted to resource-limited IoT computing tasks, so learning the compact structures can help us design efficient neural networks beyond the state-of-the-art architectures. As seen from Fig. 9, compared with the original deep neural network with no pruning, our scheme keeps more channels in the middle layers of the network while pruning more channels in the last layers and the first layer in the case of pruning VGG on CIFAR-10. The discovered structure suggests that the middle layers are more sensitive whereas the first layer and the last layers are easier to prune, which is consistent with the previous findings in [18, 19], indicating the effectiveness of our proposed method.

Fig. 9

Channel distribution of the pruned VGG on CIFAR-10. The abscissa indicates the indices of layers and the ordinate indicates the number of reserved channels accordingly

It can be observed from Fig. 10 that when pruning ResNet on CIFAR-10, the compact model tends to retain more channels in layers where the number of channels doubles, suggesting that those layers are more salient. A similar phenomenon is found when pruning ResNet on ImageNet. As depicted in Fig. 11, although the distribution of the pruned channels appears somewhat disordered, more channels are still retained in the "turning-point" layers where the number of channels in the original neural network jumps abruptly. Our discovered compact ResNet structure is consistent with the conclusion of the sensitivity analysis in [18].

Fig. 10

Channel distribution of the pruned model for ResNet on CIFAR-10. We only present the pruned layers, namely the first layer within each residual block

Fig. 11

The architecture of pruned ResNet-50 on ImageNet

4.6 Acceleration in practice

In this section, we report the practical running-time acceleration of the compressed neural networks. We test all compact CNNs on several Intel E5 CPUs with the PyTorch deep learning framework under Ubuntu 16.04. Because the running time on GPUs is too short to reveal the differences among methods, and GPUs are not representative of practical IoT devices, we do not report acceleration results on GPUs. For each compact neural network, we measure the forward-propagation time over 100 rounds and average the results. The overall experimental results are presented in Table 6, which lists both the theoretical amount of computation in FLOPs and the practical acceleration results.
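A minimal version of this timing protocol is sketched below; the batch size of 1, the input resolution, and the torchvision ResNet-50 stand-in for a pruned model are illustrative choices rather than the paper's exact setup:

```python
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    model(x)                                   # warm-up forward pass
    start = time.perf_counter()
    for _ in range(100):                       # average over 100 rounds
        model(x)
    avg = (time.perf_counter() - start) / 100
print(f"average forward time: {avg * 1000:.1f} ms")
```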

Table 6 Theoretical amount of computation in FLOPs and the corresponding acceleration in practice

As shown in Table 6, the results in each row are obtained by reducing the FLOPs of the corresponding neural network model by 50%, and the practical acceleration is consistently effective and impressive for all representative CNN architectures. In addition, the actual acceleration of MobileNet is significantly higher than that of both ResNet-50 and ResNet-56, indicating its potential suitability for resource-stringent IoT devices.

4.7 Training time measurement

In fact, one important issue hindering the application of DNNs is the cost of training time. Our scheme is more efficient because both structure learning and weight learning are relatively fast compared to conventional training, especially when the initial weights are transferred from pre-trained models (e.g., inheriting the network parameters from the cloud). Specifically, we experiment on one Nvidia RTX-2080 GPU with PyTorch on the CIFAR-10 dataset. Figure 12 compares the normalized training time of all neural networks. It can be observed from Fig. 12 that the time cost of structure learning is much shorter than that of parameter optimization, which indicates that our scheme is very efficient in finding compact structures. In addition, the total training time decreases as the pruning rate increases in all experiments, which further implies the efficiency of our proposed scheme.

Fig. 12

The normalized training time w.r.t. different pruning rates and CNN architectures. The training time refers to the total time spent in training the neural networks to achieve the highest testing accuracy

5 Conclusions

In this paper, we proposed a novel pruning-based paradigm that aims to apply DNNs, especially CNNs, to resource-limited IoT scenarios. Our proposed scheme has the capability to train and compress deep neural networks simultaneously. Specifically, we introduced a heuristic algorithm to learn both the architecture and the weights of the targeted neural network. Once a compression rate is given, our scheme can turn a redundant, randomly initialized neural network into a compact, representative one. Extensive experiments have illustrated the effectiveness of our scheme, which reduces the complexity of a redundant CNN while maintaining its performance; for example, the pruned VGG attains a satisfying accuracy of 93.27% with dramatic reductions in FLOPs and in the number of parameters (i.e., 72.6% and 94.1%, respectively). In addition, extensive experiments also verify the performance of our scheme at various pruning rates in terms of both theoretical acceleration and practical running-time reduction.

As mentioned before, our proposed strategy realizes efficient end-to-end training and compression of CNNs and can be incorporated into the conventional distributed computing paradigm to apply deep learning to resource-limited IoT applications. Moreover, our scheme is lightweight and can be easily extended to other types of DNNs. For future work, we will apply the proposed pruning scheme to actual IoT scenarios to further verify its effectiveness.