1 Introduction

The Artificial Intelligence of Things (AIoT), a promising integrated technology that combines artificial intelligence and the Internet of Things, is drawing significant interest [1]. However, responses from AIoT systems often suffer from unacceptable latency due to limited network bandwidth and unstable communication [2, 3]. The current trend is to implement deep learning algorithms on edge devices, which process the raw data close to its source [4].

Fig. 1 Diagram of the convolution operation. The dotted parts indicate data processing operations

Convolutional neural networks (CNNs) are highly regarded among deep learning technologies due to their impressive performance in a variety of applications such as object recognition [5,6,7], healthcare [8, 9], image generation [10], and anomaly detection [11, 12]. However, the development of CNNs is accompanied by growing memory usage and computational complexity, whereas edge devices are heavily constrained in computation power, memory bandwidth, and power consumption [13]. As a result, the latency of running a full CNN on an edge device is normally unacceptable.

Compression and quantization are therefore commonly required before deploying CNN-based applications on edge devices [14]. CNN compression methods include channel pruning [15], knowledge distillation [16, 17], and matrix decomposition, among others. Channel pruning methods aim to identify less important channels (i.e., filters) and remove them.

Although channel pruning works well for reducing Floating Point Operations (FLOPs), it has a limited effect on latency. We divide the low-level operations of a convolutional layer into matrix–vector multiplication (MVM) operations and data processing operations, as shown in Fig. 1. The MVM operations mainly come from the convolution between the filters and the inputs; they account for most of the FLOPs [18] and are therefore the main compression target of channel pruning. Data processing operations are performed before and after the MVM operations, including padding of feature maps, rearrangement of the input feature map (Im2col), re-quantization of the output, and storage of results [19]. The latency due to data processing operations is defined as start-up latency.
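As a rough illustration (not the paper's implementation), the following PyTorch sketch decomposes a convolution into its data processing part (padding and Im2col via `torch.nn.functional.unfold`) and the single large matrix multiplication that accounts for most of the FLOPs; all tensor shapes are arbitrary examples.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: a convolution split into data processing (padding + Im2col)
# and the matrix multiplication that dominates the FLOPs.
x = torch.randn(1, 16, 32, 32)          # input feature map (N, C_in, H, W)
w = torch.randn(32, 16, 3, 3)           # filters (C_out, C_in, K, K)

# Data processing: pad the input and rearrange it into columns (Im2col).
cols = F.unfold(x, kernel_size=3, padding=1)        # (1, 16*3*3, 32*32)

# MVM: one large matrix multiplication between flattened filters and columns.
out = w.view(32, -1) @ cols                          # (1, 32, 32*32)
out = out.view(1, 32, 32, 32)                        # store / reshape the result

# Reference: the same result via the fused convolution operator.
ref = F.conv2d(x, w, padding=1)
print(torch.allclose(out, ref, atol=1e-3))
```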

Fig. 2 Variation in latency and FLOPs when pruning the convolutional layer with 8 output channels

Figure 2 shows the variation in latency and FLOPs when pruning the output channels of a convolutional layer. The reduction in latency is far less significant than the reduction in FLOPs: even when only one output channel is left, 80% of the latency remains, which marks the limit of channel pruning. Pruning the output channels effectively reduces MVM operations but does not optimize the data processing operations on the input side. Because of this limitation, significant start-up latency remains after pruning.

Fig. 3 Diagram of the CNN architecture optimization and reconstruction method. The top of the figure represents the GC stage, which constrains the main path of the residual block via the adjunct layers. The bottom shows the SLR stage, where the main path of redundant residual blocks is removed

To effectively optimize start-up latency, the data processing operations on both the input and output sides should be reduced, but mainstream pruning strategies have difficulty achieving this. In recent years, many networks have adopted the design of residual blocks [20]. A residual block consists of two parts: the main path and the residual connection. The main path is composed of multiple weight layers, including convolution, batch normalization (BN), and activation layers. The residual connection adds the input directly to the output of the main path, which requires the input and output tensors to have the same shape. Consequently, the pruning strategy for residual blocks adopted by most studies [21, 22] is to keep all input channels of the first layer and all output channels of the last layer. Improving this strategy is therefore worthwhile for further latency reduction.
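A minimal sketch of such a residual block, assuming the common PyTorch BasicBlock layout rather than the exact code of any network in this paper, makes the shape constraint explicit: the element-wise addition forces the main path to preserve the input channel count.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block sketch (assumed structure, not the paper's code)."""
    def __init__(self, channels):
        super().__init__()
        # Main path: two Conv-BN pairs; the inner width can be pruned freely.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The residual addition requires the main-path output shape to equal the
        # input shape, so conv1's input channels and conv2's output channels are
        # usually kept intact by channel pruning.
        return self.relu(out + x)
```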

This observation motivates us to propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The CNNs are optimized in two stages: Global Constraint (GC) and Start-up Latency Reduction (SLR). The GC stage aims to achieve lossless channel pruning: the main paths are constrained by the adjunct layers, and the expression of redundant channels is blocked; the adjunct layers are then equivalently converted into BN layers to achieve channel pruning. The SLR stage aims to optimize start-up latency: residual blocks that no longer function effectively under the constraints are identified and pruned. Finally, the optimized networks are implemented on multiple platforms, and the reductions in latency and FLOPs are evaluated.

The main contributions of the paper are as follows:

  • We improve the mainstream pruning strategy to further reduce latency. Experimental results show that this approach reduces latency more than channel pruning alone.

  • We propose a general CNN application acceleration approach. The optimized networks are deployed on desktop CPU [23], GAP8 [24], FPGA ZCU102 [25], and Raspberry Pi 4 Model B [26] platforms through the official deployment flow of each device. Significant latency reductions are achieved without application-specific optimizations.

The rest of the paper is organized as follows: Sect. 2 introduces related works. Section 3 details the methodology. Section 4 shows the experimental results. Section 5 concludes this paper.

2 Related work

2.1 Channel pruning

Channel pruning is a common method for compressing CNNs. Since the performance of a CNN depends on a huge number of parameters, the challenge of channel pruning is the trade-off between accuracy and compression effect.

To minimize performance loss, removing low-importance channels [27, 28] or feature maps [29,30,31] is a practical solution. Kuang et al. [21] argue that directly measuring the effect of filters on a task-related loss function is a better choice than relying on the magnitude of the weights. In detail, the network is randomly pruned several times, and the actual effect of the corresponding filter on the network is calculated with the proposed task-related loss function after each channel is removed. Channels with a low measured effect are then discarded until the FLOPs reach the target, and the pruned network is fine-tuned.

Some methods [32,33,34,35] introduce regularization terms into the optimization objective to create sparsity in the parameters. Chen et al. [36] propose a collaborative channel pruning method. They found that many previous studies evaluated channel importance based on a single structure only, which may lead to the mistaken removal of important channels. Therefore, L1 regularization is applied to the convolutional layer weights and the BN layer weights, respectively, to enhance sparsity. The effect of each channel is then evaluated according to both the convolutional layer and the corresponding BN layer, and channels with low effects are removed. Atashgahi et al. [37], inspired by biological brain evolution and Hebbian learning theory, propose a sparse training method based on the behavior of neurons. Specifically, at each epoch the weights with the smallest magnitude are dropped, and the most important connections to add are then selected based on the cosine similarity of each pair of neurons in two consecutive layers.

In some methods, channel pruning is converted into another optimization problem. Ding et al. [38] argue that pruning methods degrade network performance because the structural sparsity they impose may change the optimization objective, leaving the parameters far from optimal. They therefore insert a compactor, a \(1 \times 1\) convolutional layer, after each convolutional layer, and channel pruning is converted into a compactor sparsification problem: a penalty is applied to the compactors to create sparsity, and pruning is then achieved by parameter merging. Guo et al. [39] propose Differentiable Markov Channel Pruning (DMCP). In DMCP, channel pruning is modeled as a Markov process, where the retention of each channel is a state and pruning represents the transition between states. The probability of retaining the \((k + 1)^{{{\text{th}}}}\) channel given that the \(k^{{{\text{th}}}}\) channel is retained is expressed as a learnable parameter. With this approach, the optimal architecture is explored during training.

However, the input and output channels of the residual block are not modified in these proposals, which means that latency could be further optimized.

2.2 Deployment

We summarize the common deployment flow in the following steps: (1) Network compression, including channel pruning and knowledge distillation. (2) Quantization; network parameters are generally quantized to INT8 to reduce memory access latency [40]. (3) Computation node fusion. The network is transformed into a computation graph in which each operation of the original network corresponds to one computation node. Normally, less compute-intensive nodes, such as BN and activation layers, are fused into the convolutional layer nodes to reduce start-up latency. Owing to the complexity of convolutional layers, fusing two convolutional layers into one simpler node is not available in existing deployment methods. (4) Compiling, which includes hardware-level optimizations such as instruction scheduling and memory reuse. Steps (3) and (4) depend heavily on the CNN acceleration libraries provided by the developer [41].
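The Conv-BN fusion mentioned in step (3) can be illustrated by the standard folding of BN statistics into the preceding convolution; this is a generic sketch, not the internal implementation of any particular deployment tool.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm node into the preceding convolution (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused
```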

Each framework and device has official deployment tools provided by its developer; here the deployment tools of two edge devices are introduced. GAP8 is an IoT application processor based on RISC-V and the PULP platform [42], developed by GreenWaves Technologies, and is characterized by low power consumption and parallel processing. The developer provides the deployment toolset GAP flow [43]: developers can quantize and deploy CNNs using NNTOOL [44] and AutoTiler [45] and simulate them on GVSoC [46]. The development environment Vitis-AI [25] accelerates AI inference on AMD hardware platforms, including the FPGA Zynq UltraScale+ MPSoC ZCU102 [47]. It includes optimized IP cores, the AI Quantizer for quantizing CNNs, and the AI Compiler for optimizing and compiling the computational graph of a CNN.

3 Methods

Removing residual blocks from a network with low accuracy loss is challenging. The output of each residual block is obtained from both the main path and the skip connection, so losing the features from either path would cause unacceptable harm. Therefore, the optimization consists of two stages, GC and SLR. First, the GC stage performs channel pruning: constraints are added to the main paths, weakening the effect of the redundant main paths. Next, the pruned network is further optimized by the SLR stage to reduce start-up latency: the SLR stage prunes redundant residual blocks. Benefiting from the constraints of the GC stage, the damage caused by this pruning is minimized. The optimization process is illustrated in Fig. 3.

Fig. 4 Diagram of the Conv-BN-ad structure

3.1 Global constraint

3.1.1 Definition of adjunct layer

To convert the optimization of the model structure into a tractable problem, adjunct layers are introduced. A convolutional layer and the following BN layer are defined as Conv-BN. An adjunct layer is inserted after the Conv-BN to form the Conv-BN-ad structure, as shown in Fig. 4. The insertion locations are described in detail in Sect. 4.1. The output channels of the adjunct layer are divided into groups, and the output of each group is scaled by a learnable parameter called the expression parameter, which is limited to [0, 1]. With the introduction of the expression parameter, the pruning of a channel can be treated as a continuous variation.

Next, the output of Conv-BN-ad is formulated. Let \(\gamma _i\) and \(\beta _i\) denote the weight and bias of the \(i^{{{\text{th}}}}\) channel of the BN layer, \(\mu _{B}\) and \(\sigma _{B}\) the batch mean and variance, \(\epsilon\) an arbitrarily small constant, and \(\mathbb {R}_k\) the expression parameter of the \(k^{{{\text{th}}}}\) group, which contains the \(i^{{{\text{th}}}}\) channel. The \(i^{{{\text{th}}}}\) output of the adjunct layer is then:

$$\text{Output}_i = \mathbb{R}_k \cdot \left( \gamma_i \frac{O_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta_i \right).$$
(1)
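A possible implementation of the adjunct layer, written as a PyTorch module, is sketched below. The class name, the grouping scheme, and the use of clamping to keep the expression parameters in [0, 1] are our assumptions; only the scaling of the BN output by a per-group parameter follows Eq. 1.

```python
import torch
import torch.nn as nn

class AdjunctLayer(nn.Module):
    """Sketch of an adjunct layer following a Conv-BN pair (cf. Eq. 1).

    Each group of output channels is scaled by a learnable expression
    parameter R_k that is kept inside [0, 1].
    """
    def __init__(self, num_channels: int, num_groups: int):
        super().__init__()
        assert num_channels % num_groups == 0
        self.group_size = num_channels // num_groups
        # Initialized from a standard normal distribution, as in the GC stage.
        self.expr = nn.Parameter(torch.randn(num_groups))

    def forward(self, bn_out: torch.Tensor) -> torch.Tensor:
        # Limit the expression parameters to [0, 1] and expand them per channel.
        r = self.expr.clamp(0.0, 1.0).repeat_interleave(self.group_size)
        return bn_out * r.view(1, -1, 1, 1)
```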

To avoid extra cost on the edge device, the additional structures should be removed before deployment. The adjunct layers are removed in two steps: parameter merging and channel pruning. First, according to Eq. 1, \(\mathbb {R}_k\) can be merged into the weight and bias of the BN layer. The merged weight \(\hat{\gamma _i}\) and bias \(\hat{\beta _i}\) are obtained as:

$$\begin{aligned} \hat{\gamma _i}&= \mathbb {R}_k \cdot \gamma _i \\ \hat{\beta _i}&= \mathbb {R}_k \cdot \beta _i \end{aligned}$$
(2)

Second, channels are pruned according to \(\hat{\gamma _i}\). If the absolute value of \(\hat{\gamma _i}\) is close to zero, the output of the \(i^{{{\text{th}}}}\) channel can be pruned harmlessly. In the experiments, the channels with \(|\hat{\gamma _i}|\) less than \({\text{threshold}}_{{{\text{BN}}}}\) are pruned, where \({\text{threshold}}_{{{\text{BN}}}}\) is set to \(1 \times 10 ^{-3}\). According to our measurements, pruning at this threshold causes only minor damage to accuracy. Finally, the adjunct layers are removed.
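The removal procedure could be sketched as follows; the function and attribute names follow the hypothetical `AdjunctLayer` above and are not taken from the paper.

```python
import torch

def merge_and_select(bn, adjunct, threshold_bn: float = 1e-3):
    """Merge the expression parameters into the BN layer (Eq. 2) and return
    the indices of the channels that survive pruning."""
    r = adjunct.expr.detach().clamp(0.0, 1.0).repeat_interleave(adjunct.group_size)
    bn.weight.data *= r          # gamma_hat = R_k * gamma
    bn.bias.data *= r            # beta_hat  = R_k * beta
    keep = torch.nonzero(bn.weight.data.abs() >= threshold_bn).flatten()
    return keep                  # used afterwards to slice the conv/BN channels
```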

3.1.2 Loss function

The GC stage is intended to add constraints to the expression parameters and achieve lossless pruning. To this end, the loss function \({\text{loss}}_{{{\text{GC}}}}\) is proposed. It contains three parts: the FLOPs constraint term \({\text{loss}}_{{{\text{FLOPs}}}}\), the sparse term \({\text{loss}}_{{{\text{sp}}}}\), and the BN constraint term \(\Vert \gamma _{ad}\Vert _1\).

First, the FLOPs constraint term \({\text{loss}}_{{{\text{FLOPs}}}}\) is explained. In channel pruning, the change of FLOPs is discontinuous, which makes it difficult to optimize by gradient descent. Channel pruning is therefore converted into a continuous process with the assistance of the expression parameters.

In detail, FLOPs are calculated based on the expected number of output channels \({\mathbf{E}}^{n} [{\text{Out}}]\). The \({\mathbf{E}}^{n} [{\text{Out}}]\) for the \(n^{{{\text{th}}}}\) Conv-BN-ad is computed as:

$$\mathbf{E}^{n}[\text{Out}] = \sum _{k=1}^{G} \mathbb {R}_k \cdot \text{Channel}_k$$
(3)

where \({\text{Channel}}_{{\text{k}}}\) is the number of channels in \(k^{{{\text{th}}}}\) group, and G is the number of output channel groups.

For a Conv-BN structure that does not contain an adjunct layer, \({\mathbf{E}}^{n} [{\text{Out}}]\) is its actual number of channels.

After determining the \({\mathbf{E}}^{n} [{\text{Out}}]\), the FLOPs of the \(n^{{{\text{th}}}}\) convolutional layer are calculated as:

$$\mathbf{E}^{n}[F] = \frac{\mathbf{E}^{n-1}[\text{Out}]}{\text{groups}} \times \mathbf{E}^{n}[\text{Out}] \times W_{\text{Out}}^{n} \times H_{\text{Out}}^{n} \times K^{n} \times K^{n}$$
(4)

in which \(W_{{{\text{Out}}}}^{{\text{n}}}\) and \(H_{{{\text{Out}}}}^{{\text{n}}}\) are the width and height of the output feature map, \(K^{n}\) is the size of the convolutional kernel, and groups denotes the number of groups for grouped convolution.
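A minimal sketch of Eqs. 3 and 4, reusing the attributes of the hypothetical `AdjunctLayer` introduced earlier, could look like this:

```python
def expected_out_channels(adjunct) -> "torch.Tensor":
    """E^n[Out] from Eq. 3: sum of R_k times the size of each group."""
    r = adjunct.expr.clamp(0.0, 1.0)
    return (r * adjunct.group_size).sum()

def expected_flops(e_out_prev, e_out, w_out, h_out, kernel_size, groups=1):
    """E^n[F] from Eq. 4 for one convolutional layer."""
    return (e_out_prev / groups) * e_out * w_out * h_out * kernel_size ** 2
```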

Given the expected FLOPs of each convolutional layer, the total expected FLOPs \(\textbf{E}[F]\) over the set Q of all convolutional layers in the residual blocks is:

$$\begin{aligned} \textbf{E}[F] = \sum _{n=1}^{Q} \textbf{E}^{n}[F] \end{aligned}$$
(5)

Then \({\text{loss}}_{{{\text{FLOPs}}}}\) is introduced as Eq. 6. The value of \(\textbf{E}[F]\) is typically excessive compared to the other loss terms, so the L2 norm of the original FLOPs of the network, \(\textbf{Ori}[F]\), is introduced to normalize \(\textbf{E}[F]\).

$$\text{loss}_{\text{FLOPs}} = \frac{\mathbf{E}[F]}{\left\| \mathbf{Ori}[F] \right\| _2}$$
(6)

Secondly, to prevent updates of the BN layer weights from offsetting the constraints on the expression parameters, the BN constraint term \(\Vert \gamma _{ad}\Vert _1\) is introduced, where \(\gamma _{ad}\) denotes the weights of the BN layers in the Conv-BN-ad structures and \(\Vert \cdot \Vert _1\) denotes the L1 norm.

Thirdly, the sparse term \({\text{loss}}_{{{\text{sp}}}}\) is explained. In some cases \({\text{loss}}_{{{\text{FLOPs}}}}\) pushes an expression parameter to a tiny value instead of zero, which is not desired. Inspired by the work of Li et al. [48], \({\text{loss}}_{{{\text{sp}}}}\) is introduced to create sparsity in the expression parameters.

$$\text{loss}_{\text{sp}} = - \sum _{n=1}^{Q} \left\| \mathbb {R}^{n} - \bar{\mathbb {R}}^{n} \right\| _1$$
(7)

Equation 7 describes \({\text{loss}}_{{{\text{sp}}}}\), where \(\bar{\mathbb {R}}^{n}\) is the average of \(\mathbb {R}^{n}\). The \({\text{loss}}_{{{\text{sp}}}}\) term forces the expression parameters to increase their distance from each other; in this study, each expression parameter is pushed toward 0 or 1.

$$\text{loss}_{\text{GC}} = \text{loss}_{\text{cls}} + \frac{\mathbf{E}[F]}{\left\| \mathbf{Ori}[F] \right\| _2} - \sum _{n=1}^{Q} \left\| \mathbb {R}^{n} - \bar{\mathbb {R}}^{n} \right\| _1 + \left\| \gamma _{ad} \right\| _1$$
(8)

Accordingly, the proposed loss function \({\text{loss}}_{{{\text{GC}}}}\) is expressed as Eq. 8, where \({\text{loss}}_{{{\text{cls}}}}\) is the cross-entropy loss for classification. For important channels, \({\text{loss}}_{{{\text{cls}}}}\) opposes \({\text{loss}}_{{{\text{FLOPs}}}}\), so that their expression parameters decrease slowly. The coefficients of each term are detailed in Sect. 4.
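A possible way to assemble \({\text{loss}}_{{{\text{GC}}}}\) from its terms is sketched below; the interface and the weighting factors are our assumptions (the sparse and BN factors follow Sect. 4.1, and a weight of 1 is assumed for the FLOPs term, which Eq. 8 leaves implicit).

```python
def loss_gc(loss_cls, e_flops, orig_flops_norm, expr_params, gamma_ad,
            w_sp=0.05, w_bn=1e-3):
    """Sketch of loss_GC (Eq. 8). `expr_params` is a list of per-layer
    expression-parameter tensors; `gamma_ad` collects the BN weights of all
    Conv-BN-ad structures; `orig_flops_norm` is the L2 norm of the original FLOPs."""
    # FLOPs constraint term (Eq. 6): expected FLOPs, normalized.
    loss_flops = e_flops / orig_flops_norm
    # Sparse term (Eq. 7): push expression parameters away from their mean.
    loss_sp = -sum((r - r.mean()).abs().sum() for r in expr_params)
    # BN constraint term: L1 norm of the adjunct Conv-BN-ad BN weights.
    loss_bn = gamma_ad.abs().sum()
    return loss_cls + loss_flops + w_sp * loss_sp + w_bn * loss_bn
```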

With the proposed \({\text{loss}}_{{{\text{GC}}}}\), the expression parameters of the less important channels are decreased. Note that during training some terms of \({\text{loss}}_{{{\text{GC}}}}\) are disabled to achieve better results; this process corresponds to the steps before line 12 of Algorithm 1. First, the expression parameters are initialized with random numbers from a standard normal distribution, and the network with adjunct layers is pre-trained from scratch. Next, the network is trained with \({\text{loss}}_{{{\text{GC}}}}\). During the learning rate warmup, \({\text{loss}}_{{{\text{sp}}}}\) is turned off to prevent expression parameters from being incorrectly pushed to zero. When the percentage of closed channels reaches the preset \(P_{{{\text{target}}}}\), \({\text{loss}}_{{{\text{FLOPs}}}}\) is turned off to fine-tune the expression parameters. In addition, if the resulting accuracy is low, the network is briefly fine-tuned.

Algorithm 1 Algorithm of the proposed approach

3.2 Start-up latency reduction

Channel pruning is achieved in the GC stage, but the start-up latency problem is not yet resolved. Therefore, the SLR stage prunes the redundant residual blocks. Because the entire main path of a residual block is removed, this operation is very effective for start-up latency optimization, but it can also damage the network. Benefiting from \({\text{loss}}_{{{\text{GC}}}}\), the contribution of the main path is weakened, so the damage of removing it is also reduced.

We propose \({\text{Effect}}\) to represent the contribution of the main path of a residual block. \({\text{Effect}}\) is measured as follows: (1) 1000 images are randomly selected from the training set as the subset \(\left\{ In_1, In_2, In_3,...,In_n \right\}\). (2) The subset is fed into the network, and the output of the main path of each residual block is recorded; \(\textbf{M}^{C}_d\) denotes the mean value of the \(C^{{{\text{th}}}}\) channel output when \(In_d\) is fed. (3) If the outputs of a channel are always similar, it is considered to have little feature extraction capability; thus, the contribution of a channel can be represented by the variance of \(\textbf{M}^{C}_d\). The \({\text{Effect}}\) of the \(N^{{{\text{th}}}}\) residual block is then determined as:

$$\text{Effect}^{N} = \frac{1}{C_{\text{max}}} \sum _{C=1}^{C_{\text{max}}} \text{Var}\left( \mathbf{M}^{C}_1, \mathbf{M}^{C}_2, \mathbf{M}^{C}_3, \ldots , \mathbf{M}^{C}_d \right)$$
(9)

where \({\text{Var}}\left( \cdot \right)\) denotes the variance operator and \(C_{{{\text{max}}}}\) is the total number of channels of this residual block. The residual block with the lowest \({\text{Effect}}^{{\text{N}}}\) should be removed.
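A sketch of the \({\text{Effect}}\) measurement is given below; collecting the input of each residual block (e.g., via forward hooks) is omitted, and the loader is assumed to yield the block inputs of the 1000-image subset.

```python
import torch

@torch.no_grad()
def block_effect(main_path, loader, device="cpu"):
    """Sketch of Eq. 9: Effect of one residual block's main path.

    `main_path` maps the block input to the main-path output; `loader`
    yields (block_input, _) pairs for the 1000-image subset.
    """
    means = []                                    # M^C_d for each image d
    for block_in, _ in loader:
        out = main_path(block_in.to(device))      # (N, C, H, W)
        means.append(out.mean(dim=(2, 3)))        # per-channel spatial mean
    m = torch.cat(means, dim=0)                   # (D, C)
    return m.var(dim=0, unbiased=False).mean()    # average per-channel variance
```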

To minimize the loss of accuracy due to residual block removal, two solutions are adopted: (1) fine-tuning the network after the block is removed; (2) expanding the convolutional channels in the adjacent residual blocks. Expanding the channels of a convolutional layer can improve accuracy [49], and when the number of channels being parallelized is smaller than the number of cores, hardware utilization is low. Thus, expanding the convolutional layer can improve accuracy without increasing latency. In this study, convolutional channels are expanded until the number of channels is a multiple of the number of cores, and the parameters of the expanded channels are duplicated from randomly chosen existing channels. Only layers that have been pruned in the GC stage are expanded.
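The channel expansion could be sketched as follows for a single weight tensor; handling the matching input channels of the following layer is omitted, and the rounding target of 8 reflects the core counts used in the experiments.

```python
import torch

def expand_to_multiple(conv_weight: torch.Tensor, multiple: int = 8) -> torch.Tensor:
    """Pad the output channels of a conv weight up to a multiple of the core
    count by duplicating randomly chosen existing channels (sketch)."""
    c_out = conv_weight.shape[0]
    target = -(-c_out // multiple) * multiple           # round up to a multiple
    if target == c_out:
        return conv_weight
    extra = torch.randint(0, c_out, (target - c_out,))  # random source channels
    return torch.cat([conv_weight, conv_weight[extra]], dim=0)
```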

Removing residual blocks is an iterative process, described after line 12 of Algorithm 1. First, the \({\text{Effect}}\) of each residual block is calculated, and the residual block with the smallest \({\text{Effect}}\) is selected as L. Second, the residual block preceding L is expanded. Finally, L is removed and the new network is fine-tuned. These steps are repeated until the accuracy after fine-tuning falls below the lower limit. The final result is selected based on both latency reduction and accuracy.

4 Experiments

4.1 Experimental configuration

Table 1 Experimental platform specifications

\(Datasets \ and \ network \ details\): the proposal is used to optimize ResNet-20, ResNet-56 [20], and MobileNetV2 [51]. Considering the constraints of edge devices, large-scale datasets are not suitable for the evaluation; thus CIFAR10 and CIFAR100 [52] are adopted as the experimental datasets. CIFAR10 contains 50,000 training images and 10,000 test images of size 32\(\times\)32. The standard data augmentation [53] is adopted in the experiments: 4-pixel padding, random 32\(\times\)32 cropping, and random horizontal flipping. ResNet-20 and ResNet-56, which are specialized for small-image recognition, are used in this experiment. In ResNet-20 and ResNet-56, when the output dimension of the residual connection increases, the DownsampleA layer is adopted: DownsampleA reduces the feature map size with an average pooling layer and pads the expanded feature dimension with zeros. Moreover, for MobileNetV2, the stride of the first convolutional layer is changed from 2 to 1 to fit the small input size of CIFAR10 and CIFAR100.
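The described augmentation corresponds to the usual torchvision pipeline sketched below; the normalization statistics are common CIFAR-10 values and are an assumption, not taken from the paper.

```python
import torchvision.transforms as T

# Standard CIFAR augmentation described above; the normalization constants
# are typical CIFAR-10 statistics and are an assumption.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),        # 4-pixel padding + random 32x32 crop
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```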

\(Architecture \ modification \ strategy\): our modification target is the residual block. For ResNet, each residual block consists of two 3\(\times\)3 convolutional layers, each followed by one BN layer and one ReLU layer. For MobileNetV2, each residual block consists of three convolutional layers; the first and third have kernel size 1\(\times\)1, while the second is a Depthwise Convolution layer, which has the same number of input and output channels and whose filters each process only one channel.

For ResNet, the adjunct layer is inserted after the first Conv-BN structure in each residual block, and its channels are divided into 20 groups. For MobileNetV2, the adjunct layer is inserted after the Depthwise Convolution layer and the second BN layer in each residual block; considering the characteristics of Depthwise Convolution, each channel of the adjunct layer forms its own group, and the input and output channels of the Depthwise Convolution layer are pruned simultaneously. In the SLR stage, residual blocks containing a convolutional layer with a stride of 2 are not removed. Considering the number of cores of all processors, the channel expansion target is set to a multiple of 8.

\(Training \ configuration\): for all training in the experiments, a 5-epoch linear warmup is adopted. The GC stage runs for 1000 epochs; for the network parameters, the optimizer uses a cosine learning rate schedule, a momentum of 0.9, and a weight decay of 0.0005, while the learning rate of the expression parameters is kept constant at 0.02. In the proposed \({\text{loss}}_{{{\text{GC}}}}\) function, the factor of the \(\Vert \gamma _{ad}\Vert _1\) term is 0.001, and the factor of the \({\text{loss}}_{{{\text{sp}}}}\) term is 0.05 for ResNet-56 and 0.001 for the lightweight networks MobileNetV2 and ResNet-20.
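A possible PyTorch realization of this configuration is sketched below; the base learning rate and the parameter-name filter are assumptions, and the 5-epoch linear warmup is omitted for brevity.

```python
import math
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, base_lr: float = 0.1, epochs: int = 1000):
    """Sketch of the training setup in Sect. 4.1; base_lr is an assumption."""
    expr = [p for n, p in model.named_parameters() if "expr" in n]
    rest = [p for n, p in model.named_parameters() if "expr" not in n]
    opt = torch.optim.SGD([
        {"params": rest, "lr": base_lr, "momentum": 0.9, "weight_decay": 5e-4},
        {"params": expr, "lr": 0.02},                      # kept constant below
    ])
    cosine = lambda e: 0.5 * (1 + math.cos(math.pi * e / epochs))
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=[cosine, lambda e: 1.0])
    return opt, sched
```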

\(Deployment \ details\): all training is performed on an Nvidia GeForce RTX 3080 Ti GPU with PyTorch. The optimized networks are implemented on GAP8, FPGA ZCU102, a desktop CPU (Intel i7-9700), and a Raspberry Pi 4 Model B. The specifications of each platform are shown in Table 1.

The official deployment flow of each device is adopted in the experiments. For deployment on GAP8, the optimized networks are quantized to INT8 and converted to C code by NNTOOL; the C code is then optimized by AutoTiler, which provides a variety of optimized kernels, including convolutional and Depthwise Convolution layers. Finally, GVSoC compiles the C code and simulates the behavior of the network on GAP8. The latency is evaluated by measuring the working time of the cluster cores. If a network cannot be deployed due to memory capacity, it is divided into multiple segments, and the sum of the segment latencies is taken as the expected latency; according to our tests, the expected latency is about 5\(\%\) longer than the actual latency.

For deployment on the FPGA Zynq UltraScale+ MPSoC ZCU102, the network is first quantized to INT8, fine-tuned on a small amount of training data, and compiled to the Xmodel format by the Vitis-AI deployment environment. Then the pre-built image is launched from external storage, and the Xmodel file is run on the ZCU102. Latency is obtained by measuring the software execution time on the ZCU102 with the Vitis AI Profiler [54].

For the Intel i7-9700 desktop CPU and the Raspberry Pi 4 Model B, the optimized networks are implemented with PyTorch without quantization. Latency is measured with the PyTorch Profiler; each measurement consists of 50 warmups and 50 inferences, with the average time taken as the latency.
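The measurement protocol could look like the following sketch, which uses simple wall-clock timing instead of the PyTorch Profiler for brevity.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_shape=(1, 3, 32, 32), warmup=50, runs=50):
    """Simple wall-clock sketch of the measurement protocol: 50 warmups,
    then the average of 50 timed inferences (the paper uses the PyTorch Profiler)."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):                       # warmup runs are discarded
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs   # average latency in seconds
```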

Moreover, the original DownsampleA layer is not supported by AutoTiler and Vitis-AI, so it is replaced with an equivalent operation before deployment. For the GAP8 platform, the expanded dimension in the DownsampleA layer is padded with the input multiplied by \(10^{-12}\). For the ZCU102 platform, the input is processed by a Depthwise Convolution layer with zero weights, and its output is used to pad the expanded dimension. There is no difference in network output between these three implementations.
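For reference, the DownsampleA layer described in Sect. 4.1 can be sketched as the classic average-pool-plus-zero-padding shortcut; this is a generic reconstruction, not the exact code used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleA(nn.Module):
    """Sketch: average-pool to reduce the spatial size, then zero-pad the
    channel dimension up to the target width (as described in Sect. 4.1)."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=1, stride=stride)
        self.extra = out_channels - in_channels

    def forward(self, x):
        x = self.pool(x)
        # Pad the expanded channel dimension with zeros.
        return F.pad(x, (0, 0, 0, 0, 0, self.extra))
```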

Table 2 Results of the three stages of the experiment

4.2 Experimental results

To reflect the optimization of each stage, results obtained with the GC stage only and with GC+SLR are presented in Tables 2 and 3. The experimental results are first evaluated in terms of FLOPs reduction, as shown in Table 2. As can be seen, the GC+SLR method is effective for FLOPs compression. The best result is achieved on MobileNetV2, where FLOPs are reduced by 73.09% for CIFAR10 and 71.30% for CIFAR100 with about a 1% accuracy drop.

Furthermore, the experimental results show that many residual blocks are removed with only a minor decrease in accuracy by the SLR method. For MobileNetV2, 5 residual blocks are removed with no degradation in accuracy on CIFAR10; for ResNet-56, 10 residual blocks are removed while the accuracy of the network rises by 0.02% on CIFAR10.

Next, the experimental results are analyzed in terms of latency reduction. The optimized networks are deployed on the edge devices, and Table 3 shows the measured latency. As can be observed, the latency of the networks is significantly reduced by GC+SLR, with a maximum reduction of 70.40%. The acceleration effect is most obvious on GAP8, where latency is reduced by an average of 54.53%. Even for the lightweight network ResNet-20, latency is reduced by 46.16% on CIFAR10. It is also noted that the original ResNet-56 and MobileNetV2 could not be deployed on GAP8, whereas all the optimized networks are deployed properly.

More importantly, a significant latency reduction is realized in the SLR stage. On the desktop CPU platform, the GC+SLR method achieves a higher latency reduction than the GC method, 26.21% higher on CIFAR10 and 37.68% higher on CIFAR100. On the ZCU102 platform, the GC+SLR method further reduces latency by 24.02% compared to the GC method. On the Raspberry Pi 4 Model B, the latency of ResNet-56 on CIFAR10 is further reduced by 28.65% by the GC+SLR method.

The acceleration effect of the SLR stage comes not only from the reduction of FLOPs but also from the reduction of start-up latency. For MobileNetV2 on CIFAR10, the GC+SLR method reduces FLOPs by only 0.22% (0.06M) more than the GC method, yet it further reduces latency by 29.38% on the desktop CPU, 8.24% on the Raspberry Pi 4 Model B, and 6.06% on the ZCU102. For ResNet-56 on CIFAR10, latency is reduced by 21.82% (desktop CPU) and 21.90% (ZCU102) by the GC method, versus 47.28% (desktop CPU) and 43.25% (ZCU102) by the GC+SLR method, even though GC+SLR reduces FLOPs by only 6.17% more than GC. Overall, the SLR method contributes little additional FLOPs reduction but substantial latency reduction. These results demonstrate that start-up latency is significant and that the SLR method is effective in optimizing it.

Table 3 Latency of the optimized network on each platform

4.3 Comparison

Table 4 Comparison in terms of accuracy and FLOPs on CIFAR-10

The proposal is compared with other channel pruning methods. Since most works do not publish the detailed architecture of the compressed model, it is difficult to exactly reproduce their results; therefore, the comparison is made in terms of accuracy and FLOPs. The results of ResNet-20 and ResNet-56 on CIFAR10 are compared in Table 4. For ResNet-56, the network optimized by the proposal has a higher accuracy than most other methods, only 0.14% lower than LFPC. In terms of FLOPs compression, the proposed method is higher than the other approaches, with at most 23.87% more reduction in FLOPs. For ResNet-20, the network optimized by the proposal achieves about 1% higher accuracy than the other methods, while its FLOPs compression ratio is only about 5% lower.

5 Conclusion

We propose a generic deep learning architecture optimization method to achieve further latency reduction of CNNs on edge devices. We analyze the low-level operations of the convolutional layer and find that existing channel pruning methods have a limited effect on start-up latency. The network is therefore optimized in two proposed stages. In the Global Constraint stage, constraints are applied to the main path of each residual block, and the low-importance channels are pruned. Next, in the Start-up Latency Reduction stage, the redundant residual blocks are identified and pruned with minimal accuracy drop. The optimized CNNs are deployed on a desktop CPU, a Raspberry Pi 4 Model B, GAP8, and FPGA ZCU102. The experimental results show that up to 70.40% of latency is reduced. Furthermore, we demonstrate that the reduction in latency is not due to the reduction in FLOPs alone.