Abstract
In the promising Artificial Intelligence of Things technology, deep learning algorithms are implemented on edge devices to process data locally. However, high-performance deep learning algorithms come with increased computation and parameter storage costs, which makes it difficult to implement large deep learning algorithms on memory- and power-constrained edge devices such as smartphones and drones. Various compression methods, such as channel pruning, have therefore been proposed. Our analysis of low-level operations on edge devices shows that existing channel pruning methods have limited effect on latency optimization. Because of data processing operations, the pruned residual blocks still incur significant latency, which hinders real-time processing of CNNs on edge devices. Hence, we propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The network is optimized in two stages, Global Constraint and Start-up Latency Reduction, and both channels and residual blocks are pruned. Optimized networks are evaluated on desktop CPU, FPGA, ARM CPU, and PULP platforms. The experimental results show that latency is reduced by up to 70.40%, which is 13.63% higher than applying channel pruning alone, and that real-time processing is achieved on the edge device.
1 Introduction
The Artificial Intelligence of Things (AIoT), a promising integrated technology that combines artificial intelligence and the Internet of Things, is drawing significant interest [1]. However, feedback from AIoT systems often suffers from unacceptable latency because of limited network bandwidth and unstable communication [2, 3]. The current trend is therefore to implement deep learning algorithms on edge devices that process the raw data close to the data source [4].
Convolutional neural networks (CNNs) are highly regarded among deep learning technologies due to their impressive performance in a variety of applications such as object recognition [5,6,7], healthcare [8, 9], image generation [10] and anomaly detection [11, 12]. However, the development of CNNs is accompanied by increasing memory usage and computational complexity, whereas edge devices are heavily constrained in computation power, memory bandwidth, and power consumption [13]. As a result, the latency of a full CNN on an edge device is normally unacceptable.
Compression and quantization are commonly required before deploying CNN-based applications onto edge devices [14]. CNN compression methods include channel pruning [15], knowledge distillation [16, 17], and matrix decomposition, among others. Channel pruning methods aim to identify less important channels (i.e., filters) and remove them.
Although channel pruning works well at reducing floating point operations (FLOPs), it has limited effect on latency optimization. We divide the low-level operations of the convolutional layer into matrix–vector multiplication (MVM) operations and data processing operations, as shown in Fig. 1. The MVM operations mainly come from the convolution between the filters and the inputs. They account for most of the FLOPs [18] and are therefore the main compression target of channel pruning. Data processing operations are performed before and after the MVM operations, including padding of feature maps, rearrangement of the input feature map (Im2col), re-quantization of the output, and storage of results [19]. The latency caused by data processing operations is defined as start-up latency.
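As an illustration of the data processing that precedes the MVM operations, the following minimal sketch rearranges a feature map into a patch matrix (Im2col) and then performs the convolution as a matrix–vector product. The function names and the restriction to a single channel, stride 1 and no padding are assumptions made for brevity; they are not taken from any of the deployment libraries discussed in this paper.

```python
def im2col(feature_map, k):
    """Rearrange every k x k patch of a 2-D feature map into one row.

    Assumes a single channel, stride 1 and no padding.
    """
    height, width = len(feature_map), len(feature_map[0])
    rows = []
    for top in range(height - k + 1):
        for left in range(width - k + 1):
            patch = [feature_map[top + dy][left + dx]
                     for dy in range(k) for dx in range(k)]
            rows.append(patch)
    return rows


def conv_as_mvm(feature_map, kernel):
    """Convolve by multiplying the patch matrix with the flattened kernel."""
    k = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    return [sum(p * w for p, w in zip(patch, flat_kernel))
            for patch in im2col(feature_map, k)]
```

In this sketch the work done by `im2col` depends only on the input size and the kernel size, not on how many filters are later applied to the patch matrix.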
Figure 2 shows how latency and FLOPs vary when the output channels of a convolutional layer are pruned. The reduction in latency is clearly less significant than the reduction in FLOPs. Even when only one output channel is left, 80% of the latency remains, which is the limit of channel pruning. Pruning the output channels effectively reduces MVM operations but does not optimize the data processing operations at the inputs. Because of this limitation, significant start-up latency remains after pruning.
To effectively optimize start-up latency, the data processing operations on both the input and output sides should be reduced. However, mainstream pruning strategies have difficulty in achieving this goal. In recent years, a significant number of networks have adopted the design of residual blocks [20]. A residual block consists of two parts: the main path and the residual connection. The main path is composed of multiple weight layers, including convolution, batch normalization (BN), and activation layers. The residual connection adds the input directly to the output of the main path, which requires the input and output tensor shapes to be the same. Consequently, the pruning strategy for residual blocks adopted by most studies [21, 22] keeps all input channels of the first layer and all output channels of the last layer. It is therefore worthwhile to improve this pruning strategy to achieve a further reduction in latency.
This observation motivates us to propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The CNNs are optimized in two stages: Global Constraint (GC) and Start-up Latency Reduction (SLR). The GC stage aims to achieve lossless channel pruning. The main paths are constrained by the adjunct layers, and the expression of redundant channels is blocked. Then, the adjunct layers are equivalently converted into the BN layers to achieve channel pruning. Next, the SLR stage aims to optimize start-up latency. Residual blocks that do not function efficiently because of the constraints are identified and pruned in the SLR stage. Finally, the optimized network is implemented on multiple platforms, and the reduction in latency and FLOPs is evaluated.
The main contributions of the paper are as follows:
- We improve the mainstream pruning strategy to further reduce latency. Experimental results show that this approach reduces latency more than channel pruning alone.
- We propose a general CNN application acceleration approach. The optimized network is deployed on desktop CPU [23], GAP8 [24], FPGA ZCU102 [25], and Raspberry Pi 4 Model B [26] platforms through the official deployment flows. Significant latency reductions are achieved without application-specific optimizations.
The rest of the paper is organized as follows: Sect. 2 introduces related works. Section 3 details the methodology. Section 4 shows the experimental results. Section 5 concludes this paper.
2 Related work
2.1 Channel pruning
Channel pruning is a common method for compressing CNN algorithms. Since the performance of CNNs depends on a huge number of parameters, the challenge of channel pruning is the trade-off between accuracy and compression effect.
To minimize performance loss, removing low-importance channels [27, 28] or feature maps [29,30,31] is a practical solution. Kuang et al. [21] argue that directly measuring the effect of filters on a task-related loss function is a better choice than relying on the magnitude of weights. In detail, the network is randomly pruned several times, and the actual effect of the corresponding filter on the network is calculated with the proposed task-related loss function after each channel is removed. Finally, the channels measured to have low effect are discarded until the FLOPs reach the target, and the pruned network is then fine-tuned.
Some methods [32,33,34,35] introduce a regularization term in the optimization objective to create sparsity in the parameters. Chen et al. [36] propose a collaborative channel pruning method. They found that many previous studies evaluated channel importance based on a single structure only, which may lead to important channels being removed by mistake. Therefore, L1 regularization is applied to the convolutional layer weights and the BN layer weights, respectively, to enhance sparsity. Next, the effect of each channel is evaluated according to both the convolutional layer and the corresponding BN layer, and channels with low effect are removed. Atashgahi et al. [37], inspired by biological brain evolution and Hebbian learning theory, propose a sparse training method based on the behavior of neurons. Specifically, at each epoch, the weights with the smallest magnitude are dropped, and the most important connections to be added are then selected based on the cosine similarity of each pair of neurons in two consecutive layers.
In some methods, channel pruning is converted into another optimization problem. Ding et al. [38] argue that pruning methods degrade the performance of the network because forcing the desired structural sparsity onto the network may change the objective of the optimization, resulting in parameters far from optimal. Thus, a compactor, which is a \(1 \times 1\) convolutional layer, is inserted after each convolutional layer, and channel pruning is converted into the problem of making the compactors sparse. A penalty is applied to the compactors to create sparsity. Then, channel pruning is achieved by parameter merging. Guo et al. [39] propose Differentiable Markov Channel Pruning (DMCP). In DMCP, channel pruning is modeled as a Markov process, where the retention of each channel is considered as a state, and pruning represents the transition between states. The probability of retaining the \((k + 1)^{{{\text{th}}}}\) channel when the \(k^{{{\text{th}}}}\) channel is retained is expressed as a learnable parameter. With this approach, the optimal architecture is explored during training.
However, the input and output channels of the residual block are not modified in these proposals, which means that latency could be further optimized.
2.2 Deployment
We summarize the common deployment flow as the following steps: (1) Network compression, including channel pruning and knowledge distillation. (2) Quantization. Generally, network parameters are quantized to INT8 to reduce memory access latency [40]. (3) Computation node fusion. The network is transformed into a computation graph in which each operation of the original network corresponds to one computation node. Normally, less compute-intensive nodes, such as the BN and activation layers, are fused into convolutional layer nodes to reduce start-up latency. Because of the complexity of convolutional layers, fusing two convolutional layers into one simpler node is not available in existing deployment methods. (4) Compiling. This step includes hardware-level optimizations such as instruction scheduling and memory reuse. Steps (3) and (4) are highly dependent on the CNN acceleration libraries provided by the developer [41].
Each framework and device has official deployment tools provided by the developer; the deployment tools of two edge devices are introduced here. GAP8, developed by GreenWaves Technologies, is an IoT application processor based on RISC-V and the PULP platform [42] that features low power consumption and parallel processing. The developer also provides a deployment toolset, GAP flow [43]. Developers can quantize and deploy CNNs using the deployment tools NNTOOL [44] and AutoTiler [45] and simulate them on GVSoC [46]. The development environment Vitis-AI [25] accelerates AI inference on AMD hardware platforms, including the FPGA Zynq UltraScale+ MPSoC ZCU102 [47]. It includes optimized IP cores, an AI Quantizer for quantizing CNNs, and an AI Compiler for optimizing and compiling the computational graph of CNNs.
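The fusion of a BN node into the preceding convolutional node in step (3) relies on the standard identity that an affine normalization can be absorbed into the weights and bias of the convolution. A minimal per-channel sketch follows; the function name `fold_bn` and the flat-list representation of one filter are illustrative assumptions, not the API of any deployment tool mentioned here.

```python
import math


def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN parameters of one output channel into conv weights and bias.

    BN(conv(x)) = gamma * (w.x + bias - mean) / sqrt(var + eps) + beta
                = (scale * w).x + (scale * (bias - mean) + beta)
    """
    scale = gamma / math.sqrt(var + eps)
    folded_weights = [w * scale for w in weights]
    folded_bias = scale * (bias - mean) + beta
    return folded_weights, folded_bias
```

After folding, the BN node no longer has to be executed as a separate operation at inference time.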
3 Methods
Removing residual blocks from the network with low accuracy loss is a challenge. The output of each residual block is obtained from the main path and the skip connection, and losing the features from either path would cause unacceptable harm. Therefore, the optimization has two stages, GC and SLR. First, the GC stage aims to achieve channel pruning: constraints are added to the main paths, weakening the effect of the redundant main paths. Next, the pruned network is further optimized by the SLR stage to reduce the start-up latency. The SLR stage aims to prune redundant residual blocks. Owing to the constraints of the GC stage, the damage caused by this pruning is minimized. The optimization process is illustrated in Fig. 3.
3.1 Global constraint
3.1.1 Definition of adjunct layer
To convert the optimization of the model structure into an optimizable problem, adjunct layers are introduced. The convolutional layer and the following BN layer are defined as Conv-BN. An adjunct layer is inserted after the Conv-BN to form the Conv-BN-ad structure, as shown in Fig. 4. The insertion location is described in detail in Sect. 4.1. The output channels of the adjunct layer are divided into groups, and the output of each group is managed by a learnable parameter called the expression parameter, which is limited to [0, 1]. With the introduction of the expression parameter, the pruning of a channel can be considered as a continuous variation.
Next, the output of Conv-BN-ad is explained. Let \(\gamma _i\) and \(\beta _i\) denote the weight and bias of the \(i^{{{\text{th}}}}\) channel of the BN layer, let \(\mu _{B}\) and \(\sigma _{B}\) be the batch mean and variance, let \(\epsilon\) be an arbitrarily small constant, and let \(\mathbb {R}_k\) be the expression parameter of the \(k^{{{\text{th}}}}\) group, to which the \(i^{{{\text{th}}}}\) channel belongs. Then the \(i^{{{\text{th}}}}\) output of the adjunct layer is represented as:
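A form consistent with these definitions, under the assumption that the adjunct layer simply scales the BN output of channel \(i\) by the expression parameter of its group, and with \(x_i\) (a symbol introduced here for illustration) denoting the convolution output of that channel, would be:

\[ {\text{Out}}_i = \mathbb {R}_k \left( \gamma _i \frac{x_i - \mu _{B}}{\sqrt{\sigma _{B} + \epsilon }} + \beta _i \right) \]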
To avoid extra cost on the edge device, additional structures should be removed before deployment. The adjunct layers are removed in two steps: parameter merging and channel pruning. First, according to Eq. 1, it is possible to merge \(\mathbb {R}_k\) into the weight and bias of the BN layer. Merged weight \(\hat{\gamma _i}\) and bias \(\hat{\beta _i}\) are obtained as:
Second, channels are pruned according to \(\hat{\gamma _i}\). If the absolute value of \(\hat{\gamma _i}\) is close to zero, then the output of the \(i^{{{\text{th}}}}\) channel can be pruned harmlessly. In the experiment, the channels whose \(|\hat{\gamma _i}|\) is less than \({\text{threshold}}_{{{\text{BN}}}}\) are pruned. The \({\text{threshold}}_{{{\text{BN}}}}\) is set to \(1 \times 10 ^{-3}\). According to our measurements, pruning based on this threshold causes only minor damage to accuracy. Finally, the adjunct layers are removed.
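The two removal steps can be sketched as follows, assuming that the adjunct layer multiplies the BN output of each channel by the expression parameter of its group, so that merging amounts to \(\hat{\gamma _i} = \mathbb {R}_k \gamma _i\) and \(\hat{\beta _i} = \mathbb {R}_k \beta _i\). The helper names, the plain-list representation and the use of the absolute value in the threshold test are illustrative assumptions.

```python
THRESHOLD_BN = 1e-3  # pruning threshold stated in the text


def merge_adjunct(gammas, betas, group_of_channel, expression):
    """Merge each channel's group expression parameter into its BN weight and bias."""
    merged_gammas = [expression[group_of_channel[i]] * g
                     for i, g in enumerate(gammas)]
    merged_betas = [expression[group_of_channel[i]] * b
                    for i, b in enumerate(betas)]
    return merged_gammas, merged_betas


def channels_to_keep(merged_gammas, threshold=THRESHOLD_BN):
    """Return the indices of channels whose merged BN weight is not near zero."""
    return [i for i, g in enumerate(merged_gammas) if abs(g) >= threshold]
```

For example, with two groups of two channels and expression parameters `[1.0, 0.0]`, only the channels of the first group survive.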
3.1.2 Loss function
The GC stage is intended to add constraints to the expression parameters and achieve lossless pruning. To this end, the loss function \({\text{loss}}_{{{\text{GC}}}}\) is proposed. It contains three parts: the FLOPs constraint term \({\text{loss}}_{{{\text{FLOPs}}}}\), the sparse term \({\text{loss}}_{{{\text{sp}}}}\) and the BN constraint term \(\Vert \gamma _{ad}\Vert _1\).
First, the FLOPs constraint term \({\text{loss}}_{{{\text{FLOPs}}}}\) is explained. In channel pruning, the change of FLOPs is discontinuous, which makes it difficult to optimize by gradient descent. Channel pruning is therefore converted into a continuous process with the help of the expression parameters.
In detail, FLOPs are calculated based on the expected number of output channels \({\mathbf{E}}^{n} [{\text{Out}}]\). The \({\mathbf{E}}^{n} [{\text{Out}}]\) for the \(n^{{{\text{th}}}}\) Conv-BN-ad could be computed as:
where \({\text{Channel}}_{{\text{k}}}\) is the number of channels in \(k^{{{\text{th}}}}\) group, and G is the number of output channel groups.
And for the Conv-BN structure that does not contain an adjunct layer, \({\mathbf{E}}^{n} [{\text{Out}}]\) is its actual number of channels.
After determining the \({\mathbf{E}}^{n} [{\text{Out}}]\), the FLOPs of the \(n^{{{\text{th}}}}\) convolutional layer are calculated as:
in which \(W_{{{\text{Out}}}}^{{\text{n}}}\) and \(H_{{{\text{Out}}}}^{{\text{n}}}\) are the width and height of the output feature map, \(K^{n}\) is the size of the convolutional kernel, and groups denotes the number of groups for grouped convolution.
After clarifying the expected FLOPs of each convolutional layer, for the set Q of all convolutional layers in the residual block, the total expected FLOPs \(\textbf{E}[F]\) is:
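The expected-FLOPs computation can be sketched as below. The concrete formulas are assumptions consistent with the definitions above: the expected number of output channels is taken as \(\sum _{k=1}^{G} \mathbb {R}^{n}_{k} \cdot {\text{Channel}}_{{\text{k}}}\), and the FLOPs of a layer as the product of the expected input channels, the expected output channels, the output feature-map area and the squared kernel size, divided by groups. All function and argument names are hypothetical.

```python
def expected_out_channels(expression, channels_per_group):
    """Expected number of output channels of one Conv-BN-ad structure."""
    return sum(r * c for r, c in zip(expression, channels_per_group))


def expected_layer_flops(exp_in, exp_out, w_out, h_out, kernel, groups=1):
    """Expected multiply-accumulate count of one convolutional layer."""
    return exp_in * exp_out * w_out * h_out * kernel * kernel / groups


def expected_block_flops(layers):
    """Sum the expected FLOPs over all convolutional layers of a block.

    Each entry of `layers` is a dict holding the keyword arguments of
    expected_layer_flops.
    """
    return sum(expected_layer_flops(**layer) for layer in layers)
```

Because every quantity is a smooth function of the expression parameters, the total can be differentiated with respect to them, which is what makes the FLOPs constraint usable with gradient descent.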
Then the \({\text{loss}}_{{{\text{FLOPs}}}}\) is introduced as Eq. 6. Here, the value of \(\textbf{E}[F]\) is typically much larger than the other loss terms. Thus, the L2 norm of the original FLOPs \(\textbf{Ori}[F]\) of the network is introduced to normalize \(\textbf{E}[F]\).
Secondly, to prevent updates of the BN layer weights from offsetting the constraints on the expression parameters, the BN constraint term \(\Vert \gamma _{ad}\Vert _1\) is introduced. The \(\gamma _{ad}\) denotes the weights of the BN layer in the Conv-BN-ad and \(\Vert \cdot \Vert _1\) denotes the L1 norm.
Thirdly, the sparse term \({\text{loss}}_{{{\text{sp}}}}\) is explained. There are cases where \({\text{loss}}_{{{\text{FLOPs}}}}\) pushes the expression parameter to a tiny value instead of zero, which is not desired for the proposal. Inspired by the work of Li et al. [48], the \({\text{loss}}_{{{\text{sp}}}}\) is introduced to create sparsity in the expression parameters.
Equation 7 describes the \({\text{loss}}_{{{\text{sp}}}}\), where the \(\bar{\mathbb {R}^{n}}\) is the average of \(\mathbb {R}^{n}\). The \({\text{loss}}_{{{\text{sp}}}}\) term forces the expression parameters to increase their distance from each other. In the study, the expression parameter is pushed to 0 or 1.
Accordingly, the proposed loss function \({\text{loss}}_{{{\text{GC}}}}\) is expressed as Eq. 8. The \({\text{loss}}_{{{\text{cls}}}}\) is the cross-entropy loss for classification. For the important channels, the \({\text{loss}}_{{{\text{cls}}}}\) opposes \({\text{loss}}_{{{\text{FLOPs}}}}\), so that the expression parameter decreases slowly. The coefficients of each term are detailed in Sect. 4.
With the proposed \({\text{loss}}_{{{\text{GC}}}}\), the expression parameters of the less important channels are decreased. Note that during training, some terms of the \({\text{loss}}_{{{\text{GC}}}}\) are disabled to achieve better effects. This process is explained in detail before line 12 of Algorithm 1. First, the expression parameter is initialized with random numbers from a standard normal distribution, and the network is pre-trained with adjunct layers from scratch. Next, the network is trained with \({\text{loss}}_{{{\text{GC}}}}\). In the learning rate warmup stage, \({\text{loss}}_{{{\text{sp}}}}\) is turned off to prevent expression parameters from being incorrectly pushed to zero. When the percentage of closed channels reaches the preset \(P_{{{\text{target}}}}\), the \({\text{loss}}_{{{\text{FLOPs}}}}\) is turned off to fine-tune the expression parameters. In addition, if the accuracy of the result is low, the network is briefly fine-tuned.
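The scheduling of the loss terms described above can be summarised in a small sketch. The function name, the argument names, the term labels and the stateless formulation are assumptions made for illustration; Algorithm 1 gives the procedure actually used.

```python
def active_loss_terms(epoch, warmup_epochs, closed_ratio, p_target):
    """Return the set of loss_GC terms enabled at the current training step.

    closed_ratio is the fraction of channels whose expression parameter is
    already closed; p_target is the preset target fraction.
    """
    terms = {"cls", "bn_l1", "flops", "sp"}
    if epoch < warmup_epochs:
        # The sparse term is off during learning rate warmup.
        terms.discard("sp")
    if closed_ratio >= p_target:
        # Target reached: stop pushing FLOPs down and only fine-tune.
        terms.discard("flops")
    return terms
```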
3.2 Start-up latency reduction
Channel pruning is achieved in the GC stage, but the start-up latency problem is not yet resolved. Therefore, the SLR stage is intended to prune the redundant residual blocks. Because the entire main path of the residual block is removed, this operation is very effective for start-up latency optimization, but it is also damaging to the network. Owing to the \({\text{loss}}_{{{\text{GC}}}}\), the contribution of the main path is weakened. As a result, the damage caused by removing the main path is also reduced.
We propose \({\text{Effect}}\) to represent the contribution of the residual block main path. \({\text{Effect}}\) is measured as follows: (1) First, 1000 images are randomly selected from the training set as the subset \(\left\{ In_1, In_2, In_3,...,In_n \right\}\). (2) The subset is fed into the network, and the output of the main path of the residual block is recorded. \(\textbf{M}^{C}_d\) denotes the mean value of the \(C^{th}\) channel output when \(In_d\) is fed. (3) If the outputs of a channel are always similar, the channel is considered to have little feature extraction capability. Thus, the contribution of a channel can be represented by the variance of \(\textbf{M}^{C}_d\). Then the \({\text{Effect}}\) of the \(N^{{{\text{th}}}}\) residual block is determined as:
where \({\text{Var}}\left( \cdot \right)\) denotes the variance operator, \(C_{{{\text{max}}}}\) is the total number of channels of this residual block. The residual block with the lowest \({\text{Effect}}^{{\text{N}}}\) should be removed.
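A minimal sketch of this measurement follows, assuming that \({\text{Effect}}\) is the average over channels of the variance of the per-image channel means; the exact normalisation, the use of the population variance and the function names are assumptions.

```python
from statistics import pvariance


def block_effect(channel_means):
    """Compute Effect for one residual block.

    channel_means[d][c] is the mean output of channel c of the main path
    when image d of the subset is fed to the network.
    """
    num_channels = len(channel_means[0])
    variances = [pvariance([image[c] for image in channel_means])
                 for c in range(num_channels)]
    return sum(variances) / num_channels


def least_effective_block(effects):
    """Return the index of the residual block with the lowest Effect."""
    return min(range(len(effects)), key=effects.__getitem__)
```

A channel whose mean output is identical for every image contributes zero variance, so a block made only of such channels receives an Effect of zero and becomes the first removal candidate.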
To minimize the loss of accuracy due to the residual block removal, two solutions are adopted: (1) Fine-tuning the network after the block is removed. (2) Expanding the convolutional channels in the adjacent residual blocks. Expanding the channels of a convolutional layer can improve accuracy [49]. Moreover, when fewer channels are processed in parallel than there are cores, hardware utilization is low. Thus, expanding the convolutional layer can improve accuracy without increasing latency. In this study, convolutional channels are expanded until the number of channels is a multiple of the number of cores. The parameters of the expanded channels are duplicated from randomly chosen existing channels. In addition, only layers that have been pruned in the GC stage are expanded.
Removing residual blocks is an iterative process. The SLR stage is described in detail after line 12 of Algorithm 1. Firstly, the \({\text{Effect}}\) of each residual block is calculated, and the residual block with the smallest \({\text{Effect}}\) is selected as L. Secondly, the residual block preceding L is expanded. Finally, L is removed, and the new network is fine-tuned. These steps are repeated until the accuracy after fine-tuning falls below the lower limit. The final result is selected based on both latency reduction and accuracy.
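The expansion rule can be sketched as follows. The function names are hypothetical, each filter is represented by an arbitrary Python object, and the default of 8 cores is only a placeholder for the core count of the target processor.

```python
import copy
import random


def expansion_target(num_channels, num_cores=8):
    """Round the channel count up to the next multiple of the core count."""
    return -(-num_channels // num_cores) * num_cores


def expand_channels(filters, num_cores=8, rng=random):
    """Append deep copies of randomly chosen existing filters until the
    count is a multiple of num_cores. The input list is left unchanged."""
    target = expansion_target(len(filters), num_cores)
    expanded = list(filters)
    while len(expanded) < target:
        expanded.append(copy.deepcopy(rng.choice(filters)))
    return expanded
```

A layer pruned down to 13 channels would be expanded to 16 under this rule, while a layer that already has 16 channels is left as it is.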
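The iteration can be sketched with caller-supplied callbacks standing in for the measurement, expansion and fine-tuning steps. The function name `slr_stage`, its arguments and the simplification that the preceding block is looked up in the list of removable blocks are assumptions for illustration.

```python
def slr_stage(blocks, measure_effect, expand_prior, fine_tune, min_accuracy):
    """Iteratively remove the least effective residual block.

    blocks: list of removable block identifiers, in network order.
    measure_effect(block) -> float, expand_prior(block) -> None and
    fine_tune(blocks) -> accuracy are caller-supplied callbacks.
    Returns the history of (remaining blocks, accuracy) candidates.
    """
    remaining = list(blocks)
    history = []
    while remaining:
        target = min(remaining, key=measure_effect)
        index = remaining.index(target)
        if index > 0:
            expand_prior(remaining[index - 1])  # widen the preceding block
        remaining.remove(target)
        accuracy = fine_tune(remaining)
        if accuracy < min_accuracy:
            break  # accuracy fell below the lower limit
        history.append((list(remaining), accuracy))
    return history
```

The returned history holds every candidate that stayed above the accuracy limit, from which a final network can be chosen by weighing latency reduction against accuracy.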
4 Experiments
4.1 Experimental configuration
Datasets and network details: the proposal is used to optimize ResNet-20, ResNet-56 [20] and MobileNetV2 [51]. Considering the limitations of edge devices, large-scale datasets are not suitable for the evaluation. Thus, CIFAR10 and CIFAR100 [52] are adopted as the experimental datasets. CIFAR10 contains 50,000 training images and 10,000 test images with a size of 32\(\times\)32. The standard data augmentation [53] is adopted in the experiments: 4-pixel padding, cropping at a random 32\(\times\)32 location, and random horizontal flipping. The ResNet-20 and ResNet-56 used in this experiment are the variants specialized for small image recognition. In ResNet-20 and ResNet-56, when the dimension at the output end of the residual connection increases, the DownsampleA layer is adopted. In detail, DownsampleA reduces the feature map size with an average pooling layer and pads the feature dimension with 0. Moreover, for MobileNetV2, the stride of the first convolutional layer is modified from 2 to 1 to fit the small input size of CIFAR10 and CIFAR100.
Architecture modification strategy: our modification target is the residual block. For ResNet, each residual block consists of two 3\(\times\)3 convolutional layers, each followed by one BN layer and one ReLU layer. For MobileNetV2, each residual block consists of three convolutional layers, of which the first and third have kernel size 1\(\times\)1. The second one is called the Depthwise Convolution layer; it has the same number of input and output channels, and each filter processes only one channel.
For ResNet, the adjunct layer is inserted after the first Conv-BN structure in each residual block, and its channels are divided into 20 groups. For MobileNetV2, the adjunct layer is inserted after the Depthwise Convolution layer and the second BN in each residual block. Considering the nature of Depthwise Convolution, each channel of the adjunct layer forms its own group. In addition, the input and output channels of the Depthwise Convolution layer are pruned simultaneously. In the SLR stage, a residual block containing a convolutional layer with a stride of 2 is not removed. Considering the number of cores of all processors, the channel expansion target is set to a multiple of 8.
Training configuration: for all training in the experiment, a 5-epoch linear warmup is adopted. The GC process runs for 1000 epochs. For the optimizer of the network parameters, a cosine learning rate schedule, a momentum of 0.9, and a weight decay of 0.0005 are employed. For the expression parameter, the learning rate is constant at 0.02. In the proposed \({\text{loss}}_{{{\text{GC}}}}\) function, the factor of the \(\Vert \gamma _{ad}\Vert _1\) term is 0.001, and the factor of the \({\text{loss}}_{{{\text{sp}}}}\) term is 0.05 for ResNet-56 and 0.001 for the lightweight networks MobileNetV2 and ResNet-20.
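The augmentation pipeline can be sketched on a single-channel image stored as a list of rows. The function name, the zero padding value and the flip probability of 0.5 are assumptions; the experiments themselves rely on the standard augmentation of [53].

```python
import random


def augment(image, pad=4, rng=random):
    """Apply zero padding, a random crop back to the original size and a
    random horizontal flip to a 2-D image given as a list of rows."""
    height, width = len(image), len(image[0])
    padded_width = width + 2 * pad
    blank = [0] * padded_width
    padded = ([list(blank) for _ in range(pad)]
              + [[0] * pad + list(row) + [0] * pad for row in image]
              + [list(blank) for _ in range(pad)])
    top = rng.randint(0, 2 * pad)    # inclusive bounds
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + width] for row in padded[top:top + height]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop
```

The output always has the same size as the input, so a 32\(\times\)32 CIFAR image stays 32\(\times\)32 after augmentation.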
Deployment details: all the training processes are performed on an Nvidia GeForce RTX 3080 Ti GPU with PyTorch. The optimized networks are implemented on GAP8, FPGA ZCU102, desktop CPU i7-9700, and Raspberry Pi 4 Model B. Specifications of each platform are shown in Table 1.
The official deployment flow of each device is adopted in the experiments. For the deployment on GAP8, the optimized networks are quantized to INT8 and converted to C code by NNTOOL. The C code is then optimized by AutoTiler, which provides a variety of optimized algorithms, including those for the convolutional layer and the Depthwise Convolution layer. Finally, GVSoC compiles the C code and simulates the behavior of the network on the GAP8. The latency is evaluated by measuring the working time of the cluster cores. If the network cannot be deployed because of memory capacity, it is divided into multiple segments, and the sum of the segment latencies is taken as the expected latency. According to tests, the expected latency is about 5\(\%\) longer than the actual latency.
For the deployment on the FPGA Zynq UltraScale+ MPSoC ZCU102, the network is first quantized to INT8, fine-tuned on a small amount of training set data, and compiled to the Xmodel format by the Vitis-AI deployment environment. Then, the pre-built image is launched from external storage and the Xmodel file is run on the ZCU102. Latency is obtained by measuring the software execution time on the ZCU102 with the Vitis AI Profiler [54].
For the Intel i7-9700 desktop CPU platform and the Raspberry Pi 4 Model B, the optimized network is implemented in PyTorch without quantization. Latency is measured by PyTorch Profiler. Each measurement consists of 50 warmups and 50 inferences, with the average time taken as the latency.
Moreover, the original DownsampleA layer is not supported by AutoTiler and Vitis-AI. Thus, it is replaced with an equivalent operation before deployment. For the GAP8 platform, the expanded dimension in the DownsampleA layer is padded with the input multiplied by \(10^{-12}\). For the ZCU102 platform, the input is processed by a Depthwise Convolution layer with zero weights and its output is used to pad the expanded dimension. There is no difference in network output between these three implementations.
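The warmup-then-average protocol can be sketched with a plain wall-clock timer. This sketch substitutes `time.perf_counter` for PyTorch Profiler, and the function name `measure_latency` and the callable `run_inference` are hypothetical.

```python
import time


def measure_latency(run_inference, warmups=50, runs=50):
    """Return the mean wall-clock time in seconds of `runs` inferences,
    measured after `warmups` untimed calls."""
    for _ in range(warmups):
        run_inference()
    start = time.perf_counter()
    for _ in range(runs):
        run_inference()
    return (time.perf_counter() - start) / runs
```

The untimed warmup calls keep one-off costs such as cache warming and lazy initialisation out of the reported average.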
4.2 Experimental results
To reflect the optimization of each stage, the results obtained with the GC stage only and with the GC+SLR stages are presented in Tables 2 and 3. First, the experimental results are evaluated in terms of the reduction of FLOPs, as shown in Table 2. As can be seen, the GC+SLR method is effective for FLOPs compression. The best result is achieved with MobileNetV2, for which FLOPs are reduced by 73.09% on CIFAR10 and 71.30% on CIFAR100 with an accuracy drop of about 1%.
Furthermore, the experimental results show that many residual blocks are removed by the SLR method with only a minor decrease in accuracy. In MobileNetV2, 5 residual blocks are removed with no degradation in accuracy on CIFAR10. For ResNet-56, 10 residual blocks are removed, while the accuracy of the network rises by 0.02% on CIFAR10.
Next, the experimental results are analyzed in terms of latency reduction. The optimized networks are deployed on the edge devices to evaluate the latency, and Table 3 shows the results. As can be observed, the latency of the network is significantly reduced by GC+SLR, with a maximum latency reduction of 70.40%. The acceleration effect is most pronounced on the GAP8, where latency is reduced by an average of 54.53%. Even for the lightweight network ResNet-20, latency is reduced by 46.16% on CIFAR10. Note that the original ResNet-56 and MobileNetV2 could not be deployed on GAP8, while all the optimized networks are deployed properly.
More importantly, a significant latency reduction is realized in the SLR stage. On the desktop CPU platform, the GC+SLR method achieves a higher latency reduction than the GC method, 26.21% higher on CIFAR10 and 37.68% higher on CIFAR100. On the ZCU102 platform, the GC+SLR method further reduces latency by 24.02% compared to the GC method. On the Raspberry Pi 4 Model B, the latency of ResNet-56 on CIFAR10 is further reduced by 28.65% by the GC+SLR method.
The acceleration effect of the SLR stage is caused not only by the reduction of FLOPs but also by the reduction of start-up latency. For MobileNetV2 on CIFAR10, the GC+SLR method reduces FLOPs by only 0.22% (0.06M) more than the GC method, but further reduces latency by 29.38% on desktop CPU, 8.24% on Raspberry Pi 4 Model B, and 6.06% on ZCU102. For the optimization of ResNet-56 on CIFAR10, latency is reduced by 21.82% (desktop CPU) and 21.90% (ZCU102) by the GC method, while it is reduced by 47.28% (desktop CPU) and 43.25% (ZCU102) by the GC+SLR method. It should be noted that the GC+SLR method reduces FLOPs by only 6.17% more than the GC method. Overall, the SLR method is not effective in FLOPs reduction, but it is effective in latency reduction. These results demonstrate that start-up latency is a serious problem and that the SLR method is effective in optimizing it.
4.3 Comparison
The proposal is compared with other channel pruning methods. Since most works do not publish the detailed architecture of the compressed model, it is difficult to exactly reproduce their results. Therefore, the proposal is compared with other methods in terms of accuracy and FLOPs. The results of ResNet-20 and ResNet-56 on CIFAR10 are compared in Table 4. For ResNet-56, the network optimized by the proposal has a higher accuracy than most other methods and is only 0.14% lower than LFPC. In terms of FLOPs compression, the proposed method outperforms the other methods, with up to 23.87% more reduction in FLOPs. For ResNet-20, the network optimized by the proposal achieves about 1% higher accuracy compared to other methods, while the FLOPs compression ratio is only about 5% lower.
5 Conclusion
We propose a generic deep learning architecture optimization method to achieve further latency reduction of CNNs on edge devices. We analyze the low-level operations of the convolutional layer and find that existing channel pruning methods have limited effect on start-up latency optimization. Thus, the network is optimized in two proposed stages. In the Global Constraint stage, the constraint is applied to the output of each residual block, and the low-importance channels are pruned. Next, in the Start-up Latency Reduction stage, the redundant residual blocks are identified and pruned with a minimal drop in accuracy. The optimized CNNs are deployed on desktop CPU, Raspberry Pi 4 Model B, GAP8, and FPGA ZCU102. The experimental results show that latency is reduced by up to 70.40%. Furthermore, we demonstrate that the reduction in latency is not only due to the reduction in FLOPs.
Data availability
Data availability is not applicable to this article as no new data were created or analyzed in this study.
References
Chang, Z., Liu, S., Xiong, X., Cai, Z., Tu, G.: A survey of recent advances in edge-computing-powered artificial intelligence of things. IEEE Internet Things J. 8(18), 13849–13875 (2021)
Kopetz, H., Steiner, W.: Internet of Things, pp. 325–341. Springer, Cham (2022)
Wang, X., Magno, M., Cavigelli, L., Benini, L.: Fann-on-mcu: an open-source toolkit for energy-efficient neural network inference at the edge of the internet of things. IEEE Internet Things J. 7(5), 4403–4417 (2020). https://doi.org/10.1109/JIOT.2020.2976702
Mittal, S.: A survey on optimized implementation of deep learning models on the nvidia jetson platform. J. Syst. Architect. 97, 428–442 (2019). https://doi.org/10.1016/j.sysarc.2019.01.011
Yue, X., Li, H., Meng, L.: An ultralightweight object detection network for empty-dish recycling robots. IEEE Trans. Instrum. Meas. 72, 1–12 (2023)
Yue, X., Meng, L.: Yolo-msa: A multi-scale stereoscopic attention network for empty-dish recycling robots. IEEE Trans. Instrum. Meas. 72, 1–14 (2023)
Yang, Q., Meng, H., Gao, Y., Gao, D.: A real-time object detection method for underwater complex environments based on fasternet-yolov7. J. Real-Time Image Proc. 21(1), 8 (2023). https://doi.org/10.1007/s11554-023-01387-4
Ge, Y., Li, Z., Yue, X., Li, H., Li, Q., Meng, L.: Iot-based automatic deep learning model generation and the application on empty-dish recycling robots. Internet of Things 25, 101047 (2023)
Ren, J., Wang, A., Li, H., Yue, X., Meng, L.: A transformer-based neural network for gait prediction in lower limb exoskeleton robots using plantar force. Sensors 23(14), 6547 (2023)
Kaneko, H., Ishibashi, R., Meng, L.: Deteriorated characters restoration for early Japanese books using enhanced cyclegan. Heritage 6(5), 4345–4361 (2023)
Li, Z., Ge, Y., Wang, X., Yue, X., Meng, L.: Industrial anomaly detection via teacher student network. In: 2023 International Conference on Advanced Mechatronic Systems (ICAMechS), pp. 1–5. IEEE (2023)
Ardiyanto, I.: Edge devices-oriented surface defect segmentation by ghostnet fusion block and global auxiliary layer. J. Real-Time Image Proc. 21(1), 13 (2023). https://doi.org/10.1007/s11554-023-01394-5
Reuther, A., Michaleas, P., Jones, M., Gadepally, V., Samsi, S., Kepner, J.: Survey and benchmarking of machine learning accelerators. In: 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–9. IEEE (2019)
Chen, Y., Zheng, B., Zhang, Z., Wang, Q., Shen, C., Zhang, Q.: Deep learning on mobile and embedded devices: state-of-the-art, challenges, and future directions. ACM Comput. Surv. (CSUR) 53(4), 1–37 (2020)
Li, H., Meng, L.: Hardware-aware approach to deep neural network optimization. Neurocomputing 559, 126808 (2023)
Li, Z., Li, H., Meng, L.: Model compression for deep neural networks: a survey. Computers 12(3), 60 (2023)
Chen, J., Mao, Q., Bao, Y., Huang, Y., Meng, F., Liang, Y.: Lightweight parameter de-redundancy demoiréing network with adaptive wavelet distillation. J. Real-Time Image Proc. 21(1), 6 (2023). https://doi.org/10.1007/s11554-023-01386-5
Cheng, Y., Wang, D., Zhou, P., Zhang, T.: Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process. Mag. 35(1), 126–136 (2018)
Nagel, M., Fournarakis, M., Amjad, R.A., Bondarenko, Y., van Baalen, M., Blankevoort, T.: A white paper on neural network quantization. arXiv preprint arXiv:2106.08295 [cs.LG] (2021). https://doi.org/10.48550/arXiv.2106.08295
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Kuang, J., Shao, M., Wang, R., Zuo, W., Ding, W.: Network pruning via probing the importance of filters. Int. J. Mach. Learn. Cybern. 13(9), 2403–2414 (2022)
Li, Y., Gu, S., Mayer, C., Gool, L.V., Timofte, R.: Group sparsity: the hinge between filter pruning and decomposition for network compression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8018–8027 (2020)
Intel: Intel Core i7-9700 Processor. https://www.intel.com/content/www/us/en/products/details/processors/core/i7.html. Accessed 03 Apr 2024
Flamand, E., Rossi, D., Conti, F., Loi, I., Pullini, A., Rotenberg, F., Benini, L.: Gap-8: A risc-v soc for ai at the edge of the iot. In: 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 1–4. IEEE (2018)
Kathail, V.: Xilinx vitis unified software platform. In: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA ’20, pp. 173–174. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3373087.3375887
Upton, E., Halfacree, G.: Raspberry Pi User Guide. John Wiley, Hoboken (2016)
Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net (2019). https://openreview.net/forum?id=rJl-b3RcF7. Accessed 12 Jan 2024
He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4340–4349 (2019)
Sui, Y., Yin, M., Xie, Y., Phan, H., Zonouz, S., Yuan, B.: CHIP: CHannel independence-based pruning for compact neural networks, vol. 29, pp. 24604–24616 (2021). https://www.scopus.com/inward/record.uri?eid=2-s2.0-85127800793&partnerID=40&md5=7b58c749bd99ef797e6ace65945782be. Accessed 20 Dec 2023
Tang, Y., Wang, Y., Xu, Y., Deng, Y., Xu, C., Tao, D., Xu, C.: Manifold regularized dynamic network pruning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, June 19-25, 2021, pp. 5018–5028. Computer Vision Foundation / IEEE (2021). https://doi.org/10.1109/CVPR46437.2021.00498. https://openaccess.thecvf.com/content/CVPR2021/html/Tang_Manifold_Regularized_Dynamic_Network_Pruning_CVPR_2021_paper.html. Accessed 04 Jan 2024
Lin, M., Ji, R., Wang, Y., Zhang, Y., Zhang, B., Tian, Y., Shao, L.: Hrank: filter pruning using high-rank feature map. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 1526–1535. Computer Vision Foundation / IEEE (2020). https://doi.org/10.1109/CVPR42600.2020.00160. https://openaccess.thecvf.com/content_CVPR_2020/html/Lin_HRank_Filter_Pruning_Using_High-Rank_Feature_Map_CVPR_2020_paper.html. Accessed 20 Dec 2024
Jorge, P., Sanyal, A., Behl, H.S., Torr, P.H.S., Rogez, G., Dokania, P.K.: Progressive skeletonization: trimming more fat from a network at initialization. CoRR abs/2006.09081 (2020). https://arxiv.org/abs/2006.09081
Raihan, M.A., Aamodt, T.: Sparse weight activation training. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 15625–15638. Curran Associates, Inc., (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/b44182379bf9fae976e6ae5996e13cd8-Paper.pdf. Accessed 14 Nov 2024
Kusupati, A., Ramanujan, V., Somani, R., Wortsman, M., Jain, P., Kakade, S., Farhadi, A.: Soft threshold weight reparameterization for learnable sparsity. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 5544–5555. PMLR, (2020). https://proceedings.mlr.press/v119/kusupati20a.html. Accessed 4 Jan 2024
Liu, J., Xu, Z., Shi, R., Cheung, R.C.C., So, H.K.: Dynamic sparse training: find efficient sparse network from scratch with trainable masked layers. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, (2020). https://openreview.net/forum?id=SJlbGJrtDB. Accessed 4 Jan 2024
Chen, Y., Wen, X., Zhang, Y., Shi, W.: Ccprune: collaborative channel pruning for learning compact convolutional networks. Neurocomputing 451, 35–45 (2021)
Atashgahi, Z., Pieterse, J., Liu, S., Mocanu, D.C., Veldhuis, R., Pechenizkiy, M.: A brain-inspired algorithm for training highly sparse neural networks. Mach. Learn. 111(12), 4411–4452 (2022)
Ding, X., Hao, T., Tan, J., Liu, J., Han, J., Guo, Y., Ding, G.: Resrep: lossless cnn pruning via decoupling remembering and forgetting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4510–4520 (2021)
Guo, S., Wang, Y., Li, Q., Yan, J.: Dmcp: differentiable markov channel pruning for neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Hussain, H., Tamizharasan, P.S., Rahul, C.S.: Design possibilities and challenges of dnn models: a review on the perspective of end devices. Artif. Intell. Rev. 55(7), 5109–5167 (2022). https://doi.org/10.1007/s10462-022-10138-z
Chen, Y., Zheng, B., Zhang, Z., Wang, Q., Shen, C., Zhang, Q.: Deep learning on mobile and embedded devices: State-of-the-art, challenges, and future directions. ACM Comput. Surv. (2020). https://doi.org/10.1145/3398209
Pullini, A., Rossi, D., Loi, I., Tagliavini, G., Benini, L.: Mr. wolf: an energy-precision scalable parallel ultra low power soc for iot edge processing. IEEE J. Solid-State Circuits 54(7), 1970–1981 (2019). https://doi.org/10.1109/JSSC.2019.2912307
GreenWaves-Technologies: GAP SDK. https://github.com/GreenWaves-Technologies/gap_sdk. Accessed 14 Jun 2023
GreenWaves-Technologies: NNTOOL. https://github.com/GreenWaves-Technologies/gap_sdk/tree/master/tools/nntool. Accessed 14 Jun 2023
GreenWaves-Technologies: AutoTiler. https://greenwaves-technologies.com/manuals/BUILD/AUTOTILER/html/index.html. Accessed 14 Jun 2023
Bruschi, N., Haugou, G., Tagliavini, G., Conti, F., Benini, L., Rossi, D.: Gvsoc: a highly configurable, fast and accurate full-platform simulator for risc-v based iot processors. In: 2021 IEEE 39th International Conference on Computer Design (ICCD), pp. 409–416 (2021). https://doi.org/10.1109/ICCD53106.2021.00071
AMD: Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit. https://www.xilinx.com/products/boards-and-kits/ek-u1-zcu102-g.html. Accessed 4 Jan 2024
Li, H., Yue, X., Wang, Z., Chai, Z., Wang, W., Tomiyama, H., Meng, L.: Optimizing the deep neural networks by layer-wise refined pruning and the acceleration on fpga. Computational Intelligence and Neuroscience 2022(1), 8039281 (2022)
Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML, Long Beach, California, USA, vol. 97, pp. 6105–6114 (2019)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., (2019). https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: Mobilenetv2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9 (2015). https://doi.org/10.1109/CVPR.2015.7298594
AMD: Xilinx Vitis AI Profiler. https://github.com/Xilinx/Vitis-AI/tree/3.0/examples/vai_profiler. Accessed 4 Jan 2024
Tang, Y., Wang, Y., Xu, Y., Tao, D., Xu, C., Xu, C., Xu, C.: SCOP: scientific control for reliable neural network pruning, vol. 2020-December (2020). https://www.scopus.com/inward/record.uri?eid=2-s2.0-85104192854&partnerID=40&md5=97f5fe30d2d7d7e3d3519412d1ffb44a. Accessed 20 Jun 2023
Wu, P., Huang, H., Sun, H., Liang, D., Liu, N.: Cprnc: channels pruning via reverse neuron crowding for model compression. Comput. Vis. Image Underst. 240, 103942 (2024)
Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., Hu, X.: Pruning from scratch. Proc. AAAI Conf. Artif. Intell. 34, 12273–12280 (2020)
Li, Y., Gemert, J.C., Hoefler, T., Moons, B., Eleftheriou, E., Verhoef, B.-E.: Differentiable transportation pruning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16957–16967 (2023)
Chen, Z., Xu, T.-B., Du, C., Liu, C.-L., He, H.: Dynamical channel pruning by conditional accuracy change for deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(2), 799–813 (2020)
He, Y., Ding, Y., Liu, P., Zhu, L., Zhang, H., Yang, Y.: Learning filter pruning criteria for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2009–2018 (2020)
Funding
Open Access funding provided by Ritsumeikan University. This research is supported by the KIOXIA Corporation.
Author information
Contributions
Q.L.: conceptualization, methodology, software, data collection, manuscript writing, and funding acquisition. H.L.: methodology and investigation. L.M.: supervision and resources. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Q., Li, H. & Meng, L. A generic deep learning architecture optimization method for edge device based on start-up latency reduction. J Real-Time Image Proc 21, 116 (2024). https://doi.org/10.1007/s11554-024-01496-8
DOI: https://doi.org/10.1007/s11554-024-01496-8