1 Introduction

DNNs have been widely applied in various domains with excellent performance, including computer vision (Ozaki and Kuroda 2021; Xie et al. 2021; Yue et al. 2022; Meng et al. 2018), self-driving vehicles (Cardoso et al. 2020; Chernikova et al. 2019), natural language processing (Otter et al. 2021; Chai et al. 2021; Lin et al. 2022), cultural heritage preservation (Chen 2021; Fujikawa et al. 2022), human healthcare (Zhang et al. 2020; Chen et al. 2020; Xiang et al. 2021; Zhang et al. 2021; Pokaprakarn et al. 2022; Patel et al. 2022), etc. Meanwhile, the demand for high-speed and energy-efficient DNN processing keeps increasing.

The computation of DNNs can be split into two main phases: training and inference (Sebastian et al. 2019). The training phase creates the parameters of a network, such as the weights of the convolutional layers, by learning from data, and then fine-tunes the model by pruning and optimising it. The inference phase is the deployment of the trained neural network model. Demand for deep learning inference is rising sharply as model performance improves significantly and applications become widespread. Previous research reported that a significant fraction of future computation demand will come from workloads corresponding to deep learning inference (Park et al. 2018; Jeong et al. 2022).

GPUs have proved their superior capabilities in DNN computation and have played an integral role in the development of deep learning, thanks to the high degree of generic parallelism required by deep learning models (Kim et al. 2017; Mittal and Vaishay 2019), and they play an essential role in improving the performance of DNN tasks. CPUs, on the other hand, remain a preferred platform for DNN applications, as they continue to evolve in semiconductor design, semiconductor manufacturing technology, and CNN software frameworks (Liu et al. 2019). Modern CPUs include multiple techniques to improve their performance and speed up data processing, e.g., thread-level parallelism (Flautner et al. 2000), SIMD, etc. In addition, the oneAPI Deep Neural Network Library (oneDNN), a high-performance library of basic building blocks for deep learning applications, has been released to help developers improve productivity and enhance the performance of deep learning applications on CPU platforms [25]. Furthermore, on x86-64 CPUs, oneDNN detects the instruction set architecture (ISA) automatically at runtime and adopts just-in-time (JIT) code generation to produce optimised code that exploits the latest features of the ISA [25]. With the support of oneDNN, CPU performance for DNNs is greatly enhanced.

For embedded systems, such as mobile devices and robotics, the hardware resources are quite constrained while high application efficiency is demanded (Mazzia et al. 2020). Neural networks require massive computational and memory resources, which makes them, even the lightweight architectures, hard to deploy in such environments. Thus, optimising DNNs has been a major research topic (Chung and Abdelrahman 2020; Lee et al. 2020; Mazumder et al. 2021) and has produced a series of advances, such as accelerating multi-DNN workloads based on heterogeneous dataflow (Kwon et al. 2021), compressing DNNs with vectorized weight quantization (Gong et al. 2021), delay-aware DNN inference (Li et al. 2021), error compensation for low-voltage DNN accelerators (Ji et al. 2021), pruning of redundancy in DNNs (Chen et al. 2021, 2021; Ahn and Kim 2022; Camci et al. 2022), etc. Although the performance of DNNs has been greatly improved with the support of these optimisation techniques as well as high-efficiency libraries like oneDNN, there is still much room for further improvement. Indeed, previous work (Goel et al. 2020) has revealed that many computations in DNNs are low impact, and feature-map sparsity has yet to be utilised sufficiently. In particular, CPU-based DNN applications have not been fully researched, and CPU features are still far from fully exploited for DNN workloads. In this paper, we conduct a systematic study of the DNN inference workload for a set of prominent networks, with an emphasis on SIMD CPU architectures. The study includes:

  • The kernel-instruction level insights into the inference process of DNNs on CPUs with the SIMD feature, including the number of instructions, branches, branch mispredictions, cache misses, etc., as well as the overall characteristics of DNNs, including the network architectures, model size, number of parameters, and floating-point operations (FLOPs) of the networks.

  • The layer-wise time performance of DNN architectures, identifying their main time-consuming bottlenecks. This property is determined by the exact workload of the network architecture and is closely related to the features of the hardware.

  • The layer-wise dynamic activation sparsity, which reflects the exact layer-by-layer redundancy and provides a clue for enhancing the sparsity. This property is determined entirely by the architecture of the DNN.

  • The fundamental convolution computing mechanism at the SIMD instruction level, which gives researchers a thorough understanding of convolution operations on CPUs.

The main contribution of the paper is to provide guidelines at both the hardware and software levels for optimising DNNs, with an emphasis on SIMD CPUs, while remaining widely applicable to other hardware platforms including FPGAs, ASICs, and GPUs. The study uncovers several interesting findings on the characteristic properties of the DNN inference process and provides insightful references for identifying the performance bottlenecks of DNNs. Furthermore, it offers useful guidance for designing new hardware architectures using FPGAs or heterogeneous computation units to accelerate DNN performance.

The rest of the paper is organised as follows: In Section 2, we first present the detailed architectures and characteristics of the deep learning models studied in the paper, and then discuss in detail why CPUs outperform GPUs in many aspects of the DNN inference workload. In Section 3, we introduce the experimental methodology. In Section 4, we analyse the characteristics of DNNs on CPUs with the SIMD feature. In Section 5, we study the layer-wise time performance, which also reveals possible time-consuming bottlenecks. In Section 6, we investigate the layer-wise redundancy of DNNs through their sparsity in detail and provide a hint for enhancing the input sparsity of the convolutional layers. In Section 7, we present the principle of the convolution operation on CPUs with SIMD instructions. Finally, we conclude the paper in Section 8.

2 Related work

In this section, we first illustrate the precise structures of DNNs and the general features of the state-of-the-art DNN architectures studied in the paper. The aspects in which CPUs outperform GPUs for inference workload are then discussed in depth.

2.1 Deep neural networks

The DNN architectures studied in this research can be broadly classified into two categories, modern and lightweight, covering seven significant DNN architectures in total. Modern DNNs include VGGNet (Simonyan and Zisserman 2015), InceptionNet (Szegedy et al. 2015; Ioffe and Szegedy 2015; Szegedy et al. 2016, 2017), ResNet (He et al. 2016) and DenseNet (Huang et al. 2017); lightweight DNNs refer to the lightweight architectures of ShuffleNet V2 (Zhang et al. 2018; Ma et al. 2018), MobileNet V2 (Howard et al. 2017) and MnasNet (Tan et al. 2019), which have been specifically designed for resource-constrained devices. There are, of course, a wealth of excellent DNN networks besides the ones listed above; nevertheless, nearly all of them stem from these seven architectures: Res2Net (Gao et al. 2021) builds on ResNet; CondenseNet V2 (Yang et al. 2021) and MS-DenseNet (Xie et al. 2021) build on DenseNet; GhostNet (Han et al. 2020) takes reference from ResNet and MobileNet V2; etc. In addition, the backbones of object detection networks such as YOLO V4 (Bochkovskiy et al. 2020), Faster R-CNN (Ren et al. 2015), and DETR (Carion et al. 2020) are all built on these DNN architectures. Hence, the research in this paper focuses on the seven DNN architectures. Since each architecture has multiple models with different versions, channels, layers, etc., only representative models are taken for each architecture in the experiments. The detailed architectures of the DNNs are analysed as follows.

2.1.1 AlexNet

AlexNet was the first deep neural network to achieve considerable accuracy. The model won the ImageNet LSVRC-2012 challenge by reducing the top-5 error rate to 15.3\(\%\), far lower than the runner-up's 26.2\(\%\). AlexNet is similar to LeNet except for its depth, with more filters per layer and stacked convolutional layers, as Table 1 shows.

Table 1 Architecture of AlexNet (Krizhevsky et al. 2012)

2.1.2 VGGNet

The VGGNet architecture is an improvement over AlexNet in that it employs a small receptive field, i.e., multiple [\(3\times 3\)] kernel filters stacked one after another instead of large kernel-sized filters (AlexNet uses [\(11\times 11\)] and [\(5\times 5\)]); for example, two stacked [\(3\times 3\)] convolutions cover the same receptive field as one [\(5\times 5\)] convolution while using fewer parameters. The depth of the architecture is pushed to 11–19 weight layers. VGGNet, as well as InceptionNet, demonstrates that the depth of CNNs with multiple stacked small-kernel filters is a vital factor for network performance. In theory, increasing the depth of the network with multiple non-linear layers enables it to learn more complicated patterns. The details of the network are presented in Table 2.

Table 2 Architecture of VGGNet (Simonyan and Zisserman 2015)

2.1.3 InceptionNet

Although increasing both the depth and width of layers improves the performance of neural networks, there are also drawbacks: the enlarged network has a massive number of parameters and dramatically increases computational resource usage. With these problems in mind, Inception architectures with multi-level feature extractors were proposed (Szegedy et al. 2015; Ioffe and Szegedy 2015; Szegedy et al. 2016, 2017).

The computational resources inside a network are used more efficiently in the Inception architectures as its depth and width increase. Figures 21 and 22 in the Appendix detail the block architectures. The modules act as a “multi-level feature extractor” by computing [\(1\times 1\)], [\(3\times 3\)], [\(5\times 5\)], [\(7\times 7\)], etc., convolutions within an Inception module, which enables the network to capture information at multiple scales. The outputs of these filters are then concatenated and passed to the next layer.

In this paper, GoogLeNet and Inception v3 are used to analyse the performance of the Inception architecture. However, the GoogLeNet studied in this research differs slightly from that of (Szegedy et al. 2015) due to later optimisations, as Table 3 shows. The architecture of Inception v3 is shown in Table 4.

Table 3 Architecture of GoogLeNet (Szegedy et al. 2015)
Table 4 Architecture of Inception v3 (Szegedy et al. 2016)

2.1.4 ResNet

As neural networks get deeper, a degradation problem arises: accuracy saturates and then degrades rapidly during training (He et al. 2016), which indicates that deeper DNNs are more difficult to train. (He et al. 2016) proposed the deep residual learning framework to address this problem by introducing “residual learning”. As shown in Fig. 23 in the Appendix, the proposal utilizes “shortcut connections” that skip over some layers with an identity function to mitigate vanishing gradients.

Table 5 Architecture of ResNets (He et al. 2016)

Table 5 depicts the architectures of ResNet. The convolution operations are divided into five stages on the basis of output size and channels. There is a major difference between ResNet34 and ResNet50/101, as Fig. 23 in Appendix shows, in that the 2-layer block of ResNet34 with two [\(3\times 3\)] convolutions is replaced by a 3-layer bottleneck block with [\(1\times 1\)] - [\(3\times 3\)] - [\(1\times 1\)] convolutions in ResNet50/101. ResNet50/101 utilizes the [\(1\times 1\)] layers to reduce and then increase the dimensions, leaving the [\(3\times 3\)] layer a bottleneck with smaller input/output dimensions.
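To make the bottleneck design concrete, the following PyTorch sketch (a minimal illustration, not the torchvision implementation; the module and argument names are ours) shows how a 3-layer bottleneck block reduces the channel dimension with a [\(1\times 1\)] convolution, applies the [\(3\times 3\)] convolution on the narrowed tensor, expands back with a second [\(1\times 1\)] convolution, and adds the shortcut connection.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """3-layer bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus a shortcut connection."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)   # shrink channels
        self.conv3 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                               padding=1, bias=False)                        # 3x3 on fewer channels
        self.expand = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)   # restore channels
        self.bn1, self.bn2, self.bn3 = (nn.BatchNorm2d(mid_ch),
                                        nn.BatchNorm2d(mid_ch),
                                        nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)
        # identity shortcut; a 1x1 projection is needed when the shapes differ
        self.shortcut = (nn.Identity() if in_ch == out_ch and stride == 1
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False))

    def forward(self, x):
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv3(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + self.shortcut(x))

x = torch.randn(1, 256, 56, 56)
print(Bottleneck(256, 64, 256)(x).shape)   # torch.Size([1, 256, 56, 56])
```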

2.1.5 DenseNet

(Huang et al. 2017) proposed the Dense Convolutional Network (DenseNet) to strengthen feature propagation by exploiting the potential of feature reuse: for each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs to all subsequent layers. The mechanism is detailed in the following equation.

$$\begin{aligned} x_l=H_l([x_0,x_1,...,x_{l-1}]) \end{aligned}$$
(1)

where \(x_l\) denotes the feature-maps of the \(l^{th}\) layer, \(H_l\) is a composite function of operations such as Convolution, Batch Normalisation, rectified linear unit (ReLU) and Pooling, and \([x_0,x_1,...,x_{l-1}]\) refers to the concatenation of the preceding layers’ outputs.
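A minimal PyTorch sketch of Eq. (1) is given below (illustrative only; the growth rate, channel counts, and class name are our assumptions, not the exact DenseNet configuration): each layer \(H_l\) receives the concatenation of all preceding feature-maps and contributes a fixed number of new ones.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One H_l: BN -> ReLU -> 1x1 conv -> BN -> ReLU -> 3x3 conv, producing `growth` new feature-maps."""
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * growth, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, features):                      # features: [x_0, ..., x_{l-1}]
        return self.h(torch.cat(features, dim=1))     # x_l = H_l([x_0, ..., x_{l-1}])

# every new layer sees the concatenation of all preceding feature-maps
feats = [torch.randn(1, 64, 56, 56)]
for _ in range(3):
    layer = DenseLayer(in_ch=sum(f.shape[1] for f in feats))
    feats.append(layer(feats))
print(feats[-1].shape)   # torch.Size([1, 32, 56, 56])
```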

Table 6 details the architecture of the network. The method alleviates the vanishing-gradient problem, substantially reduces the number of parameters, and strengthens feature propagation.

Table 6 Architecture of DenseNets (Huang et al. 2017)

2.1.6 Lightweight architectures

Lightweight architectures are designed for resource-constrained hardware platforms, including mobile devices, self-driving cars, and so on. Designing lighter and more efficient architectures is increasingly urgent, because the aforementioned massive computational and memory demands of DNNs make them difficult to deploy on resource-limited platforms. This section analyses three state-of-the-art lightweight architectures.

MobileNet (Howard et al. 2017), designed by Google, is based on a streamlined architecture which utilizes depth-wise separable convolutions to reduce the model size and complexity. Table 7 shows the detailed architecture with Fig. 24 in Appendix detailing the block structure. The method replaces the computationally expensive convolutional layers with a cheaper depthwise separable convolutional layer, i.e., a [\(3\times 3\)] depthwise convolutional layer followed by a [\(1\times 1\)] convolutional layer.
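The depthwise separable pattern can be sketched in PyTorch as follows (a simplified illustration with assumed channel counts, not the exact MobileNet V2 inverted-residual block): a per-channel [\(3\times 3\)] convolution (groups equal to the channel count) is followed by a [\(1\times 1\)] pointwise convolution that mixes channels.

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        # depthwise: groups=in_ch gives every input channel its own 3x3 filter
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU6(inplace=True),
        # pointwise: the 1x1 convolution mixes information across channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU6(inplace=True),
    )

block = depthwise_separable(32, 64)
print(block(torch.randn(1, 32, 112, 112)).shape)   # torch.Size([1, 64, 112, 112])
```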

Table 7 Architecture of MobileNet V2 (Howard et al. 2017)

ShuffleNet (Zhang et al. 2018; Ma et al. 2018) utilises pointwise group convolution and channel shuffle to reduce computation cost while maintaining accuracy. Table 8 shows the architecture of the network, with Fig. 25 in the Appendix detailing the block structure. The computational complexity of the costly dense [\(1\times 1\)] convolutions is reduced by adopting pointwise group convolutions, and the side effects caused by group convolutions are overcome by the channel shuffle operation, as sketched below. The suffix of the model name (e.g., 0.5 or 1.0) is a depth multiplier that scales the number of channels used by the convolutional layers.
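A minimal sketch of the channel shuffle operation follows (the group count and tensor shapes are illustrative assumptions): the channels are reshaped into groups, the two channel axes are transposed, and the tensor is flattened back, so that each output group receives channels from every input group.

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups so pointwise group convolutions can exchange information."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # (N, C, H, W) -> (N, g, C/g, H, W) -> swap the two channel axes -> flatten back
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())   # [0, 4, 1, 5, 2, 6, 3, 7]
```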

Table 8 Architecture of ShuffleNet V2 (Ma et al. 2018)

MnasNet was proposed as an automated mobile neural architecture search approach to find the most suitable structure (Tan et al. 2019). Figure 1 shows the detailed architecture of the network. It balances accuracy and latency by incorporating model latency into the main objective, so that the search identifies a model that achieves a good trade-off.

Fig. 1
figure 1

Architecture of MnasNet (Tan et al. 2019). MBConv denotes mobile inverted bottleneck conv; DWConv denotes depthwise conv; H\(\times\)W\(\times\)F denotes tensor shape (height, width, depth), and \(\times\)1/2/3/4 denotes the number of repeated layers within the block

2.2 CPUs for Inference Workload of DNNs

GPUs play a powerful and irreplaceable role in artificial intelligence by improving the performance and speed of the training process of neural networks, thanks to their parallel processing ability. However, GPUs, which organise SIMD processing units under the Single Instruction Multiple Thread (SIMT) model, are designed for compute-intensive tasks. They bring their parallel computing ability into full play only when processing huge amounts of data and computation, i.e., the training process of DNNs with massive training data, large batch sizes, and enormous calculation demands. The inference phase, in contrast, generally refers to a frequent or continuous process with little data to be processed each time, i.e., an inference workload has limited parallelism that cannot be subdivided into parallelizable processes and must be executed in sequence. Moreover, many applications, such as self-driving systems, impose strict real-time requirements. For the inference workload of DNNs, CPUs currently outperform GPUs in many aspects (Park et al. 2018). As the hardware platforms for inference workloads are mainly data centers and terminal devices, we analyse the two platforms in detail in this section.

2.2.1 Inference workload on data centers

Data centers demand a high degree of fault tolerance and reliability as well as sufficient computing availability and flexibility, and are equipped with powerful CPUs. As for the machine learning inference workload on the data center, first, the inference tasks are diverse; second, the types of data and the tensor shapes vary as data comes in a steady stream from the internet; third, most inference workloads demand low latency. For this case, we compare the performance of the inference workload of multiple DNN models for processing a random task request on CPUs and GPUs, i.e., processing a single input, which simulates a real-time task request. The overall running time of the task and the inference time of the DNN processing the input are taken as the metrics. Table 9 shows the experimental results. In terms of the inference time, the GPU is far ahead of the CPU: inference on the GPU is \(3.2\times\) to \(54.2\times\) faster than on the CPU. Nevertheless, the overall time cost of a task, i.e., the running time, is far more than the inference time of the DNN network.

The running time of a task consists of the model load time, which is a heavy burden in such conditions, the preprocessing of inputs, the model inference time for the single input, the post-processing, and so on. According to the experimental results, the running time of modern DNNs on the GPU is \(1.9\times\) to \(7\times\) that on the CPU; for the lightweight architectures, especially ShuffleNet V2\(\_\)1.0, the running time on the GPU is \(16\times\) that on the CPU. It is obvious that the CPU outperforms the GPU in such cases. The data strongly demonstrates the efficiency of CPUs for DNNs in such cases, i.e., inference workloads with limited parallelism and strict real-time requirements.

Table 9 Running time for one task request

Considering these factors, although data centers are also equipped with GPUs, GPUs are inadequate in such cases with limited parallelism. Taking the Facebook data centers as an example, it is CPUs that handle most of their inference workloads (Park et al. 2018).

2.2.2 Inference deployment in terminal devices

Terminal devices, including robotics, auto-driving systems, mobile devices, embedded devices, edge devices, etc., constitute a significant share of DNN inference applications in various domains. First, there are cases where CPUs are the only available resource, such as Raspberry Pi and single-chip microcomputer systems. Second, for mobile devices, GPUs are not the best choice because not all machine learning computations are supported by GPUs, which results in extra CPU-GPU co-processing overhead and is more time consuming (Mittal et al. 2021). (Wu et al. 2019) indicates that only \(11\%\) of smartphones have a GPU that is at least three times as performant as their CPU, and in more than \(40\%\) of smartphones the CPU outperforms the GPU. The theoretical peak performance of mobile CPUs differs little from that of mobile GPUs (Wu et al. 2019). CPUs provide comparable or even better performance than GPUs, especially for the lightweight architectures (Kim et al. 2019). Third, limited memory bandwidth greatly restricts the performance of GPUs in mobile devices, which have no high-bandwidth memory like discrete GPUs. Fourth, GPUs consume more energy, which is a huge burden for power-limited mobile devices (Fong et al. 2017). In addition, the usability and programmability of mobile GPUs are fragile (Wu et al. 2019).

As a matter of fact, for machine learning inference at Facebook, only a small fraction of inference currently runs on mobile GPUs (Wu et al. 2019), while mobile services account for no less than 90% of its advertisement revenue [59]. Research on optimising DNN inference for edge devices has also been making steady progress (Patel et al. 2022): MobileDets outperform MobileNetV3+SSDLite by 1.7 mAP at comparable inference latency on mobile CPUs (Xiong et al. 2021); (de Prado et al. 2021) proposed an automated exploration framework that achieves a \(4\times\) improvement in performance and over \(2\times\) reduction in memory compared to a BLAS floating-point implementation. As a result, CPUs are the better choice for inference workloads on terminal devices in many cases.

2.2.3 Summary

For inference tasks, considering the data and network loading overhead of GPUs (Ma et al. 2019), the data transfer overhead between GPUs and CPUs, the energy consumption of GPUs, the flexibility required by inference tasks with limited parallelism, and the high latency of GPUs, GPUs are at a disadvantage compared with CPUs. CPUs are more suitable for deep learning inference workloads in many cases (Mittal et al. 2021; Kim et al. 2019), and numerous studies have been made to optimise and accelerate DNNs on CPUs (de Prado et al. 2021; Kim et al. 2020; Low et al. 2020; Putro et al. 2021). Thus, in this paper, we focus on the study of DNN inference performance on CPUs.

3 Preliminary

This section provides the preliminaries of the experimental setup. In terms of machine learning frameworks, there are multiple representative platforms, such as TensorFlow, PyTorch, Keras, etc. Because the experiments in this study depend on the architecture of the networks rather than on the machine learning framework, we simply adopt PyTorch (version 1.7.0a0) in this paper.

The hardware platform is a Dell OptiPlex 700 equipped with an Intel i7-9700 CPU at 3.00GHz and 2\(\times\)32GB DDR4 main memory. The operating system of our experimental platform is Ubuntu 20.04. The experiments adopt the networks pretrained on the ImageNet ILSVRC 2012 dataset in PyTorch. The training dataset does not influence the analysis of the characteristics of DNNs on the SIMD-CPU architecture in Section 4 or the layer-wise time performance in Section 5. The experimental results in Sections 4 and 5 correspond to the inference of a single input and are averaged over 1000 inputs. In the experiments, the input images are unified to RGB channels and centre-cropped to [\(224 \times 224\)] for input consistency across all networks. The inputs are also pre-processed with mean and standard deviation normalisation.
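The following PyTorch sketch illustrates the single-input inference setup assumed above (the resize step, the dummy image, and the choice of ResNet34 are our assumptions used only for illustration; the mean/std values are the standard ImageNet statistics expected by the pretrained torchvision models).

```python
import torch
from PIL import Image
from torchvision import transforms, models

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                       # centre-crop to 224x224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet mean/std normalisation
                         std=[0.229, 0.224, 0.225]),
])

img = Image.new("RGB", (256, 256))                    # stand-in for an ILSVRC 2012 image
model = models.resnet34(pretrained=True).eval()       # pretrained ImageNet weights
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))      # inference for one input
print(logits.shape)                                   # torch.Size([1, 1000])
```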

4 Characteristics of DNNs on SIMD-CPU Architecture

SIMD instructions are supported by most CPU architectures, for example, Intel’s Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX), AVX2, and AVX512, which were also quickly adopted by AMD except for AVX512; ARM’s Neon technology; and AltiVec, designed and owned by Apple, IBM, and Freescale Semiconductor. In this section, the characteristics of the DNN inference process are analysed on the SIMD-CPU architecture by utilising the profiling tool perf for Linux 2.6+ based systems to gather performance counter statistics. Table 10 shows the experimental results, which include FLOPs, the number of instructions (num. of inst.), branches, mispredicted branches (branch misses) and cache misses. Furthermore, the inference result indexes are also listed, such as the top-1 accuracy and inference time. The model sizes and parameter counts are also listed in Table 10 given their importance.

As DNNs differ greatly in exploiting inter-operation parallelism, thread-level parallelism is disabled in the experiments to achieve a fair comparison for different networks.
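A sketch of how such counters could be collected is shown below (the script name and the core pinning are hypothetical; the event names are the standard perf hardware events corresponding to the columns of Table 10).

```python
import subprocess

# perf prints the counter statistics to stderr after the profiled command finishes.
cmd = [
    "perf", "stat",
    "-e", "instructions,branches,branch-misses,cache-misses",
    "taskset", "-c", "0",             # pin to one core; thread-level parallelism disabled
    "python", "infer_one_input.py",   # hypothetical script running a single-input inference
]
subprocess.run(cmd, check=True)
```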

Table 10 Characteristics of inference process on SIMD-CPU architecture

The valuable findings from Table 10 are as follows:

FLOPs: FLOPs indicate the complexity of the models. The index is hardware-independent and depends only on the architecture of the network. FLOPs constitute the main workload of DNNs, and the computation concerning the convolution operation is highly optimised and accelerated with SIMD instructions, as detailed in Section 7. As the table shows, the VGGNet models have relatively few layers, only 13–19 weight layers, but the architecture takes the most inference time of 256.03–384.84 ms, with the most instructions of 2577.19–4302.47 million and FLOPs of 11.44–19.77 G. ShuffleNet V2\(\_\)1.0 takes the least inference time of 9.57 ms with the fewest instructions of 55.34 million and 0.05 G FLOPs, and the size of ShuffleNet V2\(\_\)1.0 is also the smallest.

Num. of inst.: The number of instructions for a DNN is related not only to the architecture of the network but also to the characteristics of the CPU platform, i.e., the instruction set supported by the CPU. The paper gives the exact number of instructions for the various DNNs in this section. Since most instructions implement the FLOPs, the two indexes are closely correlated. It is obvious that the VGGNet architecture takes the most instructions and the lightweight networks take the fewest.

Branches and branch misses: For CPUs, branch predictors play a critical role in achieving high performance, and the number of branch misses is a vital factor for modern deeply pipelined microprocessor architectures (Meng et al. 2012). Branch instructions and misses are major obstacles to pipelining. Models with fewer branch instructions and misses generally perform relatively well. Among these models, GoogLeNet has the maximum number of branch misses. Lightweight architectures benefit from fewer branch instructions and misses. However, the branch miss rate of ShuffleNet V2\(\_\)1.0 is 1.04\(\times\)–2.19\(\times\) higher than that of the others, while MobileNet V2 has the lowest branch miss rate. Together, the two indexes reflect the importance and the performance of the branch predictors of current CPUs.

The branches per kilo instructions (BPKI) of the lightweight neural networks are much higher than those of the other models in general. ShuffleNet V2\(\_\)0.5, which has the fewest instructions, has the highest BPKI at 189.67, about ten times that of VGG19.

Lightweight models also have higher branch misses per kilo instructions (MPKI), with values of no less than 2, except for MobileNet V2 with a value of 0.99.

Cache misses: CPU caches play a critical role in mitigating the performance gap between the CPU and main memory (Yu et al. 2001). A cache miss occurs when the requested data cannot be found in the cache and significantly slows down the overall execution process. The index is closely related to the size of the models, the number of parameters, the amount of data processed, etc. Therefore, models with a larger size and more parameters often have more cache misses. As the table shows, lightweight models with smaller sizes and fewer parameters have relatively better performance. For example, VGG19\(\_\)BN and ShuffleNet V2\(\_\)1.0 have 22.00 million and 1.22 million cache misses, respectively.

Model size and parameters: The size and the number of parameters are important properties of DNN models and depend only on the architecture. In general, models with smaller sizes and fewer parameters use less computation and memory, and are usually more computationally and memory efficient. However, this is not always the case; the exact architecture of the network has a large effect. For example, although the model size of MnasNet 1.0 is 24.26% larger than that of MobileNet V2, the time cost of MnasNet 1.0 is 17.14% less than that of MobileNet V2. The reason is that the architecture of MnasNet 1.0 is less complex than that of MobileNet V2: as detailed in Table 10, MnasNet 1.0 has \(8.58\%\) fewer instructions, \(4.43\%\) fewer branches, \(7.69\%\) fewer branch misses, and \(19.93\%\) fewer cache misses than MobileNet V2, all of which are determined by the network architecture.

Although AlexNet and the VGGNets have far fewer layers, i.e., 5 and 13–16 convolutional layers, respectively, their model size is much larger than that of the others, e.g., VGG19 has a size of 548 MB, about 18 times larger than that of DenseNet121, which has 120 convolutional layers. Since the three lightweight models are designed mainly for resource-constrained devices, they have smaller model sizes and fewer parameters, i.e., no more than 17 MB and 5 million, respectively, which is far lower than those of the other models. It can be seen that the size and parameter count of a DNN have little relation to its number of layers.

Together, the indexes above determine the running time of the networks. This section provides developers with a thorough CPU kernel-instruction level performance analysis of DNNs as well as widely applicable references for various hardware solutions. Although the investigation is conducted on SIMD-CPU architectures, the hardware-independent indexes, i.e., FLOPs, model size, and parameters, provide widely applicable information for various hardware platforms.

The hardware-related indexes, i.e., the number of instructions, branches and branch misses, and cache misses, give developers a thorough kernel-instruction level understanding of the inference phase of various DNNs on CPU architectures, with detailed performance differences across state-of-the-art DNN architectures. The investigation also indicates possible research directions. For example, branch misses are still very high for all DNNs except the lightweight architectures, so further work on reducing branch misses, which are key to pipelining, may greatly speed up the inference process. It also needs to be noted that there are major differences between CPUs and other hardware platforms: prevalent GPUs do not have branch predictors, the cache behaviour of GPUs is quite different from that of CPUs, and FPGAs and ASICs are integrated-circuit solutions that are completely different from both CPUs and GPUs. Nevertheless, the hardware-related indexes also provide hints for other hardware platforms. For example, current branch prediction techniques are generic across tasks, whereas a branch predictor specialised for DNNs may be more efficient; a hardware accelerator design, such as an FPGA or ASIC with such a specialised branch predictor, may effectively improve the inference process by exploiting the dynamic activation sparsity analysed in Section 6.

5 Performance Analysis of Layers

The running time is crucial for evaluating the efficiency of DNNs. An exhaustive time performance analysis of DNN architectures provides well-grounded knowledge of their time consumption, as well as accurate information for designing a high-performance network with time efficiency in mind. A DNN typically consists of multiple layers of artificial neurons that correspond to convolutional layers, non-linear scalar operator layers, pooling layers, FC layers, etc. In this section, a layer-wise time performance analysis is conducted on the DNNs by utilising the profiler integrated into PyTorch to profile each CPU kernel implementation and gain insight into neural network performance. The PyTorch profiler provides performance metrics for the training and inference process, including input shapes and stack traces, kernel activity, and the execution trace.
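A minimal sketch of the layer-wise profiling procedure is shown below (the choice of VGG16 and the random input are illustrative; the calls use the autograd profiler API available in the PyTorch version adopted in Section 3).

```python
import torch
from torchvision import models

model = models.vgg16(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)

# record CPU kernel times and input shapes for a single inference pass
with torch.no_grad():
    with torch.autograd.profiler.profile(record_shapes=True) as prof:
        model(x)

# aggregate time per operator/shape; individual events can be mapped back to
# convolution, pooling and fully-connected layers
print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="cpu_time_total", row_limit=15))
```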

5.1 AlexNet

Figure 2 details the layer-wise time performance of AlexNet, whose architecture is shown in Table 1. The five convolutional layers account for about 47\(\%\) of the inference time, and the last three FC layers take more than 47\(\%\) of the inference process, with the first FC layer consuming more than 32%. The time cost of the remaining layers is negligible.

Fig. 2
figure 2

AlexNet layer-wise running time C: Convolution; R: ReLU; M: MaxPool; A: AdaptiveAvgPool; L: Fully-connected

5.2 VGGNet

Figure 3 shows the VGGNets layer-wise time consumption: the three versions differ only in the number of convolutional layers. The convolution operation can be divided into five stages from conv1\(\_\)x to conv5\(\_\)x in accordance with the input and output. For the first four stages, the first convolutional layer has fewer input channels compared with the other convolutional layers within the stage, resulting in a reduced time consumption for the layer.

In summary, convolution operations dominate the time cost of the inference process, accounting for up to 80\(\%\). The layer-wise time cost is consistent for convolutional layers that have the same input and output shapes, for example, the last three convolutional layers of stages conv3\(\_\)x and conv4\(\_\)x, and the four convolutional layers of conv5\(\_\)x in VGG19. The max-pooling and fully-connected layers together take around 22\(\%\) of the inference process, and the time cost of the other layers is negligible.

Fig. 3
figure 3

VGGNet layer-wise running time C: Convolution; R: ReLU; M: MaxPool; A: AdaptiveAvgPool; L: Fully-connected

Despite concerns that VGGNets demand a longer training time and require more weights, leading to a heavy architecture as Table 10 shows, the architecture is still a favoured choice on many occasions for its strength in feature extraction and transfer learning tasks.

5.3 InceptionNet

Figure 4 shows the layer-wise running time of GoogLeNet and Inception v3. The patterns in the figure exhibit a clear periodicity arising from the stacked “Inception module” architecture. The layer sequence shown in the figure, which corresponds to the block structure, runs from left to right and bottom to top. Apart from the convolution operations, the thirteen layers with a percentage higher than 0.5\(\%\) in GoogLeNet and the ten layers above 1\(\%\) in Inception v3 are all max-pooling layers. As a result, the pooling operation occupies a large fraction of the inference process, which is quite distinct from the VGGNet architecture.

Fig. 4
figure 4

InceptionNet layer-wise running time Blue: convolutional layers. Orange: others

5.4 ResNet

Figure 5 shows the layer-wise time performance of ResNet34 and ResNet101. Apart from the single max-pooling layer, which has the highest per-layer time cost, convolution computing accounts for nearly 90\(\%\) of the inference time. The time consumption of the ResNet architectures also changes periodically, which agrees well with the residual block in Fig. 23. The three downsample layers take little time in ResNet34, no more than 0.5\(\%\) of the total. In ResNet101, the convolutional layers with [\(1\times 1\)] kernels take only about half the time of the convolutional layer with a [\(3\times 3\)] kernel within the block in terms of single-layer time consumption. The time cost of the remaining layers is negligible.

Fig. 5
figure 5

ResNet layer-wise running time Blue: convolutional layers. Orange: others

5.5 DenseNet

Figure 6 shows the layer-wise time performance of the DenseNet architecture: the fourth bar in the histogram represents the only max-pooling layer, which has the highest single-layer time cost. The four groups of bars from DB1 to DB4, whose values increase linearly, correspond to the pointwise convolutional layers, as their input channels increase linearly. The bars within each group with equal values correspond to the convolutional layers with [\(3\times 3\)] kernels. The regularities in the figure largely reflect the architecture of the network.

Fig. 6
figure 6

DenseNet layer-wise running time Blue: convolutional layers. Orange: others. DB: Dense Block; TL: Transition Layer

5.6 Lightweight Architectures

This section investigates the layer-wise time performance of the three lightweight architectures.

MobileNet: The layer-wise time consumption of MobileNet V2 is shown in Fig. 7, with convolution computing accounting for about 78.65\(\%\) of the inference time. In addition to the last FC layer, it should be noted that the 10th layer, the batch-normalisation operation in bottleneck 2, has a relatively high single-layer time cost. This characteristic is quite different from the other DNN architectures.

Fig. 7
figure 7

MobileNet V2 layer-wise running time Blue: convolutional layers. Orange: others

ShuffleNet: In terms of ShuffleNet V2_1.0 architecture, the layer-wise time cost performance shown in Fig. 8 represents the characteristics corresponding to the block shown in Fig. 25. In terms of a single layer’s time cost, the only max-pooling layer takes the highest percentage with a value of 12.40\(\%\) for the network. The last fully-connected layer takes a relatively high percentage in time cost.

Fig. 8
figure 8

ShuffleNet layer-wise running time Blue: convolutional layers. Orange: others

MnasNet: The layer-wise time consumption is shown in Fig. 9, with convolution computing accounting for about 79.28\(\%\) of the inference time. The time cost of the first five and the last two convolutional layers is higher than that of the other convolutional layers. The last fully-connected layer takes the highest percentage of the time cost, and the time cost of the other layers is negligible.

Fig. 9
figure 9

MnasNet layer-wise running time Blue: convolutional layers. Orange: others

5.7 Summary

Table 11 Execution time of inference for one input

The overall time cost of convolution operations, as well as its percentage of the model inference time for the aforementioned DNNs, is detailed in Table 11; these results are also averaged over 1000 inputs. Based on the analysis above, we can conclude that the convolution operation dominates the inference workload of DNNs. With the exception of AlexNet, GoogLeNet, and the two ShuffleNet V2 models, which have relatively low convolution time shares of 42.30\(\%\), 50.02\(\%\), 50.09\(\%\) and 63.33\(\%\), respectively, the other networks have much higher shares, ranging from 72.53\(\%\) to 90.34\(\%\). Besides the convolutional layers, the max-pooling and fully-connected layers have a relatively high time cost for InceptionNet, ResNet, DenseNet, and so on, while the time consumption of layers such as ReLU layers is negligible. We analyse this finding further below.

Table 12 Analysis of different computing within DNNs

Taking ResNet34 as an example, Table 12 shows the FLOPs, branches, and branch misses of the first four layers, i.e., the convolutional layer, BN layer, ReLU layer, and max-pooling layer. The convolutional layer clearly performs far more FLOPs than the others, \(73\times\) to \(290\times\) more than the other three layers. As a result, the convolution operation dominates the time cost of DNNs, and the FC layer, which performs similar operations to convolution, also consumes more time than the others. As for the max-pooling layer, although its FLOPs are small compared with the convolutional layer and its number of branch instructions is only about \(1.6\times\) more than that of the convolutional layer, its branch misses are about \(13\times\) more than those of the convolutional layer. Branch instructions and misses are well known to be major obstacles to pipelining and greatly reduce the efficiency of CPUs. As a result, max-pooling takes comparatively much more time in terms of single-layer time cost. In addition, the time cost of other layers within DNNs, such as the ReLU and BN layers, is negligible. The study thus provides researchers with valuable details on the effect of branch misses and their importance for CPU performance.

The layer-wise time performance analysed in this section provides thorough details on the time consumption of various deep learning architectures, along with their possible time-consuming bottlenecks. Although the study in this section is conducted on CPUs and the analysis is closely related to the hardware characteristics, the experimental results also provide a useful reference for other hardware platforms, including GPUs, FPGAs, and so on, because the layer-wise time performance is determined by the workload of each layer, which depends on the DNN architecture and is the same across hardware platforms.

6 Dynamic sparsity investigation

Computation intensity is a heavy burden, especially for hardware with constrained resources such as embedded devices. As analysed in Section 5, it is the convolution operation, with its mass of computation, that takes up most of the workload of DNNs. As a result, reducing the intensity of convolutional computation has been an important research topic. An innate characteristic of DNNs, sparsity, defined in this paper as the fraction of zeros in the parameters and input activation matrices, reflects that DNNs are heavily over-parameterised and contain massive redundant computations that contribute nothing to the result. Thus, exploiting sparsity is an effective solution: the operations involving these zeros are meaningless and can be removed to greatly improve the speed of the network (Guo et al. 2020; Gong et al. 2020).

Sparsity can be simply classified into two categories (Zhou et al. 2018): static and dynamic. Static sparsity refers to sparsity in the parameters of the DNN models (i.e., weight sparsity), while dynamic sparsity arises when the value of a neuron element becomes 0 and contributes nothing to the following neurons. Because static sparsity is determined during the training phase, with extra algorithms used to sparsify the parameters, it belongs to research on the training process. In this paper, we therefore focus on analysing the layer-wise dynamic activation sparsity of the convolutional layer inputs. The experimental results in this section are averaged over 100 inputs. Note that the input sparsity of the first convolutional layer is zero, i.e., the raw input data of the network has no sparsity.
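The layer-wise input sparsity can be measured with forward hooks, as in the sketch below (the choice of ResNet34 and the random input are placeholders; in the actual experiments the preprocessed ImageNet inputs of Section 3 would be fed and the values averaged over 100 inputs).

```python
import torch
import torch.nn as nn
from torchvision import models

def register_input_sparsity_hooks(model, stats):
    """Record, for every convolutional layer, the fraction of zeros in its input activations."""
    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0]
            stats.setdefault(name, []).append((x == 0).float().mean().item())
        return hook
    for name, m in model.named_modules():
        if isinstance(m, nn.Conv2d):
            m.register_forward_hook(make_hook(name))

stats = {}
model = models.resnet34(pretrained=True).eval()
register_input_sparsity_hooks(model, stats)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))     # replace with preprocessed ImageNet inputs

for name, vals in stats.items():
    print(f"{name}: {100 * sum(vals) / len(vals):.1f}% zero-valued inputs")
```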

6.1 AlexNet and VGGNet

Figures 10 and 11 show the overall sparsity trend for the plain models AlexNet and VGGNet. For AlexNet, the input sparsity of the last two convolutional layers reaches about 80\(\%\). For the VGGNets, the input sparsity of a convolutional layer preceded by a max-pooling operation drops noticeably compared with the previous convolutional layer. Moreover, the input sparsity of the last several convolutional layers reaches a high value of no less than 80\(\%\).

Fig. 10
figure 10

AlexNet layer-wise sparsity

Fig. 11
figure 11

VGG layer-wise sparsity

6.2 InceptionNet

Figure 12 shows the details of GoogLeNet and Inception v3. For these “Inception module” stacked architectures, the sparsity of InceptionNet exhibits periodicity across the Inception blocks. In GoogLeNet, the 9th, 15th, 21st, 27th, 33rd, 39th, 44th, and 50th convolutional layers, which correspond to the [\(1\times 1\)] convolutional layer after the max-pooling layer within the Inception block shown in Fig. 21, show an obvious decrease in sparsity. Inception v3 follows the same pattern as GoogLeNet. The overall sparsity of InceptionNet keeps rising as the layers get deeper. The average input sparsity of GoogLeNet and that of Inception v3 are no less than 50\(\%\) and 60\(\%\), respectively.

Fig. 12
figure 12

InceptionNet layer-wise sparsity

Fig. 13
figure 13

ResNet layer-wise sparsity

6.3 ResNet

Figure 13 shows the details of ResNet. In terms of ResNet-34, the figure shows a feature that the input sparsity of the first [\(3\times 3\)] convolutional layer is much less than that of the second one within the residual block, which is shown in Fig. 23, because there is a ReLU function before the second convolutional layer but not for the first convolutional layer within the block. The average input sparsity of the convolutional layers is no less than 50\(\%\).

As for ResNet-101, the input sparsity of the [\(3\times 3\)] convolutional layer is much less than that of the two [\(1\times 1\)] convolutional layers. For example, the input sparsity of the 35th convolutional layer is about 30\(\%\) less than that of the two [\(1\times 1\)] convolutional layers within the residual block (Fig. 23). The average input sparsity of the convolutional layers is no less than 55\(\%\).

6.4 DenseNet

The rising and falling pattern of sparsity shown in Fig. 14 reflects the “bottleneck” architecture of the network: the layer structure of BatchNorm(BN)-ReLU-Conv([\(1\times 1\)])-BN-ReLU-Conv([\(3\times 3\)]). The results show that, for DenseNet, the input sparsity of the [\(1\times 1\)] convolutional layers is clearly less than that of the [\(3\times 3\)] convolutional layers within the Dense Blocks. In general, the sparsity keeps rising as the layers get deeper, and it reaches no less than 90\(\%\) for the last several [\(3\times 3\)] convolutional layers.

Fig. 14
figure 14

DenseNet layer-wise sparsity

6.5 Lightweight architectures

This section investigates the input sparsity of convolutional layers on the three lightweight architectures.

Fig. 15
figure 15

MobileNet V2\(\_\)1.0 layer-wise sparsity

MobileNet: Fig. 15 shows the dynamic sparsity of MobileNet V2. Because there is no ReLU operation before nearly one-third of the convolutional layers, i.e., the last [\(1\times 1\)] convolutional layers of bottleneck 1 and bottleneck 2, and the last convolutional layer of the network, the input sparsity of these convolutional layers is zero. The average dynamic sparsity of the other two-thirds is about 40\(\%\). In particular, it should be noted that the input sparsity of the second-to-last convolutional layer is only about 8%.

Fig. 16
figure 16

ShuffleNet V2 layer-wise sparsity

ShuffleNet: Fig. 16 shows the results of ShuffleNet V2. The sparsity of the right branch [\(3\times 3\)] convolution of Block B, which is shown in Fig. 25, i.e., the 4th, 19th, and 44th convolutional layers, has an obvious increase of about 30 \(\%\) compared with that of the former convolutional layer within the block. Since there is no ReLU function before a number of convolutional layers, i.e., the last [\(1\times 1\)] convolutional layers of Block A and Block B, the input sparsity of about one-third of convolutional layers is zero. The average sparsity of the other two-thirds is about 50 \(\%\).

Fig. 17
figure 17

MnasNet 1.0 layer-wise sparsity

MnasNet: Fig. 17 shows the results of MnasNet 1.0. For the first repeat cycle of MBConv-A and MBConv-B, the sparsity of the corresponding [\(3\times 3\)] and [\(5\times 5\)] convolutional layers is higher than that of the following [\(1\times 1\)] convolutional layer within the block, whereas the remaining repeats show the opposite. As for MBConv-C and MBConv-D, except for the 26th convolutional layer, the sparsity of the corresponding [\(3\times 3\)] and [\(5\times 5\)] convolutional layers is higher than that of the following [\(1\times 1\)] convolutional layer within these blocks.

6.6 Sparsity enhancement

Sparsity in DNNs is largely attributable to the activation function ReLU (Agarap 2018). In this section, we explore enhancing the dynamic sparsity by adjusting the ReLU function at inference time, after the networks have been trained.

The activation function is an extremely important feature of artificial neural networks as it helps a model account for interaction and non-linear effects. This enables DNNs to learn complex patterns from the data. All of the DNNs analysed in the paper use ReLU as the default activation function. The function can be described in the following equation.

$$\begin{aligned} f(x)=\left\{ \begin{array}{ll} 0 &\quad \text{for } x < threshold\\ x &\quad \text{for } x \ge threshold \end{array} \right. \end{aligned}$$

In the formula, the threshold is equal to 0.

To conveniently analyse the effect of the threshold value, the VGGNets were used in the experiments due to their simplicity. First, the distribution of the input data of the VGG16 convolutional layers was investigated. The surveys show that the trend of the data is consistent across inputs. Due to space limits, only three distributions for different inputs to VGG16 are shown in Fig. 18; VGG13 and VGG19 share the same patterns.

Fig. 18
figure 18

VGG16 input value distribution of conv-layers

Figure 18 shows that the numerical distribution of the input is quite different for the first convolutional layer, whose input is the raw input of the network. However, the distribution of the input values tends to be consistent from the second convolutional layer to the last one. The threshold of the ReLU layers is then adjusted in accordance with Fig. 18, and the following experiments are conducted. First, we investigate the corresponding convolutional layer-wise dynamic sparsity and then test the precision variation. The experimental results, averaged over 100 inputs, are shown in Table 13.

A. Thresholds of ReLUs from the 1st ReLU layer to the 12th ReLU layer are changed to 0.1, others remain unchanged.

B. Thresholds of ReLUs from the 1st ReLU layer to the 12th ReLU layer are changed to 0.5, others remain unchanged.

C. Thresholds of ReLUs from the 7th ReLU layer to the 12th ReLU layer are changed to 2, others remain unchanged.
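A minimal sketch of how such a scheme could be applied to a trained VGG16 is given below (the module and function names are ours; scheme B is used as the example, raising the threshold of ReLU layers 1–12 to 0.5 while leaving the rest at zero).

```python
import torch
import torch.nn as nn
from torchvision import models

class ThresholdReLU(nn.Module):
    """ReLU variant from the equation above: output 0 below the threshold t, pass the value otherwise."""
    def __init__(self, t):
        super().__init__()
        self.t = t

    def forward(self, x):
        return torch.where(x >= self.t, x, torch.zeros_like(x))

def raise_relu_thresholds(features, thresholds):
    """Replace the i-th ReLU (1-indexed) of a feature extractor with a thresholded variant."""
    idx = 0
    for i, m in enumerate(features):
        if isinstance(m, nn.ReLU):
            idx += 1
            if idx in thresholds:
                features[i] = ThresholdReLU(thresholds[idx])

vgg16 = models.vgg16(pretrained=True).eval()
raise_relu_thresholds(vgg16.features, {i: 0.5 for i in range(1, 13)})   # scheme B
```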

Table 13 Layer-wise dynamic sparsity of VGG16

According to the experimental results shown in Table 13, zero may not be the best threshold value for the accuracy of the models, as the accuracy increased from 72.2\(\%\) with a zero threshold to 72.4\(\%\) in scheme B, while the sparsity also slightly increased. At the same time, increasing the threshold of trained DNNs generally brings some accuracy loss. With an acceptable accuracy loss, the dynamic sparsity can be significantly increased for a number of convolutional layers by appropriately adjusting the ReLU threshold; for example, conv1 increased by about 34\(\%\) in scheme B, and the last six convolutional layers increased by about 6\(\%\) in scheme C. This indicates a direction for both improving the accuracy and increasing the sparsity of DNNs.

6.7 Summary

The data shows that the activation sparsity of convolutional layers is quite high, with specific characteristics determined by their respective architectures. Even for the lightweight architectures, which have been highly optimised and compressed, the average activation sparsity of two-thirds of the convolutional layers is no less than 40\(\%\). Although dynamic sparsity is closely related to the dataset, previous studies have shown that, as long as the network is fully trained, the dynamic sparsity for other datasets such as CIFAR100 and Kuzushiji is no less than the percentages reported here for ILSVRC 2012 (Li et al. 2021). Thus, the findings are widely applicable to various datasets.

In general, the experimental results in this section convincingly demonstrate that DNNs contain an overwhelming amount of redundant computation in the form of dynamic activation sparsity: the computations involving these zeros are meaningless and can thus be omitted. This indicates that DNNs have significant room for improvement. FPGAs provide one potential solution. For example, (Guan et al. 2020) designed the accelerator Crane with a load-balancing method to address various types of sparsity irregularity in DNNs, achieving a 1.27\(\times\)–1.88\(\times\) speedup compared with state-of-the-art accelerator baselines. Although the irregularity of dynamic sparsity makes it hard to utilise fully, its high percentage leaves much room for further performance improvement, especially on the SIMD-CPU architecture. The dynamic activation sparsity analysed in this section thus provides valuable references, with detailed layer-wise redundancy, for optimising and accelerating DNNs.

7 Convolution operation on CPU

As analysed in Section 5, convolution operations dominate the time consumption of DNNs. We therefore analyse in depth the principle of convolutional computing on modern CPUs, to provide potential solutions for accelerating DNNs by utilising the sparsity analysed in Section 6.

Modern CPUs can achieve peak performance through SIMD extensions. With the support of SIMD instructions, a single processor performs the same operation simultaneously on multiple pieces of data by loading multiple values into wide vector registers (e.g., YMM registers) that are processed together (Cardoso et al. 2017). For example, AVX2 provides sixteen YMM registers that hold, and operate simultaneously on, eight 32-bit single-precision or four 64-bit double-precision floating-point numbers per CPU cycle. The more advanced AVX-512 handles up to 512-bit data. In addition, these instruction sets introduce fused multiply-accumulate (FMA) operations (Nasiri et al. 2014), which perform a vectorized multiplication and accumulate the results into another vector register in the same CPU cycle. SIMD instruction sets are easily accessed from C and C++ through API extensions referred to as intrinsic functions, or intrinsics, which are supported by the most commonly used C and C++ compilers [82]. Thread-level parallelism is explicitly exposed to improve overall processor performance through the use of multiple threads of execution. These techniques have been utilised by oneDNN (formerly MKL-DNN) to optimise DNN operations.

Computation and memory efficiency are two vital factors for hardware to process data; DNNs are no exception. In this section, as the convolution computing predominates DNN operations, the operation and the corresponding data format that oneDNN utilises on CPUs are detailed.

7.1 Data format

Computation is mostly about data: reading, processing, storing, etc. In the convolution process, the feature maps and weights/filters require appropriate formats in memory to leverage the SIMD instruction sets. This section describes the data formats used for DNNs.

Fig. 19
figure 19

Data format [25]

Figure 19 shows the data format of the feature maps. The activations consist of four dimensions: batch (N), channels (C), height (H), and width (W). The format in which the feature maps are stored in memory depends on the order of the four dimensions. NHWC, NCHW, and CHWN are commonly used formats and can be created with MKL-DNN’s built-in functions [25]. In this paper, activations are laid out in memory in the NCHW format, which means the input and output are 4D tensors with batch as the outermost and width as the innermost dimension. In addition, it should be noted that the batch index n is zero for the single-input inference phase. To access the data in memory, it is necessary to define how to map the 4D tensor to a 1D tensor via an offset function that takes a logical index (n,c,h,w) as input and returns the address displacement of the value’s location [25]. Taking the NCHW format as an example, the function can be described as:

$$\begin{aligned} offset\_nchw(n,c,h,w) = n\times C\times H\times W + c\times H \times W + h\times W + w \end{aligned}$$
(2)

where w is the innermost dimension: elements that share the same n, c, and h indices and whose w indices differ by one are adjacent in memory (this holds only for non-border elements). Conversely, n is the outermost dimension: reaching the same pixel (c,h,w) of the next feature in the batch requires jumping over a whole image of size C*H*W.
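Equation (2) translates directly into the small helper below (a sketch with an illustrative 3×224×224 input), which makes the memory-adjacency argument explicit.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear displacement of element (n, c, h, w) in an NCHW layout (Eq. 2)."""
    return n * C * H * W + c * H * W + h * W + w

# Neighbouring w indices are adjacent in memory, while stepping c (or n)
# jumps over a whole H*W plane (or a whole C*H*W image).
print(offset_nchw(0, 0, 0, 1, C=3, H=224, W=224))   # 1
print(offset_nchw(0, 1, 0, 0, C=3, H=224, W=224))   # 50176 = 224 * 224
```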

7.2 Convolution computing

Convolution operations in DNNs correspond to a series of multiplications and accumulations. Taking the first convolutional layer of VGGNet as an example, the key AVX2 instructions (AT&T syntax) that perform the convolution computation are shown in Fig. 20. To fully exploit SIMD instructions, the process can be divided into three steps, and the basic computing unit calculates eight elements that correspond to the same location in eight output channels simultaneously in one cycle. The SIMD instructions load feature-map elements and weights into vector registers, and then perform the multiplication and accumulation with FMA instructions.

Fig. 20
figure 20

SIMD computing of convolution

S1: Using the “vbroadcastss m32, \(\%\)ymm1” instruction to broadcast the feature-map element (e.g., R11) in the memory address to the eight locations in the YMM register.

S2: Using the “vmovups \(\%\)ymm2/m256, \(\%\)ymm1” to move eight weights (the eight values corresponding to the eight filters: r111 to r811) in the memory address to the eight locations in the YMM register.

S3: Using the “vfmadd231ps \(\%\)ymm3/m256, \(\%\)ymm2, \(\%\)ymm1” instruction to multiply the eight packed single-precision floating-point values in the first source operand and the second source operand, adding the intermediate result to the third source operand, performing the rounding operation, and storing the result values to the third source operand, i.e., the destination operand.

The process from S1 to S3 is iterated 27 times to obtain the final results m1 to m8, because the input has three RGB channels and the kernel size is [\(3 \times 3\)].
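The following NumPy sketch emulates the S1–S3 loop to make the data flow explicit (an illustration of the computing pattern, not the generated JIT code): one 8-wide vector holds the broadcast feature-map element (S1), another holds the weights of the same kernel position taken from eight filters (S2), and an FMA-style update accumulates into the eight partial results m1 to m8 (S3).

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 3, 3)).astype(np.float32)       # one 3x3 RGB input window
weights = rng.standard_normal((8, 3, 3, 3)).astype(np.float32)  # 8 filters of shape 3x3x3

acc = np.zeros(8, dtype=np.float32)                 # the 8 accumulators m1..m8
for c in range(3):                                  # input channels (RGB)
    for kh in range(3):                             # kernel rows
        for kw in range(3):                         # kernel columns -> 27 iterations in total
            ymm1 = np.full(8, patch[c, kh, kw], dtype=np.float32)   # S1: vbroadcastss
            ymm2 = weights[:, c, kh, kw]                            # S2: vmovups (8 weights)
            acc += ymm1 * ymm2                                      # S3: vfmadd231ps

# same result as computing the eight output-channel dot products directly
reference = np.tensordot(weights, patch, axes=([1, 2, 3], [0, 1, 2]))
assert np.allclose(acc, reference, atol=1e-5)
print(acc)
```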

The analysis in this section provides in-depth knowledge of the convolution operation for DNNs as well as of the related data format that fits this computation on modern CPUs. For SIMD CPU architectures, (Li et al. 2021) proposed skipping the meaningless computation associated with dynamic sparsity using branch instructions at the SIMD instruction level. However, due to the branch misses introduced by the proposal, the method achieved only a \(10.26\%\) performance improvement on ImageNet for certain convolutional layers in the inference process. Although the computing scheme targets the SIMD-CPU architecture, it also provides a significant reference for hardware-level accelerator design, for example, FPGA- and ASIC-based designs.

8 Conclusion

This paper conducts a systematic, comprehensive, and thorough inference-workload study of prominent DNN architectures. First, the kernel-level analysis, the instruction-level convolution computing mechanism, and the corresponding data structure provide in-depth knowledge of the characteristics of DNNs on SIMD CPUs. This indicates promising research directions, with precise critical points, for the future development of DNN accelerators, especially prospective hardware solutions such as FPGA-based designs and heterogeneous accelerators. Second, the layer-wise time performance of DNNs offers exhaustive details about the time cost of each layer and the possible time-consumption bottlenecks of the networks. Third, the layer-wise dynamic activation sparsity provides researchers with the precise layer-wise redundancy of DNN networks; the data also proves conclusively that DNNs are heavily over-parameterised with massive redundant computations. In addition, although the research is conducted on a SIMD-CPU architecture with PyTorch, the findings are widely applicable to all machine learning frameworks and various hardware platforms including FPGAs, GPUs, and ASICs. To summarise, the research offers significant references for optimising and accelerating the inference workload of DNNs at the instruction level, by addressing the time-consuming bottlenecks and eliminating the redundant computations associated with feature-map sparsity, at both the hardware and software levels.