An architecture-level analysis on deep learning models for low-impact computations

Deep neural networks (DNNs) have made significant achievements in a wide variety of domains. For the deep learning tasks, multiple excellent hardware platforms provide efficient solutions, including graphics processing units (GPUs), central processing units (CPUs), field programmable gate arrays (FPGAs), and application-specific integrated circuit (ASIC). Nonetheless, CPUs outperform other solutions including GPUs in many cases for the inference workload of DNNs with the support of various techniques, such as the high-performance libraries being the basic building blocks for DNNs. Thus, CPUs have been a preferred choice for DNN inference applications, particularly in the low-latency demand scenarios. However, the DNN inference efficiency remains a critical issue, especially when low latency is required under conditions with limited hardware resources, such as embedded systems. At the same time, the hardware features have not been fully exploited for DNNs and there is much room for improvement. To this end, this paper conducts a series of experiments to make a thorough study for the inference workload of prominent state-of-the-art DNN architectures on a single-instruction-multiple-data (SIMD) CPU platform, as well as with widely applicable scopes for multiple hardware platforms. The study goes into depth in DNNs: the CPU kernel-instruction level performance characteristics of DNNs including branches, branch prediction misses, cache misses, etc, and the underlying convolutional computing mechanism at the SIMD level; The thorough layer-wise time consumption details with potential time-cost bottlenecks; And the exhaustive dynamic activation sparsity with exact details on the redundancy of DNNs. The research provides researchers with comprehensive and insightful details, as well as crucial target areas for optimising and improving the efficiency of DNNs at both the hardware and software levels.

The computation of DNNs can be split into two main phases: training and inference Sebastian et al. (2019). The training phase is the process of creating the parameters of a network by learning from the data, such as weights of convolutional layers, and then finetuning the model by paring and optimising it. The inference phase is the deployment of the trained neural network model. The demands of deep learning inference increase sharply as the performance of the models is improved significantly and their applications are popularised greatly. Previous research reported that a significant fraction of future computation demands will come from the workloads corresponding to deep learning inference (Park et al. 2018;Jeong et al. 2022).
GPUs have proved their superior capabilities in DNN computations and played an integral role in the development of deep learning for their generic high parallelism capability required for deep learning models (Kim et al. 2017;Mittal and Vaishay 2019). These play an essential role in improving the performance of DNN tasks. CPUs, on the other hand, are a preferred platform for DNN applications, as they continue to evolve around semiconductor design, semiconductor manufacturing technology, and CNN software frameworks . Modern CPUs include multiple techniques to improve their performance and speed up the processing of data, e.g., thread-level parallelism (Flautner et al. 2000), SIMD, etc. In addition, the one-API Deep Neural Network Library (oneDNN), which is a highperformance library of basic building blocks for deep learning applications, has also been distributed to help developers improve productivity and enhance the performance of deep learning applications on CPU platforms [25]. Furthermore, on x86-64 CPUs, the instruction set architecture (ISA) is detected by oneDNN automatically at runtime, and just-intime (JIT) code generation technique is adopted to generate optimised code to exploit the latest features of the ISA [25]. With the support of oneDNN, CPU performance for DNNs is greatly enhanced.
For embedded systems, such as mobile devices and robotics, the hardware resources are quite constrained while the application efficiency is highly demanded (Mazzia et al. 2020). Neural networks require massive computational and memory resources, which make them hard to be deployed in such environments, including lightweight architectures. Thus, optimising DNNs has been a major research topic (Chung and Abdelrahman 2020;Lee et al. 2020;Mazumder et al. 2021) and has made a series of progress, such as accelerating multi-DNN workloads based on heterogeneous dataflow Kwon et al. (2021), compressing DNNs with vectorized weight quantization Gong et al. (2021), the delay-aware DNN inference technique Li et al. (2021), the error compensation for low-voltage DNN accelerators Ji et al. (2021), pruning of redundancy for DNNs ; Ahn and Kim (2022); Camci et al. (2022), etc. Although the performance of DNNs has been greatly improved with the support of various strategies including these optimization techniques as well as high-efficiency libraries like oneDNN, there is still much room for further performance improvement. Actually, a previous work (Goel et al. 2020) has revealed that many computations in DNNs are actually low impact, and the feature-map sparsity has not yet to be utilised sufficiently. In particular, CPU-based DNN applications have not been fully researched, and CPU features are still far from being fully exploited for DNN workloads. In this paper, we conduct a systematic research on the DNN inference workload concerning a set of prominent networks with an emphasis on the SIMD CPU architectures. The study includes: -The kernel-instruction level insights of DNNs for the inference process on CPUs with the SIMD feature, including the number of instructions, branches, branch miss predictions, cache misses, etc. As well as the overall characteristics of DNNs, including the network architectures, model size, quantity of parameters, and floating-point operations (FLOPs) of the networks. -The layer-wise time performance with definite time consuming bottlenecks of DNN architectures. The characteristic property is up to the exact workload of the network architecture and is closely related to the features of the hardware. -The layer-wise dynamic activation sparsity which reflects the exact layer-by-layer redundancy, as well as providing a clue for enhancing the sparsity. The feature is completely up to the architectures of the DNN networks. -The fundamental convolution computing mechanism at SIMD instruction level, which provides thorough knowledge of convolution operation on CPUs for researchers.
The main contributions of the paper are providing guidelines at both the hardware and software levels for optimising DNNs, with emphasis on SIMD CPUs as well as widely applicable for other hardware platforms including FPGA, ASIC, and GPUs. The study discovers several interesting findings on the characteristic property of DNN's inference process and provides insightful references on identifying performance bottlenecks of DNNs. Furthermore, it also presents some positive feedback on designing new hardware architectures using FPGA or heterogeneous computation units to accelerate DNN performance, and the study is widely applicable. The rest of the paper is organised as follows: In Section 2, we provide the detailed architecture and characteristics of deep learning models studied in the paper at first. Then, we make a detailed discussion on the reason that CPUs outperform GPUs for inference workload of DNNs in many aspects. In Section 3, we introduce the experimental methodology. In Section 4, we analyse the characteristics of CPUs with the SIMD feature. In Section 5, we study the layer-wise time performance which also provides possible time-consuming bottlenecks. In Section 6, we investigate the sparsity with the layer-wise redundancy of DNNs in detail and provide a hint for enhancing the input sparsity of the convolutional layers. In Section 7, we present the principle of convolution operation in CPUs with SIMD instructions. And finally, we conclude this paper in Section 8.

Related work
In this section, we first illustrate the precise structures of DNNs and the general features of the state-of-the-art DNN architectures studied in the paper. The aspects in which CPUs outperform GPUs for inference workload are then discussed in depth.

Deep neural networks
The DNN architectures studied in the research can be simply classified into two categories: modern and lightweight, which refer to seven significant DNN architectures in summary. Modern DNNs include VGGNet (Simonyan and Zisserman 2015), InceptionNet Ioffe and Szegedy July 2015;, ResNet (He et al. 2016) and DenseNet (Huang et al. 2017); And lightweight DNNs refer to the lightweight architectures of ShuffleNet V2 (Zhang et al. 2018;Ma et al. 2018 (2015), DETER Carion et al. (2020), are all on the basis of these DNN architectures. Hence, the research in the paper focuses on the seven DNN architectures. Since each architecture has multiple models with different versions, different channels or layers, etc., only the representative models are taken for every architecture in the experiments. The detailed architectures of the DNNs are analysed as follows.

AlexNet
AlexNet was the first deep neural network that achieved considerable accuracy. The model won the 2012 Image LSVRC-2012 challenge by reducing the top-5 error rate to 15.3% , which is far lower than the runner-up of 26.2% . AlexNet is similar to LeNet except for its depth with more filters per layer and stacked convolutional layers, as Table 1 shows.

VGGNet
The VGGNet architecture is an improvement over AlexNet as it employs a small receptive field, i.e., multiple [ 3 × 3 ] size kernel filters one after another instead of large kernel-sized filters (AlexNet uses [ 11 × 11 ] and [ 5 × 5]). The depth of the architecture is pushed to 10-19 weight layers. VGGNet, as well as InceptionNet, demonstrates that the depth of CNNs with multiple stacked small size kernel filters is a vital factor for network performance. In theory, it enables to learn more complicated patterns by increasing the depth of the network with multiple non-linear layers. The details of the network are presented in Table 2.

InceptionNet
Although increasing both the depth and width of layers improves the performance of neural networks, there are also drawbacks: the enlarged network means a massive number of parameters and dramatically increases computational resource usage. With consideration ReLU, FC ReLU,FC-1000 of these problems, Inception architectures with multi-level feature extractors were proposed Ioffe and Szegedy July 2015;Szegedy et al. , 2017. The computational resources inside a network are optimised in the Inception architectures as its depth and width increase. Fig. 21 and Fig. 22 in Appendix detail the block architectures. The modules act as a "multi-level feature extractor" by computing [ 1 × 1 ], [ 3 × 3 ], [ 5 × 5 ], [ 7 × 7 ], etc, convolutions within an Inception module, which enables the network to acquire multifold information. The outputs of these filters are then concatenated and passed to the next layer.
In this paper, GoogLeNet and Inception v3 are used to analyse the performance of Inception architecture. However, the GoogLeNet studied in the research differs from that of  with optimization, as Table 3 shows. The architecture of Inception v3 is shown in Table 4.

ResNet
As neural networks get deeper, a degradation problem rises: accuracy becomes saturated and degrades rapidly during the training process (He et al. 2016), which indicates that DNNs are more difficult to train. (He et al. 2016) proposed the deep residual learning framework to address the problem by introducing "residual learning". As shown in Fig. 23 in Appendix, the proposal utilizes the "shortcut connections" that jump over some layers with an identity function to avoid the vanishing gradients.  (Huang et al. 2017) proposed the Dense Convolution Network(DenseNet) to strengthen the feature propagation mechanism by exploiting the potential of feature reuse: for each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. The mechanism is detailed in the following equation.

DenseNet
where x l denotes the feature-maps of the l th layer, H l is the function of operations such as Convolution, Batch Normalisation, rectified linear function (ReLU) and Pooling. [x 0 , x 1 , ..., x l−1 ] refers to the concatenation of the preceding layers' outputs.
(1)  Table 6 details the architecture of the network. The method alleviates the problem of exploding gradients, substantially reduces the number of parameters, and strengthens feature propagation.

Lightweight architectures
Lightweight architectures are designed for resource-constrained hardware platforms including mobile devices, self-driving cars, and so on. Designing lighter and efficient architectures is increasingly urgency, for the reason that the aforementioned massive computational and memory demands of DNNs make them difficult to be deployed on resource-limited platforms. This section analyses three state-of-the-art lightweight architectures.
MobileNet (Howard et al. 2017), designed by Google, is based on a streamlined architecture which utilizes depth-wise separable convolutions to reduce the model size and complexity. Table 7 shows the detailed architecture with Fig. 24 in Appendix detailing the block structure. The method replaces the computationally expensive convolutional layers with a cheaper depthwise separable convolutional layer, i.e., a [ 3 × 3 ] depthwise convolutional layer followed by a [ 1 × 1 ] convolutional layer.
ShuffleNet (Zhang et al. 2018;Ma et al. 2018) utilises pointwise group convolution and channel shuffle to optimise the computation costs while maintaining accuracy. Table 8 shows the architecture of the network with Fig. 25 in Appendix details the Block structure. The computation complexity of the costly dense [ 1 × 1 ] convolutions   (1) 56×56 Convolutional layer 1×1 7 × 7 global average pool 1000D fully-connected, softmax are reduced by adopting pointwise group convolutions, and the side effects caused by group convolutions are overcame by utilizing the channel shuffle operation. The hyperparameter, i.e., the last two characters of the two models, is called the depth multiplier, which sets the channels used by the convolutional layers. MnasNet was proposed as an automated mobile neural architecture search approach to find the best suitable structure(Howard et al. 2017). Fig.1 represents the detailed architecture of the network. It managed to balance between the accuracy and latency, and the architecture incorporates model latency into the main objective to identify a model that achieves a good trade-off.

CPUs for Inference Workload of DNNs
GPUs play powerful and irreplaceable roles in artificial intelligence for improving the performance and speed of the training process of neural networks with the nature of parallel processing ability. However, GPUs taking SIMD processing units with the Single Instruction Multiple Thread (SIMT) model are exactly for compute-intensive tasks. They bring their parallel computing ability into full play only when processing huge amounts of data and computations, i.e., the training process of DNNs with massive training data, large batch size, and demanding for enormous calculations. In terms of the inference phase, it generally refers to a frequent or continuous process with few data to be processed each time, i.e., inference workload is parallelism limited that cannot be subdivided into parallelizable processes and needs to be executed in sequence. And the application also demands high strict real-time ability such as an auto-driving system. As for the inference workload of DNNs, CPUs outperform GPUs in many aspects at the moment (Park et al. 2018;xxxx 2018). As the hardware platforms for inference workload mainly are data centers and terminal devices, we make a detailed analysis on the two platforms in this section.

Inference workload on data centers
Data centers demand a high degree of fault-tolerant ability and reliability, sufficient computing availability and flexibility, and adopts powerful CPUs. As for the machine learning inference workload on the data center, first, the inference tasks are diverse for different tasks; Second, the types of data and tensor shapes vary as data comes in a steady stream from the internet; Third, most inference workloads demands low latency. For this case, we make a performance comparison on the inference workload of multiple DNN models for processing a random task request based on CPUs and GPUs, i.e., processing one input which simulates the demand for the real time whenever there is a task request. The overall running time for the task and the inference time for DNNs processing the input are taken as the index. Table 9 shows the experimental results. In terms of the inference time, the GPU is far beyond the CPU: the inference time of DNNs on the GPU varies from 3.2× to 54.2× faster than that on the CPU. Nervertheless, the time cost of a task for DNNs, i.e., the running time, is far more than the inference time of the DNN network.
The running time of a task for DNNs consists of the model load time which is a heavy burden in such conditions, the preprocessing for inputs, the model inference time for the network processing the single input, the post processing, and so on. According to the experimental results, the running time of modern DNNs on GPU varies from 1.9× to 7× more than that of CPU; As for the lightweight architectures, especially for the ShuffleNet V2 _ 1.0, the running time on GPU is 16× than that of CPU. It is obvious that the CPU outperforms the GPU in such cases. The data strongly proves the efficiency of CPUs for DNNs in terms of such cases, i.e., the inference workload with parallelism limited and demanding for high real time.
Considering these factors, although data centers are also equipped with GPUs, GPUs are incompetent or inadequate in such cases with limited parallelism. And taking the Facebook data centers for example, it is exactly CPUs that undertake their most inference workloads (Park et al. 2018).

Inference deployment in terminal devices
Terminal devices including robotics, auto-driving systems, mobile devices, embedded devices, edge devices, etc., constitute a significant amount of applications in various domains for DNNs inference tasks. First, there are cases where CPUs are the only available resources for embedded systems such as Raspberry Pi system and Single-Chip Microcomputer system. Second, in terms of mobile devices, GPUs are not the best choice for the reason that not all machine learning computations are supported by GPUs which results in extra CPU-GPU co-processing overheads and is more time consuming (Mittal et al. 2021). (Wu et al. 2019) indicates that only 11% of the smartphones have a GPU that is 3 times more performance than its CPU, more than 40% of the smartphones have CPUs outperform its GPUs. The theoretical peak performance of mobile CPUs has little difference from mobile GPUs (Wu et al. 2019). CPUs provide no less even better performance than GPUs, especially for the lightweight architectures (Kim et al. 2019). Third, the limited memory bandwidth capacities restricts greatly the performance of GPUs in mobile devices for which has no high-bandwidth memory like discrete GPUs. Fourth, GPUs cost more energy, which is also a huge burden for power-limited mobile devices (Fong et al. 2017). In addition, the usability and programmability of mobile GPUs are fragile (Wu et al. 2019).
As a matter of fact, for the machine learning inference at Facebook, only a small fraction of inference currently run on mobile GPUs (Wu et al. 2019), and mobile services occupy no less than 90% of its advertisement revenue [59]. Research on optimizing the inference of DNNs for edge devices also has always been making progress Patel et al. (2022): Mobile-Dets outperform MobileNetV3+SSDLite by 1.7 mAP for the inference latencies on mobile CPU Xiong et al. (2021);de Prado et al. (2021) proposed an automated exploration framework which achieves 4× improvement in performance and over 2× reduction in memory compared to the BLAS floating-point implementation. As a result, CPUs are better choices for inference workloads on terminal devices in many cases.

Summary
As for the inference tasks, considering the data as well as the network loading overhead of GPUs (Ma et al. 2019), data transferring overhead between GPUs and CPUs, the energy consumption of GPUs, the flexibility of inference tasks with limited parallelism, the high latency of GPUs, GPUs are incompetence compared with CPUs. CPUs are more suitable for deep learning inference workload in many cases (Mittal et al. 2021;Kim et al. 2019) and numerous studies have been made to optimize and accelerate DNNs on CPUs (de Prado et al. 2021;Kim et al. 2020;Low et al. 2020;Putro et al. 2021). Thus, in this paper, we focus on the study of DNNs inference performance on CPUs.

Premilinary
This section provides the preliminary of the experimental setup. In terms of the machine learning frameworks, there have been multiple representative platforms, such as Tensor-Flow, PyTorch, Keras, etc. For the reason that the experiments in the study are up to the architecture of the network and have nothing to do with the machine learning frameworks, we simply adopt PyTorch (version 1.7.0a0) in this paper.
The hardware platform is a Dell OptiPlex 700 equipped with an Intel i7-9700 CPU at 3.00GHz and 2 ×32GB DDR4 main memory. The operating system of our experimental platform is Ubuntu 20.04. The experiments adopt the pretrained networks on the ImageNet ILSVRC 2012 dataset in PyTorch. The training dataset does not influence the analysis of the Characteristics concerning DNNs on a CPU architecture with SIMD in Section 4 and layer-wise time performance in Section 5. The experimental results of Section 4, Section 5, correspond to the inference process of one input which are averaged from 1000 inputs. In the experiments, the input images are unified to be RGB channels and cropped from the centered [ 224 × 224 ] for input consistency across all networks. The inputs are also preprocessed with the mean and standard deviation normalization.

Characteristics of DNNs on SIMD-CPU Architecture
SIMD instructions have been typically supported by various CPU architectures. For example, Intel's Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX), AVX2, and AVX512, which are also followed quickly by AMD except for AVX512; ARM's Neon technology; AltiVec which is designed and owned by Apple, IMB, and Freescale Semiconductor. In this section, the characteristics of DNN's inference process are analysed on the SIMD-CPU architecture by utilizing the profiler tool Perf for Linux 2.6+ based systems to gather the performance counter statistics from it. Table 10 shows the experimental results, which include FLOPs, the number of instructions (num. of inst.), branches, mispredicted branches (branch misses) and cache misses. Furthermore, the inference result indexes are also listed, such as the top-1 accuracy and inference time. The network sizes and parameter numbers are also listed in Table 10 for their importance.
As DNNs differ greatly in exploiting inter-operation parallelism, thread-level parallelism is disabled in the experiments to achieve a fair comparison for different networks.
The valuable findings from Table 10 are as follows: FLOPs: FLOPs indicate the complexity of the models. The index is hardwareindependent and depends on the architecture of the network. FLOPs are the main workloads of DNNs and the computing concerning convolution operation is highly optimised and accelerated with SIMD instructions, which is detailed in Section 7. As the table shows, VGG19 _ BN has relatively fewer layers with only 13-19 layers, but the architecture takes the most inference time of 256.03-384.84 ms with the most instructions of 2577.19-4302.47 million and FLOPs of 11.44-19.77 G. ShuffleNet V2 _ 1.0 takes the least inference time of 9.57 ms with the least instructions of 55.34 million and FLOPs 0.05 G, and the size of ShuffleNet V2 _ 1.0 is also the smallest.
Num. of inst.: The number of instructions for DNNs is not only related to the architecture of the network, but also the characteristics of CPU platforms, i.e., the instruction set supported by the CPU. The paper gives the exact number of instructions for various DNNs in this section. As the instructions are mostly about the FLOPs, then the two indexes are closely correlated with each other. It is obvious that VGGNet architecture takes the most instructions and the lightweight networks take the least.
Branches and branch misses: For CPUs, branch predictors play a critical role in achieving high performance, and the number of branch misses is a vital factor for modern deeply pipelined microprocessor architectures (Meng et al. 2012). Branch instructions and misses are major obstacles to pipelining processes. Models with fewer branch instructions and misses generally perform relatively well. From these models, GoogLeNet has the maximum number of branch misses. Lightweight architectures benefit from fewer branch instructions and misses. However, the branch miss rate of ShuffleNet V2 _ 1.0 is several times higher than that of others: 1.04×-2.19× . Also, MobileNet V2 has the lowest branch miss rate. The two indexes together provide the importance and performance of branch predictors of current CPUs.
The branches per kilo instructions (BPKI) of lightweight neural networks are much higher than those of other models in general. ShuffleNets V2 _ 0.5, which has the least instructions, has the most BPKI with a value of 189.67, which is about ten times more than those of VGG19.
Lightweight models have a larger number of misses per kilo instructions (MPKI) with a value of no less than 2, except for MobileNet V2 with a value of 0.99.
Cache misses: CPU caches play a critical role in mitigating the performance gap between the CPU and main memory (Yu et al. 2001). A cache miss occurs when the requested data cannot be found in the cache and significantly slows down the overall execution process. The index is closely related to the size of the models, the number of parameters, the amount of data processed, etc. Therefore, models with a larger size and more parameters often have more cache misses. As the table shows, lightweight models with smaller sizes and fewer parameters have relatively better performance. For example, VGG19 _ BN and ShuffleNet V2 _ 1.0 have 22.00 million and 1.22 million cache misses, respectively.
Model size and parameters: The size and parameters are important for DNN models and depend only on the architectures of DNNs. In general, models with smaller sizes and fewer parameters take less computation and memory resources, as well as are always more computationally and memory efficient. However, this is not always the case, the exact architecture of the networks has a big effect on it. For example, Although the model size of MnasNet10 is 24.26% more than that of MobileNet V2, the time cost of MnasNet10 is 17.14% less than that of MobileNet V2. The reason is that the architecture of MnasNet10 is less complex than that of MobileNet V2, i.e., as is detailed in Table 10, MnasNet10 has 8.58% fewer instructions, 4.43% fewer branches, 7.69% fewer branch misses, and 19.93% fewer cache misses than MobileNet, which are up the architecture of the network.
Although AlexNet and VGGNet have much fewer layers, i.e., 5 and 13-16 convolutional layers for AlexNet and VGGNets, respectively, the model size is much larger than that of others, i.e., VGG19 has a size of 548 MB, which is about 18 times larger than that of DenseNet121, which has 120 convolutional layers. Since the three lightweight models are designed mainly for resource-constrained devices, the models have a smaller model size and fewer parameters, i.e., no more than 17 MB and 5 million, respectively, which is far lower than those of other models. It can be seen that the size and parameters of DNNs have little relation to the layers of the model.
The indexes above altogether determine the running time of the networks. This section provides developers with a thorough CPU kernel-instruction level performance analysis of DNNs as well as widely applicable references for various hardware solutions for DNNs. Although the investigation is conducted on SIMD-CPU architectures, the hardware irrelevance indexes, i.e., FLOPs, model size, and parameters, provide widely applicable information for various hardware platforms.
As for the hardware-related indexes, the num. of inst., branches and branch misses, and Cache misses, provide developers with a thorough understanding of various DNNs' inference phase at a kernel-instruction level for CPU architectures, with detailed performance differences concerning state-of-the-art DNN architectures. The investigation also indicates possible research directions. For example, except for the lightweight architectures, the branch misses are still very high for other DNNs, by making further studies to reduce branch misses which is key for pipeline processes may greatly speed up the inference process. It also needs to be noted that there are major differences between CPUs and other hardware platforms. For example, prevalent GPUs do not have branch predictors, the cache operation of GPUs is quite different from that of CPUs, FPGA, and ASIC are integratedcircuit solutions that are completely different from both CPUs and GPUs. However, the hardware-related indexes also provide more or less hints for other hardware platforms. For example, the current branch prediction technique is general for various tasks, while the specific branch predictor for DNNs may be more efficient. Then the hardware-based accelerator design such as FPGA and ASIC, with the specific branch predictor may effectively improve the inference process by utilizing the dynamic activation sparsity analysed in Section 6.

Performance Analysis of Layers
The running time is crucial to evaluate the efficiency of DNNs. The exhaustive time performance analysis of DNN architectures provides a well-grounded knowledge of DNNs concerning time consumption, as well as accurate information for considering the time efficiency to design a high-performance network. A DNN typically consists of multiple layers of artificial neurons that correspond to convolutional layers, non-linear scalar operator layers, down-pooling layers, FC layers, etc. In this section, a layer-wise time performance analysis is conducted on DNNs by utilizing the profiler tool which has been integrated into PyTorch to profile each CPU kernel implementation to gain insights into neural network performance. The PyTorch profiler provides the performance metrics for the training and inference process, including the input shapes and stack traces, kernel activity, and the execution trace.  Table 1. The five convolutional layers account for about 47% of the inference time, the last three FC layers take more than 47% of the inference process with the first FC layer consuming more than 32%. As for the rest layers, the time cost is negligible. Figure 3 shows the VGGNets layer-wise time consumption: the three versions differ only in the number of convolutional layers. The convolution operation can be divided into five stages from conv1 _ x to conv5 _ x in accordance with the input and output. For the first four stages, the first convolutional layer has fewer input channels compared with the other convolutional layers within the stage, resulting in a reduced time consumption for the layer.

VGGNet
In summary, convolution operations predominate the time cost of the inference process, which account for up to 80% . The layer-wise time cost is consistent for the convolutional layers which have the same input and output shape. For example, the last three convolutional layers of the stage conv3 _ x and conv4 _ x, and the four convolutions layers of conv5 _ x of network VGG19. The time cost of the max-pooling and full-connected layers together is around 22% of the inference process, and the time cost of other layers is negligible.
Despite concerns that VGGNets demand a longer time for training and require more weights that lead to a heavy architecture as Table 10 shows, the architecture is still a favoured choice on many occasions for its superiority in feature extraction and transfer learning tasks. Figure 4 shows the layer-wise running time performance of GoogLeNet and Inception V3. The patterns shown in the figure represent a certain periodicity in the stacked "Inception modules" architecture. The layer sequence shown in the figure, which corresponds to the block, is shown from left to right and bottom to top. In addition to the convolution operations, the thirteen layers with a percentage higher than 0.5% in GoogLeNet and the ten layers with that greater than 1 % in Inception v3 are all max-pooling layers. As a result, the pooling operation occupies a large percentage of the inference process, which is quite distinct from the VGGNet architecture. Figure 5 demonstrates the layer-wise time performance of ResNet34 and ResNet101. Except for the only max-pooling layer for the architecture that has the highest time cost than other layers in terms of a single layer, the time cost of convolution computing is nearly 90% of the inference process. Also, the time consumption for ResNet architectures changes periodically, which agrees well with the residual block in Fig. 23. The three downsample layers take little time in ResNet34 with no more than 0.5% of the whole time. As for   Figure 6 shows the layer-wise time performance of DenseNet architecture: the fourth bar in the histogram represents the time consumption of the only max-pooling layer for the architecture with the highest value in terms of single-layer time cost. The four groups of bars from DB1 to DB4 with values increasing linearly refer to the pointwise convolutional layers as their input channels increase linearly. The bars in the four groups with equal values reflect the convolutional layers with [ 3 × 3 ] size kernels. The regularities of the figure reflect the architecture of the network to a great extent.

Lightweight Architectures
This section investigates the layer-wise time performance of the three lightweight architectures.
MobileNet: The layer-wise time consumption of MobileNet V2 is shown in Fig. 7, with convolution computing accounting for about 78.65% of the inference time. In addition to the last FC layer, It needs to be noted that the 10th layer, which is the Batch-Normalization operation in bottleneck 2, takes a relatively high single-layer time consumption. This characteristic is quite different from other DNN architectures. ShuffleNet: In terms of ShuffleNet V2_1.0 architecture, the layer-wise time cost performance shown in Fig. 8 represents the characteristics corresponding to the block shown in Fig. 25. In terms of a single layer's time cost, the only max-pooling layer takes the highest percentage with a value of 12.40% for the network. The last fully-connected layer takes a relatively high percentage in time cost.
MnasNet: The layer-wise time consumption is shown in Fig. 9, with convolution computing accounting for about 79.28% of the inference time. The time cost of the first five and the last two convolutional layers is higher than that of the other convolutional layers. The last fully-connected layer takes the highest percentage of time cost; The time cost of the other layers is negligible.  1 3

Summary
The overall time cost of convolution operations, as well as the percentage to the model inference time of the aforementioned DNNs, are detailed in Table 11, and the results are also averaged from 1000 inputs. Based on the analysis above, conclusions can be drawn that the convolution operation predominates the inference workloads of DNNs. With the exception of individual models like AlexNet, GoogLeNet, and shuffleNet V2 that have relatively low time consumptions of 42.30% , 50.02% , 50.09% and 63.33% , respectively, the other networks have a much higher percentage of time consumption from 72.53% to 90.34% . Besides convolutional layers, max-pooling and fully-connected layers have a relatively high time cost for InceptionNet, ResNet, DenseNet, and so on, and the time consumption for layers such as ReLU layers is negligible. For this discovery, we also make further analysis. Taking ResNet34 as an example, table 12 shows the FLOPs, Branches, Branch misses of the first four layers, i.e., convolutional layer, BN layer, ReLU layer, and Max-pooling layer. It is obvious that the convolutional layer takes far more FLOPs than others, which is 73× -290× more than that of the other three layers. As a result, convolution operation dominates the time cost of DNNs, and the FC layer which takes similarity operation as convolution computing also consumes more time than others. As for the Max-pooling layer, although FLOPs are little compared with the convolutional layer and the number of branch instructions is only about 1.6× more than that of the convolutional layer, the branch misses of the Max-pooling layer is about 13× more than that of the convolutional layer. It is well known that branch instructions and misses are major obstacles to pipelining processes which greatly reduce the efficiency of CPUs. As a result, max-pooling computing takes comparatively much more time in terms of single-layer time cost. In addition, the time cost of other layers such as ReLU and BN layers within DNNs is negligible. The study in the paper especially provides researchers with valuable details on the effect caused by the branch misses, the importance of branch misses for CPUs.
The layer-wise time performance analysed in this section provides thorough details on time consumption of various deep learning architectures, with possible time-consuming bottlenecks for DNNs. Although the study in this section is conducted on CPUs and the analysis is closely related to the hardware characteristics, the experimental results also provide more or less reference for other hardware platforms including GPUs, FPGA, and so on. For the reason that it is exactly the workload of each layer which is up the architecture of DNNs that determines the layer-wise time performance, and the workload is same for all hardware platforms.

Dynamic sparsity investigation
Computation intensity is a heavy burden, especially for the hardware with constrained resources such as embedded devices. As has been analysed in Section 5, it is exactly the convolutional operation with a mass of computation that takes up the most workload of DNNs. As a result, reducing the intensity of convolutional computation has been an important research topic. The innate characteristic of DNNs, i.e., sparsity, which is defined as the fraction of zeros in the parameters and input activation matrices in this paper, reflects that DNNs are heavily over parameterized and have massive redundant computations which are meaningless for the network. Thus, sparsity emerges as a quite effective solution for the reason that the operations concerning sparsity are meaningless and can be removed to greatly improve the speed of the network (Guo et al. 2020; Gong et al. 2020). The sparsity can be simply classified into two categories (Zhou et al. 2018): static and dynamic. Static sparsity means the sparsity in the parameters of the DNN models (i.e., weight sparsity), while dynamic sparsity is introduced when the value of the neuron elements becomes 0 and does not contribute anything to the follow-up neurons. For the reason that the static sparsity is training phase determined characteristics with various extra algorithms to sparse the parameters that belong to the research of training process. Then, in this paper, we focus on the analysis of the layer-wise dynamic activation sparsity of the convolutional layer inputs. The experimental results in this section are averaged from 100 inputs. It needs to be noted that the input sparsity of the first convolutional layer is zero, i.e., the raw input data of the network has no sparsity. Figures 10 and 11 show the overall trend of sparsity for plain models of AlexNet and VGG-Net. In terms of AlexNet, the input sparsity of the last two convolutional layers is up to about 80% . As for VGGNets, the input sparsity of the convolutional layers that have the max-pooling operation before the layer has an obvious decrease compared with the former convolutional layer. Also, for the sparsity of the last several convolutional layers, the input sparsity reaches a high value of no less than 80%. Figure 12 shows the details of GoogLeNet and Inception v3. For the "Inception module" stacked architectures, the sparsity of Inception-Net represents the periodicity along with the Inception blocks. In terms of GoogLeNet, the 9th, 15th, 21st, 27th, 33rd, 39th, 44th, and 50th convolutional layers, which correspond to the [ 1 × 1 ] convolutional layer after the max-pooling layer within the Inception block shown in Fig. 21, have an obvious decrease in sparsity. As for Inception v3, the model has the same pattern as that of GoogLeNet. The overall sparsity of Inception-Net keeps rising on the whole as the layers get deeper. The average input sparsity of GoogLeNet and that of Inception v3 is no less than 50% and 60% , respectively. Fig. 12 InceptionNet layer-wise sparsity Figure 13 shows the details of ResNet. In terms of ResNet-34, the figure shows a feature that the input sparsity of the first [ 3 × 3 ] convolutional layer is much less than that of the second one within the residual block, which is shown in Fig. 23, because there is a ReLU function before the second convolutional layer but not for the first convolutional layer within the block. The average input sparsity of the convolutional layers is no less than 50%. As for ResNet-101, the input sparsity of the [ 3 × 3 ] convolutional layer is much less than that of the two [ 1 × 1 ] convolutional layers. For example, the input sparsity of the 35th convolutional layer is about 30 % less than that of the two [ 1 × 1 ] convolutional layers within the residual block- Fig. 23. The average input sparsity of the convolutional layers is no less than 55 %.

DenseNet
The increasing and decreasing changes of sparsity shown in Fig. 14 reflect the "bottleneck" architecture of the network: the layer structure of BatchNorm(BN)-ReLU-Conv([1 × 1])- ). The results show a feature for DenseNet that the input sparsity of the [ 1 × 1 ] convolutional layer is obviously less than that of the [ 3 × 3 ] convolutional layer for the convolutional layers within the Dense Blocks. In general, the sparsity keeps rising as the layers get deeper, and the percentage can be up to no less than 90 % for the last several [ 3 × 3 ] convolutional layers.

lightweight architectures
This section investigates the input sparsity of convolutional layers on the three lightweight architectures.
MobileNet: Fig. 15 shows the dynamic sparsity of MobileNet V2. For the reason that there is no ReLU operation before nearly one-third of convolutional layers, i.e., the last [ 1 × 1 ] convolutional layers of bottleneck1 and bottleneck 2, and the last convolutional layer of the network, the sparsity of these convolutional layers is zero. The average dynamic sparsity of the other two-thirds is about 40% . Especially, it needs to be noted that the input sparsity of the second to last convolutional layer is only about 8%.
ShuffleNet: Fig. 16 shows the results of ShuffleNet V2. The sparsity of the right branch [ 3 × 3 ] convolution of Block B, which is shown in Fig. 25, i.e., the 4th, 19th, and 44th convolutional layers, has an obvious increase of about 30 % compared with that of the former convolutional layer within the block. Since there is no ReLU function before a number of convolutional layers, i.e., the last [ 1 × 1 ] convolutional layers of Block A and Block B, the input sparsity of about one-third of convolutional layers is zero. The average sparsity of the other two-thirds is about 50 %.
MnasNet: Fig. 17 shows the results of MnasNet 1.0. For the first repeat cycle MBConv-A and MBConv-B, the sparsity of the corresponding [ 3 × 3 ] and [ 5 × 5 ] convolutional layer is higher than the following [ 1 × 1 ] convolutional layer within the block; The other layers contrast this. As for MBConv-C and MBConv-D, except for the 26th convolutional layer, the sparsity of the corresponding [ 3 × 3 ] and [ 5 × 5 ] convolutional layers is higher than the following [ 1 × 1 ] convolutional layer within these blocks.

Sparsity enhancement
The sparsity in DNNs is largely attributed to the activation function, ReLU (Agarap 2018). In this section, we make an explore to enhance the dynamic sparsity by utilizing the ReLU function for the inference phase after the networks are trained.
The activation function is an extremely important feature of artificial neural networks as it helps a model account for interaction and non-linear effects. This enables DNNs to learn complex patterns from the data. All of the DNNs analysed in the paper use ReLU as the default activation function. The function can be described in the following equation.
In the formula, the threshold is equal to 0.
For the sake of convenience to analyse the effect of the threshold value, VGGNets were used to carry out the experiment for their simplicity. First, the distribution of the VGG16 convolutional layers' input data was investigated. From the surveys, it can be concluded that the trend of the data is consistent. As the length is limited, only three figures of different inputs about VGG16 are shown in Fig. 18, and both VGG13 and VGG19 share the same patterns. Figure 18 shows that the numerical distribution of input is quite different for the first convolutional layer, of which the data is the raw input of the network. However, the x for x >= threshold distribution of input value tends to be consistent from the second convolutional layer to the last one. Then, the threshold of ReLU layers is adjusted in accordance with Fig. 18, and the following experiments are conducted. First, we investigate the corresponding convolutional layer-wise dynamic sparsity and then test the precision variation. The experimental results are shown in Table 13, which is averaged from 100 inputs. A. Thresholds of ReLUs from the 1st ReLU layer to the 12th ReLU layer are changed to 0.1, others remain unchanged.
B. Thresholds of ReLUs from the 1st ReLU layer to the 12th ReLU layer are changed to 0.5, others remain unchanged.
C. Thresholds of ReLUs from the 7th ReLU layer to the 12th ReLU layer are changed to 2, others remain unchanged.
According to the experimental results shown in Table 13, zero maybe is not the best threshold value for the accuracy of the models as the accuracy increased from 72.2% with a zero threshold to 72.4% in scheme B, and the sparsity slightly increased. At the same, increasing the threshold value of trained DNNs brings more or less accuracy loss. With an acceptable accuracy loss, the dynamic sparsity can be significantly increased for a number of convolutional layers by appropriately adjusting the threshold of ReLU, such as conv1 being increased by about 34% in scheme B, and the last six convolutional layers increased  by about 6 % in scheme C. This indicates a direction for both improving the accuracy and increasing the sparsity of DNNs.

Summary
The data shows that the activation sparsity of convolutional layers is quite high with specific characteristics in accordance with their respective architectures. Even for lightweight architectures which have been highly optimised and compressed, two-thirds of the average activation sparsity is no less than 40% . At the same time, although the dynamic sparsity is closely related to the dataset, previous studies have shown that the dynamic sparsity for other datasets such as CIFAR100 and Kuzushiji is no less than the percentage concerning ILSVRC 2012 studied in this paper as long as the network is fully trained ). Thus, the findings are widely applicable to various datasets.
In general, the experimental results in this section convincingly demonstrate that DNNs have an overwhelming amount of redundant computations concerning the dynamic activation sparsity for the reason that the computations concerning the sparsity are absolutely meaningless and thus can be omitted. This indicates that DNNs have significant room for improvement. FPGA provides a potential solution for this issue. For example, ) designed the accelerator Crane with the load-balancing method to address various types of sparsity irregularities in DNNs, and achieves 1.27× 1.88× speedup compared with the state-of-the-art accelerator baselines. So, although the high percentage of dynamic sparsity makes it hard to be fully utilized for its irregularity but also leaves much room to be further studied for performance improvement, especially for the SIMD-CPU architecture. The dynamic activation sparsity analysed in this section provides valuable references with detailed layer-wise redundancy for optimising and accelerating DNNs.

Convolution operation on CPU
As is analysed that convolutional operations dominate the time consumption of DNNs in Section 5, we make a thorough analysis on the principle of convolutional computing upon modern CPUs to provide potential solutions for accelerating DNNs by utilising the sparsity analysed in Section 6.
In the contemporary era, CPUs can achieve peak performance through SIMD extensions. With the support of SIMD instructions, a single processor performs the same operation simultaneously on multiple pieces of data, by loading multiple values into wide vector registers(e.g. YMM registers) to be processed together (Cardoso et al. 2017). For example, AVX2 uses sixteen YMM registers to hold and perform simultaneous operations on eight 32-bit single-precision or four 64-bit double-precision floating-point numbers per CPU cycle. The more advanced AVX-512 handles up to 512-bit data. In addition, these instruction sets introduce fused multiply-accumulate(FMA) operations (Nasiri et al. 2014), which execute one vectorized multiplication and then accumulate the results to another vector register in the same CPU cycle. SIMD sets have been easily incorporated into C and C++ through the use of API extension sets referred to as intrinsic functions or intrinsics, which provide access to SIMD sets and are suitable for the most commonly-used C and C++ compilers [82]. Thread-level parallelism is explicitly represented to improve the overall processor performance through the use of multiple threads of execution. These techniques have been utilised by MKL-DNN to optimise DNN operations.
Computation and memory efficiency are two vital factors for hardware to process data; DNNs are no exception. In this section, as the convolution computing predominates DNN operations, the operation and the corresponding data format that oneDNN utilises on CPUs are detailed.

Data format
Computations are mostly about data: reading, processing, storing, etc. In the process of convolution, feature maps and weights/filters require appropriate formats in memory to leverage SIMD instruction sets. This section represents the data formats for DNNs. Figure 19 shows the data format of feature maps. The activations consist of four dimensions: batch(N), channels(C), height(H), and width(W). The format of the feature maps stored in memory is up to the orders of the four dimensions. NHWC, NCHW, and CHWN are commonly-used formats and can be created with MKL-DNN's built-in functions [25]. In the paper, activations are laid out in memory with the NCHW format, which means the overall dimensionality of the input and output are 4D tensors with batch as the outermost and width as the innermost dimension of the data. In addition, it needs to be noted that the batch number is zero for the inference phase. To get the data in memory, it is necessary to define how to map the 4D-tensor to a 1D tensor via an offset function that takes a logical index (n,c,h,w) as an input and returns the address displacement to where the value is located [25]. Taking the NCHW format as example, the function can be described as: where w denotes the inner-most dimension: the corresponding elements share the same indices of n, c and h, and are adjacent in memory; In addition, their indexes of w would be different by one; This is always true for only non-border elements; In terms of n, it is the outermost dimension here: it needs to jump over the whole image size C*H*W to take the same pixel (c,h,w) but only on the next feature.

Convolution computing
Convolution operations applied in DNNs correspond to a series of multiplications and accumulations. Taking the first convolutional layer of VGGNet as an example, the key AVX2 instructions-AT &T format-to perform convolution computing is shown in Fig. 20. (2) off set_nchw (n, c, h, w)  To fully utilise the potential of SIMD instructions, the process can be simply divided into three steps, and the basic computing unit calculates eight elements that correspond to the same location of eight output channels simultaneously in one cycle. The SIMD instructions load feature-map elements and weights to vector registers, and then implement the multiplication and accumulation operations with FMA instructions. S1: Using the "vbroadcastss m32, %ymm1" instruction to broadcast the feature-map element (e.g., R11) in the memory address to the eight locations in the YMM register.
S2: Using the "vmovups %ymm2/m256, %ymm1" to move eight weights (the eight values corresponding to the eight filters: r111 to r811) in the memory address to the eight locations in the YMM register.
S3: Using the "vfmadd231ps %ymm3/m256, %ymm2, %ymm1" instruction to multiply the eight packed single-precision floating-point values in the first source operand and the second source operand, adding the intermediate result to the third source operand, performing the rounding operation, and storing the result values to the third source operand, i.e., the destination operand.
The process from S1 to S3 will be iterated 27 times to get the final results from m1 to m8, for the reason that the input channel is RGB and the kernel size is [ 3 × 3].
The analysis in this section provides in-depth knowledge about the convolution operation for DNNs as well as the related data format that fits the computing upon modern CPUs. In terms of SIMD CPU architectures,  proposed to skip the meaningless computing concerning dynamic sparsity with branch commands at SIMD instruction level. However, due to the branch misses brought by the proposal, the method simply achieved the performance improvement by 10.26% on dataset ImageNet for certain convolutional layers in the inference process. Despite the computing architecture being for SIMD-CPU architecture, it also provides significant references for hardware accelerator design on hardware level, for example, FPGA and ASIC based design.

Conclusion
The paper conducts a systematic, comprehensive, and thorough inference workload study on the prominent DNN architectures. First, the kernel-level analysis, instruction-level convolution computing mechanism, and the corresponding data structure, provide in-depth knowledge for the characteristics of DNNs on SIMD CPUs. This indicates promising research directions with precise critical points for the future development of DNN accelerators, especially the prospective hardware solutions like FPGA-based design and heterogeneous accelerators. Second, the layer-wise time performance of DNNs offers exhaustive details about the time cost of each layer and the possible time-consumption bottlenecks of the networks. Third, the layer-wise dynamic activation sparsity provides researchers with precise layer-wise redundancy of DNN networks; The data also proves conclusively that DNNs are heavily over parameterized with massive redundant computations. In addition, although the research is conducted on a SIMD-CPU architecture with PyTorch, the findings are widely applicable to all machine learning frameworks and various hardware platforms including FPGA, GPUs, and ASIC. To summarize, the research offers significant references for optimising and accelerating the inference workload of DNNs at the instruction level by improving the time-consuming bottlenecks and eliminating the redundant computations concerning the feature-map sparsity at both the hardware and software levels.

A Appendix: Block architectures of DNNs
Inception v1 block: As Figure 21 shows, it consists of four branches with two different sizes of filters. The main feature is that block adopts the dimension reduced approach, i.e., the number of input channels is limited by utilising an [ 1 × 1 ] convolution before the [ 3 × 3 ] convolution. Finally, the outputs of the four branches are concatenated and sent to the next block.   Inception v3 block: As Figure 22 shows, there are five different kinds of inception modules. In terms of Inception A, it features that the module factorizes the [ 5 × 5 ] convolution to two [ 3 × 3 ] convolution operations to improve computational speed. The [ 5 × 5 ] convolution is 2.78 times more expensive than a [ 3 × 3 ] convolution. In terms of Inception C, D, they factorize convolutions of filter size [ n × n ] to a combination of [ 1 × n ] and [ n × 1 ] convolutions. As for Inception E, the filter banks in the module are expanded wider instead of deeper to avoid the excessive reduction in dimensions which leads to loss of information. Residual block: As Figure 23 shows, the module features by utilising the "shortcut connections" to address the degradation and the saturation or even degrading rapidly of the gradient of DNNs as the network gets deeper. The left module is for ResNet34, and the right module is for ResNet101. Figure 24 shows, the network adopts two kinds of bottlenecks. The bottleneck features by adopting depthwise convolution (Dwise) for the [ 3 × 3 ] convolution to reduce the model size and complexity. The [ 3 × 3 ] convolution is used to increase/decrease the dimensions.