1 Introduction

Over the past years, convolutional neural networks (CNNs) have exhibited outstanding accuracy in a myriad of applications, including but not limited to face, speech, image or handwriting recognition, autonomous driving, and automatic medical diagnosis [1, 2]. The power and effectiveness of CNNs are due to the convolutional operators utilised therein, which can automatically detect distinctive image features while reducing the arithmetic cost and memory consumption required by previous approaches. This operator, however, is responsible for a substantial portion of the computational cost of CNN training and inference [2]. Consequently, significant efforts have been devoted to developing efficient algorithms for this particular computational core on almost all current processor architectures [3, 4]. In applications where CNN inference is deployed on smart sensors or battery-operated devices equipped with micro-controller units (MCUs) (e.g. ARM Cortex-M CPUs) or low-power processors (e.g. ARM Cortex-A CPUs), the optimisation of this operator is strongly focused on reducing its energy consumption [5].

In this paper, we contribute to this line of work with a comprehensive analysis of the performance and energy efficiency of different convolution algorithms for deep learning (DL) inference on a collection of ARM-based processor architectures. Specifically, we make the following major contributions:

  • We describe our high-performance implementations of the direct, explicit and implicit lowering, and Winograd convolution algorithms, optimised for ARM processors, and their integration within PyDTNN, a Python framework for training and inference of deep neural networks (DNNs) [6].

  • We characterise the target ARM processors of different NVIDIA Jetson platforms based on Orin, Xavier, Nano, and TX2, as well as their execution models and the available power rails for measuring the CPU and memory energy consumption.

  • We assess the performance and energy efficiency of the ResNet-50v1.5 CNN on varying configurations of convolution algorithms, number of threads/cores, ARM processors and operating frequencies. For each tested processor, we determine the best-performing configuration and the most energy-efficient scenario.

The rest of the paper is structured as follows. In Sect. 2, we review related work and compare it with the analysis carried out in this study. Next, in Sect. 3, we describe the different convolution algorithms and their optimisations for ARM processors. In Sect. 4, we evaluate the performance and energy consumption under different configurations, and finally, in Sect. 5, we close the paper with a few remarks and a brief discussion of future work.

2 Related work

Among the different methods proposed in the literature for the convolution operator, we can list (i) the direct algorithm, usually implemented as six nested loops around a multiply-and-add instruction [4]; (ii) the lowering (also known as im2col/im2row-based) approach, which transforms the input image(s) into a matrix in such a way that a general matrix–matrix (gemm) multiplication can then be used to compute the convolution [7, 8]; (iii) the FFT-based algorithm, which shifts the computation into the frequency domain in order to reduce the arithmetic requirements [9,10,11]; and (iv) the Winograd-based convolution, which leverages the Winograd minimal filtering algorithm to decrease the arithmetic cost of the convolution [9, 12]. The general view that emerges from these methods and their high-performance implementations for MCUs and low-power processors (e.g. CMSIS-NN [13], the ARM Compute Library [14], and NNPACK [15]) is that the best option from the performance and energy viewpoints largely depends on the parameters that define the convolution operation (i.e. the dimensions of the filters and the image, the batch size, etc.), as well as on the performance states and working modes provided by each processor architecture.

To perform DNN inference on low-power processors and MCUs, their limited computational and memory capabilities demand aggressive optimisations of both the algorithms and their implementations. For instance, the authors in [16] propose a series of incremental improvements for DNN inference, such as adjusting the convolution algorithm or the cache blocking parameters, which are evaluated performance- and energy-wise on an NVIDIA Jetson Xavier. Similarly, the authors in [17] present novel optimisation techniques based on layer separation and sparsification that are employed for wearable devices based on the Qualcomm Snapdragon 400, ARM Cortex-M0 and M3, and NVIDIA Tegra K1. Another approach for minimising resource requirements (computation, memory, and energy) is DeepX [18], a software tool that allows large-scale DL models to be executed efficiently on modern mobile processors. In this work, however, we focus on performing an exhaustive performance–energy evaluation of DL inference using a series of highly optimised convolution algorithms on ARM processors.

3 Convolution algorithms

The convolution operator

$$\begin{aligned} O = \textsc {conv}(F, I), \end{aligned}$$
(1)

receives a sequence of filter tensors, F, and 4-dimensional (4D) inputs, I, to produce 4D output tensors, O, where:

  • F comprises \(c_{o}\) filters of dimension \(h_{f}\times w_{f}\times c_{i}\) each, where \(h_{f}\times w_{f}\) correspond to the filter height \(\times\) width.

  • I consists of \(b\) input images of size \(h_{i}\times w_{i}\times c_{i}\) each, where \(h_{i}\times w_{i}\) denote the image height \(\times\) width, and \(c_{i}\) stands for the number of input channels.

  • O is composed of \(b\) outputs of size \(h_{o}\times w_{o}\times c_{o}\) each, where \(h_{o}\times w_{o}\) represent the output height \(\times\) width and \(c_{o}\) is the number of output channels.

The basic algorithm for the direct convolution in Listing 1 shows that each filter convolves a subtensor of the inputs, with the same dimension as the filter, to render a single scalar value (entry) for one of the \(c_{o}\) outputs. The filter is then repeatedly applied to the whole input, in a sliding window manner, to produce the complete entries of this single output [2].

Listing 1 Basic algorithm for the direct convolution
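For illustration, the following is a minimal C sketch of such a direct convolution loop nest. It assumes unit stride, no padding, an NHWC input/output layout and an \(h_{f}\times w_{f}\times c_{i}\times c_{o}\) filter layout; the function and index-helper names are ours and the code is only a sketch, not the exact Listing 1.

```c
/* Minimal sketch of a direct convolution (cf. Listing 1):
 * unit stride, no padding, NHWC layout assumed. */
#include <stddef.h>

static inline size_t idx4(size_t n0, size_t n1, size_t n2, size_t n3,
                          size_t d1, size_t d2, size_t d3) {
  /* flat index of element (n0,n1,n2,n3) in a d0 x d1 x d2 x d3 array */
  return ((n0 * d1 + n1) * d2 + n2) * d3 + n3;
}

void conv_direct_naive(size_t b, size_t hi, size_t wi, size_t ci,
                       size_t co, size_t hf, size_t wf,
                       const float *I,  /* b x hi x wi x ci         */
                       const float *F,  /* hf x wf x ci x co        */
                       float *O)        /* b x ho x wo x co, zeroed */
{
  size_t ho = hi - hf + 1, wo = wi - wf + 1;
  for (size_t n = 0; n < b; n++)              /* images            */
    for (size_t k = 0; k < co; k++)           /* output channels   */
      for (size_t y = 0; y < ho; y++)         /* output rows       */
        for (size_t x = 0; x < wo; x++)       /* output columns    */
          for (size_t r = 0; r < hf; r++)     /* filter rows       */
            for (size_t s = 0; s < wf; s++)   /* filter columns    */
              for (size_t c = 0; c < ci; c++) /* input channels    */
                O[idx4(n, y, x, k, ho, wo, co)] +=
                    F[idx4(r, s, c, k, wf, ci, co)] *
                    I[idx4(n, y + r, x + s, c, hi, wi, ci)];
}
```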

In the following sections, we review four different methods to compute the convolution operator and our high-performance implementations for ARM processors. In particular, we target: (i) a blocked variant of the direct algorithm (ConvDirect), (ii) a lowering approach (Lowering), (iii) an implicit lowering approach (ConvGemm), and (iv) a Winograd implementation (ConvWinograd).

3.1 Blocked algorithm for direct convolution

In previous work [19], we combined the blocking strategy in [4] for the direct convolution algorithm with the packing schemes employed in the high-performance formulation of gemm [20]. The result was a new blocked version of the direct convolution, referred to as ConvDirect and illustrated by the algorithm in Listing 2, with the following properties:

  • All the arithmetic is enclosed inside a micro-kernel that computes a gemm to update a small \(m_r\times n_r\) micro-tile of the result (in this case, O), mimicking the high-performance implementations of gemm in GotoBLAS2, BLIS, OpenBLAS, and AMD AOCL.

  • The micro-tile dimensions are decoupled from the cache blocking parameters \(w_{o,b}, c_{o,b}, c_{i,b}\).

  • The input tensor contents are packed into an \(m_c\times n_c\) buffer \(A_c\) so that its entries can be accessed with unit stride from the micro-kernel. (For simplicity, the algorithm in Listing 2 only shows where this packing routine is placed.)

  • The filter tensor is re-packed into a 5D tensor, of dimension \(h_{f}\times w_{f}\times c_{o}/ c_{o,b}\times c_{i}\times c_{o,b}\). This type of packing enables unit-stride accesses to B from the micro-kernel. It should be noted that as the filters do not vary during inference, this only needs to be done once for the DNN model, and its cost becomes negligible.

A key factor in attaining high performance with the blocked direct convolution lies in the utilisation of an architecture-specific micro-kernel. Decoupling the micro-tile dimensions from the cache blocking parameters, combined with the packing of the input tensor, facilitates leveraging existing high-performance micro-kernels specifically tuned for a concrete processor architecture [19]. A further advantage of our approach is that it directly handles the widely adopted NHWC data layout, avoiding the tensor transformation overhead of previous algorithm designs [4].

Listing 2 Blocked algorithm for the direct convolution (ConvDirect)
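The following hedged C sketch conveys the overall loop structure described above: blocking over \(c_{o,b}\), \(w_{o,b}\) and \(c_{i,b}\), with the loop over blocks of output channels parallelised via OpenMP. The blocking factors are arbitrary placeholders, the packing into \(A_c\) and the actual \(m_r\times n_r\) micro-kernel are replaced here by plain loops, and unit stride, no padding, zero-initialised O and the layouts of the previous sketch are assumed; this is not the exact code of Listing 2.

```c
/* Hedged sketch of a blocked direct convolution in the spirit of
 * Listing 2. NHWC layout, unit stride, no padding, O zeroed. */
#include <stddef.h>

#define COB 32   /* block of output channels  (c_{o,b}), placeholder */
#define WOB 16   /* block of output width     (w_{o,b}), placeholder */
#define CIB 64   /* block of input channels   (c_{i,b}), placeholder */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void conv_direct_blocked(size_t b, size_t hi, size_t wi, size_t ci,
                         size_t co, size_t hf, size_t wf,
                         const float *I, const float *F, float *O)
{
  size_t ho = hi - hf + 1, wo = wi - wf + 1;
  for (size_t n = 0; n < b; n++) {
    /* Loop over blocks of output channels: this is the loop that is
     * parallelised with OpenMP (cf. Line 11 of Listing 2). */
    #pragma omp parallel for
    for (size_t kb = 0; kb < co; kb += COB) {
      size_t kend = MIN(kb + COB, co);
      for (size_t y = 0; y < ho; y++)
        for (size_t xb = 0; xb < wo; xb += WOB) {
          size_t xend = MIN(xb + WOB, wo);
          for (size_t cb = 0; cb < ci; cb += CIB) {
            size_t cend = MIN(cb + CIB, ci);
            for (size_t r = 0; r < hf; r++)
              for (size_t s = 0; s < wf; s++)
                /* In the real implementation, the input operand is
                 * packed into the buffer A_c here and an m_r x n_r
                 * micro-kernel performs the update; the plain loops
                 * below stand in for that micro-kernel. */
                for (size_t x = xb; x < xend; x++)
                  for (size_t k = kb; k < kend; k++) {
                    float acc = 0.0f;
                    for (size_t c = cb; c < cend; c++)
                      acc += F[((r * wf + s) * ci + c) * co + k] *
                             I[((n * hi + y + r) * wi + x + s) * ci + c];
                    O[((n * ho + y) * wo + x) * co + k] += acc;
                  }
          }
        }
    }
  }
}
```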

Micro-kernels for the blocked direct convolution The evaluated direct convolution framework includes a number of micro-kernels specifically developed and tuned for ARM NEON (ARMv8). These micro-kernels work with different micro-tile sizes, MR \(\times\) NR, such as 8\(\times\)12, 4\(\times\)12, 4\(\times\)16, and 4\(\times\)20. Among these, the 8\(\times\)12 micro-kernel generally attains the best performance, and it is therefore the size selected for the ConvDirect experiments. This micro-kernel comes in two implementations: the first, written in ARM assembly, is used when the micro-tile size exactly matches MR \(\times\) NR (8\(\times\)12); the second, written with ARM NEON intrinsics, is called in those cases where the micro-tile is smaller than MR \(\times\) NR.
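To illustrate how such intrinsics-based micro-kernels operate, the following is a simplified 4\(\times\)4 micro-kernel written with ARM NEON intrinsics. It assumes GotoBLAS-style packed micro-panels for its A and B operands and is only a sketch, not one of the actual 8\(\times\)12 or 4\(\times\)16 kernels used in the experiments.

```c
/* Hedged sketch of a 4x4 gemm micro-kernel with ARM NEON intrinsics.
 * A: packed micro-panel, MR=4 rows stored k-major (a[p*4 + i]).
 * B: packed micro-panel, NR=4 cols stored k-major (b[p*4 + j]).
 * C: 4x4 micro-tile of the output, row-major, leading dimension ldc. */
#include <arm_neon.h>

void ukernel_4x4(int kc, const float *a, const float *b,
                 float *c, int ldc)
{
  float32x4_t c0 = vld1q_f32(&c[0 * ldc]);
  float32x4_t c1 = vld1q_f32(&c[1 * ldc]);
  float32x4_t c2 = vld1q_f32(&c[2 * ldc]);
  float32x4_t c3 = vld1q_f32(&c[3 * ldc]);

  for (int p = 0; p < kc; p++) {
    float32x4_t bp = vld1q_f32(&b[p * 4]);  /* one packed row of B   */
    c0 = vfmaq_n_f32(c0, bp, a[p * 4 + 0]); /* C[0,:] += a[0,p]*B[p,:] */
    c1 = vfmaq_n_f32(c1, bp, a[p * 4 + 1]);
    c2 = vfmaq_n_f32(c2, bp, a[p * 4 + 2]);
    c3 = vfmaq_n_f32(c3, bp, a[p * 4 + 3]);
  }

  vst1q_f32(&c[0 * ldc], c0);
  vst1q_f32(&c[1 * ldc], c1);
  vst1q_f32(&c[2 * ldc], c2);
  vst1q_f32(&c[3 * ldc], c3);
}
```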

Parallelisation The blocked direct convolution presents a considerable number of independent loops, which offer a rich variety of parallelisation opportunities (loop-level parallelism) that can be exploited, for example, via OpenMP. In our case, we parallelised the loop traversing the \(c_{o,b}\) dimension (see Line 11 of Listing 2).

3.2 Lowering approach

A high-performance implementation of the convolution operator can be obtained on current computer architectures by lowering this operator into a large matrix–matrix multiplication (gemm). For this purpose, assuming the input/output tensors are stored following the NHWC layout (and the filters in the CRSK layout), the lowering approach:

  1. Applies a row transformation to the 4D input tensor I in order to build an augmented 2D matrix A of size \(m \times k = (bh_{o}w_{o}) \times (c_{i}h_{f}w_{f})\) [7], as shown in the algorithm in Listing 3.

  2. Computes the output of the convolution directly from the gemm \(C = A \cdot B\), where \(C \equiv O\) is the output tensor, viewed as an \(m \times n = (bh_{o}w_{o}) \times c_{o}\) matrix, and \(B \equiv F\) is the filter tensor, viewed as a \(k \times n = (c_{i}h_{f}w_{f}) \times c_{o}\) matrix.

We denote the combination of these two steps as an explicit im2row-based convolution. The lowering approach performs the same arithmetic operations as the direct convolution in Listing 1 and therefore has the same numerical properties.

Listing 3 im2row transform for the lowering approach
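A minimal C sketch of an im2row transform in the spirit of Listing 3 is shown below, assuming unit stride, no padding, an NHWC input and an \((h_f, w_f, c_i)\) ordering of the k dimension of A. The OpenMP directive and the contiguous per-channel copies anticipate the parallelisation and vectorisation discussion that follows; note that the actual PyDTNN code parallelises the batch and input-channel loops instead.

```c
/* Hedged sketch of an im2row transform: A has b*ho*wo rows of
 * hf*wf*ci elements, stored row-major; the innermost ci entries are
 * contiguous in both I (NHWC) and A, so they are copied as a block. */
#include <stddef.h>
#include <string.h>

void im2row(size_t b, size_t hi, size_t wi, size_t ci,
            size_t hf, size_t wf, const float *I, float *A)
{
  size_t ho = hi - hf + 1, wo = wi - wf + 1;
  size_t k = hf * wf * ci;                   /* columns of A          */
  #pragma omp parallel for collapse(2)
  for (size_t n = 0; n < b; n++)
    for (size_t y = 0; y < ho; y++)
      for (size_t x = 0; x < wo; x++) {
        float *arow = &A[((n * ho + y) * wo + x) * k];
        for (size_t r = 0; r < hf; r++)
          for (size_t s = 0; s < wf; s++) {
            /* ci contiguous values of I copied into row (n,y,x) of A */
            const float *src = &I[((n * hi + y + r) * wi + x + s) * ci];
            memcpy(&arow[(r * wf + s) * ci], src, ci * sizeof(float));
          }
      }
}
```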

Parallelising and vectorising the im2row transform. The im2row transform is a memory-bound kernel which can be easily parallelised by adding an OpenMP directive to one of the outermost loops or to a collapsed combination of them. In our case, we parallelised the loops iterating over the batch size and the input channels, though the parallelisation efficiency is limited when multiple threads saturate the memory bandwidth while accessing I and A, or when the number of collapsed iterations is not large enough.

Its vectorisation, however, is not straightforward. This kernel only performs data movements between I and the workspace A, so we can efficiently benefit from SIMD loads (e.g. via ARM NEON intrinsics) provided the innermost loop traverses the \(c_{i}\) dimension of the problem. SIMD stores, however, cannot be exploited when writing the values to A.

High-performance gemm and lowering. On the positive side, the large dimensions of the gemm appearing in the lowering approach favour an efficient vectorisation and expose a high degree of loop-level parallelism on multicore ARM architectures. Both can be exploited by invoking an existing high-performance implementation of gemm, such as that in the BLIS framework [20].
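As a sketch, and assuming BLIS is built with its BLAS/CBLAS compatibility layer, the entire lowering step then reduces to a single gemm call with the dimensions given above; the wrapper name is illustrative.

```c
/* Hedged sketch: after im2row, the whole convolution is one gemm,
 * C = A * B, invoked here through the CBLAS interface.
 * Dimensions as in Sect. 3.2: m = b*ho*wo, n = co, k = ci*hf*wf. */
#include <cblas.h>

void lowering_gemm(int b, int ho, int wo, int co, int ci, int hf, int wf,
                   const float *A,  /* m x k, row-major (im2row output) */
                   const float *B,  /* k x n, row-major (filters)       */
                   float *C)        /* m x n, row-major (output tensor) */
{
  int m = b * ho * wo, n = co, k = ci * hf * wf;
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
}
```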

3.3 Implicit lowering approach

The downside of the lowering approach described in the previous subsection is that it requires a large temporary workspace A (\(h_{f}w_{f}\) times larger than I) and incurs a certain overhead due to the required data copies. To minimise these costs, instead of explicitly constructing the matrix A, it is possible to combine the im2row transform with the packing of A into the buffer \(A_c\) that occurs within the BLIS gemm [3]. For this purpose, during the execution of the gemm kernel, the buffer \(A_c\) is built directly from the contents of the input tensor I, instead of from the augmented matrix A. This approach never explicitly assembles A, which yields large memory savings because \(A_c\) typically occupies a few kilobytes and is much smaller than A, whose maximum size depends on the model, the data set, the batch size, and the data type. For example, for ResNet-50, ImageNet, a batch size of 1, and the float32 data type, the maximum size of A is 6.9 MiB, whereas with a batch size of 64 it grows to 441 MiB. Our implementation of this implicit lowering variant is denoted as ConvGemm.
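The following hedged sketch conveys the idea behind ConvGemm: the packing routine of the gemm fills \(A_c\) directly from I using im2row index arithmetic, so A never materialises. For clarity, \(A_c\) is filled row-major here, whereas the actual BLIS packing arranges it into \(m_r\)-row micro-panels; unit stride, no padding, and the layouts of the previous sketches are assumed, and all names are illustrative.

```c
/* Hedged sketch: pack an mc x kc block of the *virtual* matrix A
 * (starting at row ic, column pc) into Ac, reading directly from the
 * NHWC input tensor I via im2row index arithmetic. */
#include <stddef.h>

void pack_Ac_from_I(size_t ic, size_t pc,   /* block offsets in A    */
                    size_t mc, size_t kc,   /* block sizes           */
                    size_t b, size_t hi, size_t wi, size_t ci,
                    size_t hf, size_t wf,
                    const float *I, float *Ac)
{
  size_t ho = hi - hf + 1, wo = wi - wf + 1;
  for (size_t i = 0; i < mc; i++) {
    size_t row = ic + i;                    /* row of A -> (n, y, x) */
    size_t n = row / (ho * wo);
    size_t y = (row / wo) % ho;
    size_t x = row % wo;
    for (size_t p = 0; p < kc; p++) {
      size_t col = pc + p;                  /* column of A -> (r, s, c) */
      size_t r = col / (wf * ci);
      size_t s = (col / ci) % wf;
      size_t c = col % ci;
      Ac[i * kc + p] = I[((n * hi + y + r) * wi + x + s) * ci + c];
    }
  }
}
```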

3.4 Winograd minimal filtering algorithm

The Winograd (minimal filtering) algorithm [21], whose implementation we refer to as ConvWinograd, provides a method to obtain an efficient implementation of the convolution operator [22]. Concretely, given a convolution layer that applies a filter f to an input image d, consisting of c input channels, in order to produce an output y with k channels, the Winograd-based convolution can be expressed as

$$\begin{aligned} \begin{array}{l} y_{i_k} = A^T\left( \sum _{i_c=1}^{c} \left( G f_{i_k,i_c} G^T\right) \odot \left( B^T d_{i_c} B\right) \right) A,\quad i_k=1,2,\ldots ,k, \end{array} \end{aligned}$$
(2)

where G and B denote, respectively, the transformation matrices for the filter and the input; A is the inverse transformation matrix; \(f_{i_k,i_c}\) is the \(i_c\)-th channel of the \(i_k\)-th filter; \(d_{i_c}\) is the \(i_c\)-th channel of the input image; \(y_{i_k}\) is the \(i_k\)-th channel of the output; and \(\odot\) denotes the Hadamard (element-wise) product [12].

From a practical point of view, the 2D Winograd-based convolution applies an \(r \times r\) filter to a \(t\times t\) input tile in order to produce an \(m \times m\) output tile, with \(t = m+r-1\). An \(h_i \times w_i\) image is processed by partitioning it into \(t \times t\) tiles, with an overlapping factor of \(r-1\) elements between neighbouring tiles, yielding \(\lceil h_i /m\rceil \lceil w_i/m\rceil\) tiles per channel. In this algorithm, choosing a larger value for m thus reduces the number of arithmetic operations, unfortunately at the cost of introducing numerical instability in the computation [23]. For that reason, m is usually set to be small, with two popular cases being \(F(m \times m, r \times r) = F(4 \times 4, 3 \times 3)\) and \(F(2 \times 2, 3 \times 3)\).
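For reference, one standard choice of transformation matrices for the \(F(2 \times 2, 3 \times 3)\) case (so that \(t=4\)) is

$$\begin{aligned} B^T = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad G = \begin{pmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}, \quad A^T = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{pmatrix}, \end{aligned}$$

so that each \(3\times 3\) filter channel is transformed as \(G f G^T\), each \(4\times 4\) input tile as \(B^T d B\), and the corresponding \(2\times 2\) output tile is recovered via the inverse transform \(A^T(\cdot )A\) in (2).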

According to Winograd’s formula (2), the intermediate Hadamard products are summed over all c channels to produce the \(i_k\)-th output channel. In our Winograd implementation [24], we scatter each transformed filter and input tile along the \(t \times t\) dimensions onto respective intermediate workspaces U and V, of sizes \(t\times t \times k \times c\) and \(t \times t \times c \times (\lceil h_i/m\rceil \lceil w_i/m\rceil )\), in order to collapse the Hadamard products and the element-wise summations into \(t \times t\) independent matrix–matrix multiplications (also known as a batched gemm [25]). Finally, the same coordinates of the resulting \(t\times t\) matrices are gathered to form a new \(t\times t\) tile, which is then used to compute the inverse transform as an \(m\times m\) tile of the output tensor.

In summary, the batched gemm variant of the Winograd algorithm exposes four major phases: 1) filter transform; 2) input transform; 3) batched gemm; and 4) output inverse transform. In DL, the 3D input/output tensors are extended with a batch dimension n using either the NCHW or the NHWC layouts.

OpenMP parallelisation In our implementation, the four phases of the algorithm are parallelised using OpenMP, as the kernels that apply the transform matrices to the filter/input/output tiles present no data dependencies. To augment loop-level parallelism, we also use the OpenMP collapse clause to fuse the first two loops in each phase. Each individual gemm kernel in phase 3 is executed serially, but we parallelise their computation across the \(t \times t\) dimensions.
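A hedged C sketch of phase 3, with a collapsed OpenMP loop over the \(t\times t\) multiplications and each serial gemm invoked here through the CBLAS interface, could look as follows (workspace layouts as in Sect. 3.4, row-major storage assumed; names are illustrative):

```c
/* Hedged sketch of the batched gemm phase of ConvWinograd:
 * M(i,j) = U(i,j) * V(i,j) for every (i,j) in the t x t grid.
 * U: t x t x k x c, V: t x t x c x ntiles, M: t x t x k x ntiles. */
#include <cblas.h>
#include <stddef.h>

void winograd_batched_gemm(int t, int k, int c, int ntiles,
                           const float *U, const float *V, float *M)
{
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < t; i++)
    for (int j = 0; j < t; j++) {
      const float *Uij = &U[(size_t)(i * t + j) * k * c];
      const float *Vij = &V[(size_t)(i * t + j) * c * ntiles];
      float       *Mij = &M[(size_t)(i * t + j) * k * ntiles];
      /* each (i,j) gemm runs serially inside its thread */
      cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  k, ntiles, c, 1.0f, Uij, c, Vij, ntiles,
                  0.0f, Mij, ntiles);
    }
}
```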

Vectorising the input transform The implementation of the Winograd filter/input/output transform phases is also vectorised using ARM NEON intrinsics. A specialised gemm for the filter/input/output tiles is implemented that computes only the non-zero elements of the sparse transformation matrices G, B, and A. Given that these matrix operands remain static throughout the computation, it is possible to hard-code their entries and operate directly on vector registers.
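As an example, a hard-coded, NEON-vectorised input transform \(B^T d B\) for a single \(4\times 4\) tile of the \(F(2\times 2, 3\times 3)\) case (using the matrices listed above) could be sketched as follows; the actual library covers all transform variants and additionally scatters the result into the workspace V.

```c
/* Hedged sketch: NEON input transform v = B^T d B for one 4x4 tile
 * of F(2x2, 3x3). Each tile row lives in a NEON register; the right
 * multiplication by B is performed by transposing, reapplying the
 * same row combinations, and transposing back. */
#include <arm_neon.h>

static inline void transpose4x4(float32x4_t r[4])
{
  float32x4x2_t t01 = vtrnq_f32(r[0], r[1]);
  float32x4x2_t t23 = vtrnq_f32(r[2], r[3]);
  r[0] = vcombine_f32(vget_low_f32(t01.val[0]),  vget_low_f32(t23.val[0]));
  r[1] = vcombine_f32(vget_low_f32(t01.val[1]),  vget_low_f32(t23.val[1]));
  r[2] = vcombine_f32(vget_high_f32(t01.val[0]), vget_high_f32(t23.val[0]));
  r[3] = vcombine_f32(vget_high_f32(t01.val[1]), vget_high_f32(t23.val[1]));
}

/* Non-zero pattern of B^T applied to the four tile rows:
 * w0 = d0 - d2; w1 = d1 + d2; w2 = d2 - d1; w3 = d1 - d3. */
static inline void apply_Bt_rows(float32x4_t d[4])
{
  float32x4_t w0 = vsubq_f32(d[0], d[2]);
  float32x4_t w1 = vaddq_f32(d[1], d[2]);
  float32x4_t w2 = vsubq_f32(d[2], d[1]);
  float32x4_t w3 = vsubq_f32(d[1], d[3]);
  d[0] = w0; d[1] = w1; d[2] = w2; d[3] = w3;
}

/* d, v: 4x4 row-major tiles; computes v = B^T d B. */
void winograd_input_transform_2x2_3x3(const float *d, float *v)
{
  float32x4_t t[4];
  for (int i = 0; i < 4; i++) t[i] = vld1q_f32(&d[4 * i]);
  apply_Bt_rows(t);   /* t = B^T d                        */
  transpose4x4(t);    /* t = (B^T d)^T                    */
  apply_Bt_rows(t);   /* t = B^T (B^T d)^T = (B^T d B)^T  */
  transpose4x4(t);    /* t = B^T d B                      */
  for (int i = 0; i < 4; i++) vst1q_f32(&v[4 * i], t[i]);
}
```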

4 Experimental results

In this section, we assess the performance and energy efficiency of the described convolution algorithms on four different platforms based on ARM multicore CPUs with varying core/frequency configurations using the ResNet-50 v1.5 CNN model. In [16], we performed an exhaustive execution analysis of different algorithms and described a series of optimisations to reduce the inference time. In this paper, we leverage the optimal versions of these algorithms.

4.1 Hardware setup

For the experiments, we use a range of NVIDIA Tegra platforms [26], a SoC series for battery-operated devices such as smartphones, personal digital assistants, and mobile Internet devices. The Tegra SoCs integrate an ARM architecture CPU, a GPU, a north/southbridge, and a memory controller into a single package. These low-power SoCs, branded as NVIDIA Jetson, emphasise performance for gaming and machine/deep learning applications with a special focus on energy efficiency.

Table 1 describes the specifications of the four selected NVIDIA Jetson platforms: AGX Orin, AGX Xavier, Nano, and TX2. For each platform, the table details the CPU model, architecture, maximum operating frequency, levels of cache, and memory type and size. Each of these systems permits the selection of different performance models, also known as NVP (NVIDIA Performance) models, allowing the user to enable/disable CPU cores and to adjust the core operating frequency in order to limit the maximum power consumption. NVP models can be configured and monitored via the nvpmodel and jetson-stats tools [27], respectively. The performance models configured for each platform are detailed in Table 2. There, the Max-N mode enables all CPU cores and allows them to consume as much power as they require to achieve their maximum performance. In contrast, modes with IDs from 1 onwards (except on TX2) enforce a maximum power budget, either by progressively disabling cores, by reducing the operating frequency, or by a combination of both. Note that the TX2 platform does not offer any kind of power capping in its available NVP models.

Table 1 Specifications for the selected NVIDIA Jetson platforms
Table 2 Selected NVP model configurations for the NVIDIA Jetson platforms

4.2 Power monitoring

The Jetson platforms feature several INA3221 on-board power sensors attached via I\(^2\)C, which can be monitored through sysfs file system nodes [28]. These sensors report the power, voltage, and current of the rails available on each platform. Table 3 details the CPU-related rails that we monitor for the energy efficiency evaluation. These include the CPU and memory (DDR) power rails but omit those related to the GPU and other on-board peripherals that are not used in this study.

To collect measurements from these power rails, we leverage PMLib, a Power Measurement Library [29]. This library implements a client-server model in which the server continuously reads power samples that can later be collected by the clients for a given time interval. To gather power measurements from each Jetson platform, the corresponding PMLib module reads the sysfs files related to the power rails listed in Table 3 at a sampling frequency of 10 Hz.
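For illustration, the following hedged C sketch shows how one such rail could be sampled at 10 Hz and integrated into energy. The sysfs node path and the assumption that the rail reports milliwatts are placeholders that depend on the Jetson model and driver version; PMLib hides this kind of loop behind its client-server interface.

```c
/* Hedged sketch: sample one INA3221 power rail via sysfs at 10 Hz and
 * integrate the readings into energy (Joules). The node path below is
 * a hypothetical placeholder; check the actual path and units on the
 * target platform. */
#include <stdio.h>
#include <unistd.h>

#define RAIL "/sys/bus/i2c/devices/1-0040/iio:device0/in_power0_input" /* placeholder */
#define PERIOD_US 100000   /* 10 Hz sampling period */

double sample_energy(int nsamples)
{
  double energy_j = 0.0;
  for (int i = 0; i < nsamples; i++) {
    FILE *f = fopen(RAIL, "r");
    if (f) {
      long mw = 0;
      if (fscanf(f, "%ld", &mw) == 1)   /* assumed to be milliwatts */
        energy_j += (mw / 1000.0) * (PERIOD_US / 1e6);
      fclose(f);
    }
    usleep(PERIOD_US);
  }
  return energy_j;
}
```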

Table 3 Measured CPU-related power rails for each system platform

4.3 DL framework, libraries, compilation flags, and parallelisation

To evaluate the different convolution algorithms, we bundled their codes into individual C libraries and integrated them into PyDTNN, a lightweight framework implemented in Python for DL training and inference [6, 30]. For this purpose, we developed the corresponding binding modules that internally call the ConvDirect, ConvGemm, and ConvWinograd C functions via the ctypes Python library. These Python modules interact with the PyDTNN layer class Conv2D to finally execute the convolution algorithm. In the case of ConvWinograd, the binding module calls the Winograd C function matching the filter size requested by the convolutional layer encountered in a given neural network model. Note that the libraries for the ConvGemm and ConvWinograd algorithms, which internally execute a gemm or leverage a gemm-related micro-kernel, are linked against BLIS v0.8.1. In contrast, the Lowering algorithm is implemented directly within the PyDTNN Conv2D layer using Cython v0.29.24 and parallelised with OpenMP; in this case, the gemm is also provided by the same BLIS library. It is also worth noting that the three convolution libraries (ConvGemm, ConvDirect and ConvWinograd) have been compiled with gcc v10.2.0 and the optimisation flags -O3 -fopenmp on all the platforms.

4.4 Testbed

For the evaluation, we measure the inference throughput and energy efficiency, in terms of images per second and images per Joule, respectively, of the ResNet-50 v1.5 [2] CNN model on the ImageNet dataset [31]. In all cases, the batch size is set to \(n=1\) to reflect the single-stream scenario of the MLCommons inference benchmark for edge computing. Also, all the operations in the CNN inference experiments are carried out in FP32 arithmetic.

4.5 Performance and energy efficiency scalability

Figure 1 reports the inference throughput (left-hand column) and energy efficiency (right-hand column) for the four NVIDIA Jetson platforms with a varying number of threads and NVP models. Note that both the throughput and energy efficiency figures were averaged over a total of 600 inferences with the ResNet-50 v1.5 model. From the performance point of view, we observe that increasing the operating frequency and the number of threads delivers, in general, the best throughput. This behaviour is common to all the platforms and algorithms, except for ConvDirect on the Xavier, where the scalability is limited by a poor use of the cache hierarchy in the multithreaded scenario. Setting aside this outlying result, the algorithm that in general delivers the best performance across all configurations and platforms is ConvWinograd.

Fig. 1 Performance and energy efficiency per convolution algorithm, number of threads and NVP model

Fig. 2 Performance and energy efficiency with the best thread configuration for each algorithm and NVP model

Focusing on energy efficiency, we observe different trends depending on the selected NVP model, number of threads, and platform. The first observation is that the best energy efficiency is not always obtained by increasing the number of threads. The most energy-efficient configuration for Orin on the 50W model uses only 8 of the 12 threads for all algorithms except ConvDirect. For Xavier on the 30W-All model, the best choice is to use 4 cores instead of all 8. As in the previous performance study, the ConvDirect algorithm on Xavier and Orin attains lower energy efficiency as the number of threads increases; again, this is due to the poor cache utilisation of this algorithm on these specific platforms. In contrast, increasing the number of threads on the Nano and TX2 platforms does improve their energy efficiency. Therefore, we can conclude that these platforms are far more energy proportional than Orin and Xavier, as the energy consumed by the algorithm decreases when more cores (threads) are used.

4.6 Performance and energy consumption for the best parallel configuration

Figure 2 summarises the results in Figure 1 by reporting only the configurations with the number of threads that delivers the best throughput (left-hand column) and energy efficiency (right-hand column) per algorithm and NVP model. The numbers at the top of the bars indicate the optimal number of threads for each configuration. Focusing on performance, we can conclude, as already mentioned above, that raising the NVP model (from model ID 1 onwards) in each device improves the performance of the different algorithms. Besides, the algorithm with the best throughput is ConvWinograd with the Max-N model. This combination of algorithm and NVP model outperforms all other configurations, except on TX2, where the Max-P-ARM model behaves similarly to Max-N.

Concerning energy efficiency, the results are less predictable. Nevertheless, the most energy-efficient algorithm is still ConvWinograd. The best NVP model, however, differs per platform: 50W for Orin, 30W-All for Xavier, Max-N for Nano, and Max-P-Core-All for TX2.

Table 4 Optimal thread and NVP model configurations for best performance and energy efficiency

Table 4 reports, for each convolution algorithm, the (NVP model, number of threads/cores) configurations that achieve the best throughput and energy efficiency. The throughput and energy efficiency of the best algorithm for each platform are highlighted in bold.

All in all, we can conclude that if the aim is to maximise performance, the best option on all the platforms is to use the ConvWinograd algorithm and set the NVP model to Max-N with the maximum number of cores. In contrast, if the goal is to reduce the energy footprint, there is no rule of thumb, as each platform has its own optimal configuration.

5 Concluding remarks

We have performed an exhaustive evaluation of the performance and energy efficiency attained by the selected convolution algorithms for DL inference on a collection of low-power ARM-based processor architectures. In particular, this analysis leverages a collection of NVIDIA Jetson platforms based on Orin, Xavier, Nano, and TX2, comprising different ARMv8-A- and ARMv8.2-A-based multicore processor models.

Regarding the experimental results, we can conclude that, for raw DL inference performance, the most suitable configuration consists in scaling the processors to the highest available frequency and using the maximum number of cores. However, this configuration only pays off provided the convolution algorithm delivers fair strong scaling and makes efficient use of the cache hierarchy. In contrast, if the goal is to reduce the energy footprint, we find no clear winner, as the optimal configuration varies with the platform, the algorithm, the number of cores, and the operating frequency. We relate these effects to the distinct energy efficiencies and energy proportionality of the assessed Tegra SoC devices for the evaluated convolution algorithms.

As part of our future work, we plan to complement the performance–energy trade-off analysis with high-end processors, such as the Fujitsu A64FX (ARMv8.2-A+SVE), and with MCUs equipped with ARM Cortex-M and ESP32 CPUs, e.g. the Arduino Nano 33 BLE Sense or the Espressif ESP32-EYE devices. To reproduce this experimentation on those platforms, we plan to develop external power–energy meters providing high-resolution measurements.