1 Introduction

The general matrix multiplication (gemm) is a key computational kernel on top of which a significant part of the basic linear algebra subprograms (BLAS) [1] is built [2, 3]. In addition, gemm plays a fundamental role for convolutional (deep) neural networks that are prominent in computer vision tasks, as well as for transformers that are currently used in natural language processing [4, 5]. For these reasons, it is natural that considerable effort has been spent over the past decades to optimise gemm on virtually any computer architecture.

The performance of gemm is strongly determined by the efficiency of a small architecture-specific component, known as the micro-kernel [6,7,8]. Most modern instances of BLAS contain a single micro-kernel per processor architecture, usually encoded in assembly by a computer architecture expert. However, the benefits of choosing among multiple micro-kernels have been illustrated for deep learning in [9] and for dense linear algebra, as well as scientific computing in general in [10].

This paper contributes towards the development of optimised versions of gemm by presenting two methods to automatically generate competitive micro-kernels for ARM NEON (v8.2) processors equipped with single-instruction multiple-data (SIMD) vector units. This is especially interesting for processors from the same family that do not yet have a tuned instance of gemm. In more detail, our work makes the following specific contributions:

  • Initially, we generalise the solution proposed in [9] to leverage C++ templates in order to produce micro-kernels, based on vector intrinsics, at compilation time. This makes it possible to deal more efficiently with “corner” cases that arise when the cache configuration parameters of the architecture are not integer multiples of the micro-kernel dimensions. In addition, the adoption of templates eases the generation of code for distinct data types using a single generator.

  • Next, we take one step forward to directly produce assembly micro-kernels, using Python scripts. Compared with the previous solution, this method presents the advantage of enforcing a given order of the micro-kernel instructions that the compiler will not change.

  • For three distinct ARM-based architectures and a collection of representative problem instances arising from practical deep learning applications, we demonstrate that a gemm routine that integrates the automatically generated micro-kernels provides competitive performance, on par with or even outperforming the realisation of gemm in highly tuned linear algebra libraries such as BLIS, OpenBLAS and ArmPL.

At this point, it is worth noting that our work targets ARM NEON processors, yet the Python scripts can be easily adapted to target ARM SVE, AMD AVX2, Intel AVX512 or even the RISC-V vector extension.

The remainder of the paper is structured as follows: In Sect. 2, we briefly review the modern implementation of gemm for current processors, with SIMD vector units and a multilayered memory hierarchy. In Sects. 3 and 4, we describe the two automatic generators introduced in this paper, respectively, relying on vector intrinsics and assembly. In Sect. 5, we evaluate the performance of the different solutions, and finally, in Sect. 6, we close the paper with the conclusion.


2 Baseline implementation of GEMM

Fig. 1: Baseline algorithm for gemm (top), data transfers across the memory hierarchy (bottom-left), and packing (bottom-right)

Consider the gemm \(C = C + AB\), where A, B and C are matrices of dimensions \(m \times k\), \(k \times n\) and \(m \times n\), respectively. Modern high-performance instances of this computational kernel, for conventional processor architectures with deep memory hierarchies, follow GotoBLAS [6] to encode it as five nested loops comprising two packing routines and a micro-kernel. Furthermore, for processors with SIMD vector units, the micro-kernel consists of an additional loop that performs an outer product per iteration. Figure 1 (top) displays the baseline algorithm for gemm, identifying the six loops, the two packings, and the micro-kernel. Portable realisations of gemm encode the five outermost loops and the two packing routines of the baseline algorithm in a high-level programming language such as C. In contrast, for high performance, the micro-kernel is an architecture-specific piece of code, usually encoded in assembly.

Cache hierarchy The three outermost loops of the baseline algorithm partition the matrix operands conformal to the processor cache hierarchy. This specific nesting of the loops, together with a proper packing of A, B (see the bottom-right plot in Fig. 1) plus a careful selection of the cache configuration parameters \(m_{c}, n_{c}, k_{c}\) [11], favour that, during the execution of the micro-kernel, the buffers \(A_{c}\) and \(B_{c}\) remain in the L2 and L3 cache memories, respectively.

Micro-kernel for SIMD processors The micro-kernel streams an \(m_{r} \times n_{r}\) micro-tile \(C_{r}\) of C from the main memory into the processor registers; an \(m_{r} \times k_{c}\) micro-panel \(A_{r}\) of \(A_{c}\) from the L2 cache; and a \(k_{c} \times n_{r}\) micro-panel \(B_{r}\) of \(B_{c}\) from the L1 cache; see the bottom-left plot in Fig. 1. Packings also ensure that the contents of buffers \(A_{r}, B_{r}\) are accessed with a unit stride during the execution of the micro-kernel, enabling the use of vector instructions to retrieve their contents. The arithmetic-to-memory access ratio (or arithmetic intensity [12]) of the micro-kernel is given by

$$\begin{aligned} \frac{2m_{r}\,n_{r}\,k_{c}}{(m_{r} +n_{r})\,k_{c}} =\frac{2m_{r}\,n_{r}}{m_{r} + n_{r}} \quad \text {flops}\,\text {per}\,\text {memory}\,\text {access}.\end{aligned}$$

Choosing large values for \(m_{r} \approx n_{r}\) thus maximises this ratio. For the same reason, it is convenient to maximise the use of the vector registers, without incurring in register spilling [11], by ensuring that

$$\begin{aligned} \frac{m_{r}}{{\texttt {vl}}\_{\texttt {fp}}}\, n_{r} + \frac{m_{r}}{{\texttt {vl}}\_{\texttt {fp}}} + \frac{n_{r}}{{\texttt {vl}}\_{\texttt {fp}}} \le {\texttt {vr}}, \end{aligned}$$

where \({\texttt {vl}}\_{\texttt {fp}}\) denotes the number of floating point numbers (elements) that fit into a single vector register and \({\texttt {vr}}\) is the number of vector registers.
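As an illustration, on an ARM NEON (v8.2) core with \({\texttt {vr}}=32\) vector registers and FP32 data (\({\texttt {vl}}\_{\texttt {fp}}=4\)), a \(12 \times 8\) micro-kernel fits within the register file while attaining a high arithmetic intensity:

$$\begin{aligned} \frac{12}{4}\cdot 8 + \frac{12}{4} + \frac{8}{4} = 29 \le 32, \qquad \frac{2\cdot 12\cdot 8}{12 + 8} = 9.6 \quad \text {flops}\,\text {per}\,\text {memory}\,\text {access}. \end{aligned}$$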

Parallelism The multi-threaded parallelization of gemm has been analysed for conventional multicore processors, modern many-threaded architectures, and asymmetric ARM-based processors in [7, 13, 14]. In those works, the parallelization of gemm is approached by extracting parallelism from any of the loops L1, L3, L4, L5, or a combination of them. (Loops L2 and L6 are not convenient as they introduce race conditions.)

The parallelization technique is rather orthogonal to the micro-kernel: As the parallelization targets one (or more) of the five outermost loops of gemm (L1–L5, except L2), and the micro-kernel only comprises the sixth loop (L6), any of the micro-kernels proposed in this work can be combined with the parallel approaches proposed in the literature. In consequence, and in order to keep the paper focused on the generation of the micro-kernel, we will omit the analysis of parallelism in the following.

3 ARM NEON micro-kernels for GEMM using vector intrinsics

In this section, we pursue the development of architecture-specific SIMD-based micro-kernels for ARM processors using vector intrinsics. However, instead of a manual development process, we demonstrate that it is feasible to generate the micro-kernels automatically, significantly easing this task while delivering fair performance.

For simplicity, hereafter we choose the 32-bit floating point (FP32) as the basic data type for all the routines/codes presented in the section (in C language, float). Furthermore, we target ARM NEON v8.2, for which the vector length is 128 bits (i.e. 16 bytes) so that, for FP32, vl_fp32=4 (FP32 numbers per vector register). The same ideas for automatic generation apply to other data types and SIMD-enabled processor architectures.

3.1 A simple generic micro-kernel

In [9], we took a significant step forward toward improving the portability and maintainability of the BLAS by introducing a “generic” (i.e. multiplatform) scheme that relies on C macros to abstract the vector data type and the basic vector intrinsics for load, store, and axpy (scalar \(\alpha \) times x plus y) update. In addition, when supported by the compiler, the generic micro-kernel can accommodate a micro-tile of any dimension \(m_{r} \times n_{r}\) using arrays of vector registers.

Listing 1: Generic micro-kernel that operates with an mr × nr micro-tile using vector intrinsics.

Listing 2: C macros to customise the generic micro-kernel for ARM NEON (v8.2) and FP32.

In order to present the solution in [9], assume for simplicity that \(m_{r}, n_{r}\) are both integer multiples of vl_fp32. We then need \(m_{v} \, n_{r} = (m_{r}/{\texttt {vl}}\_{\texttt {fp}}32) \, n_{r}\) vector registers to store the micro-tile \(C_{r}\); \(m_{v}\) for a single column of \(A_{r}\); and \(n_{v} = n_{r}/{\texttt {vl}}\_{\texttt {fp}}32\) for a single row of \(B_{r}\). Listing 1 displays our original generic micro-kernel, where we highlight a couple of details:

  • Prior to the main loop, indexed by kr (line 25), we load the contents of the \(m_{r} \times n_{r}\) micro-tile of C into the array of vector registers Cr via two nested loops (lines 21–23). The transfer in the opposite direction, from the vector registers back to the memory, is carried out after the main loop (lines 40–42).

  • At each iteration of the main loop, a column of \(m_{r}\) elements of \(A_{r}\) and a row of \(n_{r}\) elements of \(B_{r}\) are first loaded into the appropriate vector registers (lines 27–28 for ar and lines 29–30 for br, respectively). These elements then participate in the update of the micro-tile stored in Cr (lines 33–36).

The generic micro-kernel in Listing 1 is customised for ARM NEON (v8.2) and FP32 using the C macros displayed in Listing 2.

3.2 Evolving the generic micro-kernel

For the gemm baseline algorithm, when \(m_{c}, n_{c}\) are not integer multiples of \(m_{r}, n_{r}\), respectively, a significant benefit can be obtained by developing specialised micro-kernels which employ SIMD instructions to handle the “corner” cases. To illustrate this, imagine we have adopted an \(8 \times 8\) micro-kernel as the baseline. We will then encounter certain cases where it is necessary to process a micro-tile of smaller dimensions, for example \(4 \times 8\). In that particular case, it may be more efficient to employ a \(4 \times 8\) micro-kernel with vector instructions to update this smaller micro-tile than to employ either a scalar (i.e. non-SIMD) routine or the baseline \(8 \times 8\) micro-kernel. Furthermore, the corner cases where one or both micro-tile dimensions are not integer multiples of the vector register length (e.g. \(3 \times 7\)) can be dealt with via a micro-kernel whose dimensions are the immediately larger integer multiples (e.g. \(m_{r} \times n_{r}=4 \times 8\)), exploiting the fact that the buffers \(A_{c}, B_{c}\) accommodate this excess in the dimensions, and using scalar instructions to load and store only the necessary elements of \(C_{r}\).
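As a simple illustration of this rounding, the following minimal sketch (with hypothetical names, assuming FP32 data and the NEON vector length vl_fp32 = 4) maps the dimensions of an edge micro-tile to the smallest SIMD-friendly micro-kernel that covers it:

    # Minimal sketch: round the dimensions of an edge micro-tile up to the
    # nearest integer multiples of the vector length, so that a SIMD
    # micro-kernel of that (larger) size can process it.
    VL_FP32 = 4  # FP32 elements per 128-bit NEON vector register

    def padded_micro_kernel_dims(mr_edge, nr_edge, vl=VL_FP32):
        round_up = lambda x: ((x + vl - 1) // vl) * vl
        return round_up(mr_edge), round_up(nr_edge)

    print(padded_micro_kernel_dims(3, 7))  # (4, 8): covers a 3 x 7 corner micro-tile
    print(padded_micro_kernel_dims(4, 8))  # (4, 8): already SIMD-friendly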

Listing 3: Generic micro-kernel that leverages C++17 templates to operate with any mr × nr micro-tile using vector intrinsics.

To address these scenarios, the top part of Listing 3 presents an enhanced version of the generic micro-kernel in Listing 1 that utilises C++ templates to facilitate the generation of a collection of micro-kernels for any combination of \(m_{r}\) and \(n_{r}\). To achieve this, a set of auxiliary template functions are responsible for unrolling the micro-kernel loops using recursive integer template parameters and constant conditional (static-if) expressions specific to the C++17 standard (see Listing 10 in the appendix for details). During compilation, these functions are evaluated to generate the appropriate instructions within the loop body based on the values of \(m_{r}\) and \(n_{r}\). For instance, in the main function depicted at the bottom part of Listing 3, we instantiate gemm micro-kernels for sizes \(4 \times 4\), \(4 \times 8\), and \(8 \times 4\).

In conclusion, as in the case of the initial generic micro-kernel, this template-based version produces customised code for any architecture. It also accommodates the generation of code for a family of micro-kernels of different sizes at compile time.

4 ARM NEON micro-kernels for GEMM using assembly

In this section, we also address automatic generation of micro-kernels, but now targeting architecture-specific SIMD-based routines for ARM NEON processors using assembly instead of vector intrinsics.

4.1 Simple micro-kernels using ARM NEON assembly

In Listing 4, we show a simple micro-kernel of dimension \(m_{r} \times n_{r} = 4 \times 4\) for FP32 data and encoded using ARM NEON assembly. The routine receives the same five parameters as its counterpart with vector intrinsics in Listing 1: The dimension \(k_{c}\); address pointers to \(A_{r}\), \(B_{r}\) and C; and the leading dimension ldC. Note the connection between the two versions of the micro-kernel, emphasised with the use of the same comments for those blocks of the two codes that perform analogous functions. The assembly routine proceeds as follows:

  • From the address pointer C (\(\equiv \) C00) to the appropriate entry of the matrix C, the routine initialises the pointers C01, C02, C03 to the remaining three columns of the \(4 \times 4\) micro-tile taking into account the matrix column stride (lines 9–11).

  • Prior to the main loop, four columns of C, each consisting of four FP32 numbers, are loaded into four vector registers, Crq00–Crq03, using the assembly SIMD instruction ldr, which has an analogous function to that of the vector intrinsic vld1q_f32 (lines 13–16). After the loop, the contents of these vector registers are stored back into the matrix C using the assembly SIMD instruction str, with an analogous function to that of the vector intrinsic vst1q_f32 (lines 33–36). Note that CrqXY and CrvXY refer to the same register; the name used depends on whether the register is involved in a memory (load/store) instruction or an arithmetic operation, respectively.

  • At each iteration of the main loop, the routine loads one column of \(A_{r}\) into a vector register arq0 (line 19), one row of \(B_{r}\) into a vector register brq0 (line 21), and updates Crv00–Crv03 via four AXPYs (lines 24–27) using the assembly SIMD instruction fmla, which performs a vector fused multiply-add (functionally equivalent to the vector intrinsic vfmaq_laneq_f32).

Listing 4: Micro-kernel that operates with a 4 × 4 micro-tile using ARM NEON assembly.

Listing 5: Macros for the micro-kernel using ARM NEON assembly.

The \(4\times 4\) micro-kernel in Listing 4 exhibits a regular structure that can be generalised to many other dimensions. (This is demonstrated, for example, with the excerpt of code in Listing 11 provided in the appendix, which corresponds to the main loop of a \(12 \times 8\) assembly micro-kernel for ARM NEON and FP32 data.) Concretely, we identify the following characteristics, which are independent of the micro-kernel dimensions:

  1. The contents of the micro-tile are retrieved from memory into vector registers prior to the loop and written from there back to memory after it.

  2. The micro-kernel employs \(m_{v} \times n_{r}\) vector registers to keep the contents of the \(m_{r} \times n_{r}\) micro-tile.

  3. At each iteration, the loop loads the contents of one column of \(A_{r}\) and one row of \(B_{r}\), and then uses them to update the micro-tile once.

This similarity is the basis that motivates the automatic generation of assembly micro-kernels.

4.2 A Python generator of micro-kernels using ARM NEON assembly

A basic generator The regularity of the basic micro-kernels can be leveraged to generate their code automatically using the Python routine in Listing 6. (Indeed, the ARM NEON assembly codes for the \(4 \times 4\) and \(12 \times 8\) micro-kernels in Listings 4, 5, and 11 were obtained using this generator.) Inspecting the instructions of the generator, we can easily identify the different parts that produce the code fragments for the load/store of C, \(A_{r}\), \(B_{r}\), and the arithmetic. Note that this generator assumes that C is stored in column-major order. In case C is stored in row-major order, we can still use the same micro-kernel by swapping the roles of \(A_{r}\) and \(B_{r}\) and adjusting the leading dimension of C accordingly.

Listing 6: Micro-kernel generator that operates with an mr × nr micro-tile using ARM NEON assembly.

The simple generator in Listing 6 builds a micro-kernel that operates with an \(m_{r} \times n_{r}\) micro-tile of C, assuming there are enough vector registers for this. This can result in a compilation error if the number of utilised vector registers exceeds the maximum. This would be the case, for example, of a \(16 \times 8\) micro-kernel, which would require 32 vector registers for the micro-tile of C, 4 for the column of \(A_{r}\), and 2 for the row of \(B_{r}\), for a total of 38. In the actual generator, this type of situation is avoided with a simple logic test.
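To make the structure of such a generator concrete, the following minimal sketch (a simplification for illustration, not the actual generator of Listing 6) emits only the main loop of an \(m_{r} \times n_{r}\) FP32 micro-kernel and performs the register-budget test; the register assignment (kc in x0, the \(A_{r}\) pointer in x1, the \(B_{r}\) pointer in x2) and the omission of the prologue/epilogue that load/store the micro-tile of C are assumptions made to keep the sketch short:

    VL_FP32 = 4   # FP32 elements per 128-bit NEON vector register
    VR      = 32  # number of NEON vector registers

    def generate_main_loop(mr, nr):
        assert mr % VL_FP32 == 0 and nr % VL_FP32 == 0
        mv, nv = mr // VL_FP32, nr // VL_FP32
        # Register budget: micro-tile + one column of Ar + one row of Br
        needed = mv * nr + mv + nv
        if needed > VR:
            raise ValueError(f"a {mr} x {nr} micro-kernel needs {needed} "
                             f"vector registers (only {VR} available)")
        a_base, b_base = mv * nr, mv * nr + mv  # first registers for Ar and Br
        lines = [".Lloop_kc:"]
        # Load one column of Ar (mr elements) and one row of Br (nr elements),
        # post-incrementing the pointers by 16 bytes per 128-bit load
        for i in range(mv):
            lines.append(f"    ldr q{a_base + i}, [x1], #16   // column of Ar")
        for j in range(nv):
            lines.append(f"    ldr q{b_base + j}, [x2], #16   // row of Br")
        # Update the micro-tile: one fmla (AXPY contribution) per
        # (vector row, column) pair
        for j in range(nr):
            for i in range(mv):
                lines.append(f"    fmla v{j * mv + i}.4s, v{a_base + i}.4s, "
                             f"v{b_base + j // VL_FP32}.s[{j % VL_FP32}]")
        lines += ["    subs x0, x0, #1", "    b.ne .Lloop_kc"]
        return "\n".join(lines)

    print(generate_main_loop(4, 4))   # main loop of a 4 x 4 micro-kernel (cf. Listing 4)
    # generate_main_loop(16, 8) raises an error: 38 registers would be required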

Automatic generation of advanced micro-kernels The basic generator has been extended to produce more sophisticated micro-kernels enhanced with advanced techniques such as loop unrolling and software pipelining [15]. For example, the former can be accommodated in the \(4 \times 4\) micro-kernel using the macro in Listing 7, which comprises the loads of \(A_{r}, B_{r}\) and the arithmetic in Listing 4 (lines 19–27).

Listing 7: Macros for the 4 × 4 micro-kernel.

We can then generate the main loop of the micro-kernel with an unrolling factor of 4 by replicating the loop body that number of times, via the macro, as shown in Listing 8. For simplicity, we do not show here how to extend the code for the cases where \(k_{c}\) is not an integer multiple of 4.

Listing 8: Main loop of the 4 × 4 micro-kernel unrolled with a factor of 4.

Software pipelining is a complementary technique that can be integrated into the automatic generator. During an iteration of the main loop, it pre-loads the data that will be utilised in the “next” iteration, separating these memory accesses from the arithmetic operations where the data are utilised. The excerpt of code in Listing 9 combines software pipelining and loop unrolling with a factor of 4. Again, for simplicity, we do not discuss the code required to cover the final iterations or the cases where \(k_{c}\) is not an integer multiple of 4.

Listing 9: Main loop of the 4 × 4 micro-kernel unrolled with a factor of 4 and with software pipelining.

Dealing with “corner” cases Our Python generator for assembly micro-kernels is not oblivious to the corner cases already discussed for the vector intrinsics. Indeed, when asked to produce an \(m_{r} \times n_{r}\) micro-kernel, the generator actually builds the requested one plus a full collection of smaller micro-kernels to tackle other micro-tile dimensions. In addition, a selection logic integrated into the gemm routine invokes the micro-kernel that best matches the specific dimensions of each corner case.
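A minimal sketch of this selection logic (hypothetical names; assuming the generated family is indexed by its micro-kernel dimensions and contains at least one entry that covers the edge micro-tile) could read as follows:

    def select_micro_kernel(family, m_edge, n_edge):
        # family: dict {(mr, nr): kernel} produced by the generator.
        # Pick the smallest generated micro-kernel that covers the edge
        # micro-tile of dimensions m_edge x n_edge.
        candidates = [(mr, nr) for (mr, nr) in family
                      if mr >= m_edge and nr >= n_edge]
        mr, nr = min(candidates, key=lambda dims: dims[0] * dims[1])
        return family[(mr, nr)]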

5 Experimental evaluation

In this section, we assess the performance of the gemm realisations embedding the automatically generated micro-kernels. For reference, we include in the comparison an evaluation of the gemm in optimised instances of BLIS, OpenBLAS and ArmPL for the target platforms.

5.1 Problem cases

Much of the interest of our work lies in the fact that the matrix multiplication kernel is the backbone of the convolution operation, once the im2col (or im2row) transform casts this operator into a gemm. The convolution is found in well-known neural network layers for signal processing (including computer vision) and, moreover, bears most of the computational weight of model execution. For example, in [16] we report that the convolution layers in the ResNet-50 v1.5 model combined with ImageNet can consume between 45% and 87% of the inference time, depending on the optimisations that are applied. Thus, given the interest in deploying deep learning technologies, the dataset for the experimentation here includes matrix multiplications with their dimensions determined by the application of im2col to the convolution layers in the neural networks ResNet-50 v1.5 [17] and GoogleLeNet [18], combined with the ImageNet dataset. In the experiments, the batch size is set to 1 sample, reflecting a latency-oriented scenario [16, 19].

5.2 Hardware setup

In the evaluation, we target the following three ARM-based development platforms:

  • An ARM Cortex-A78AE processor, embedded in the NVIDIA Jetson AGX Orin board, with a 64-KB L1 data cache, a 256-KB L2 cache, a 2-MB L3 cache, and a 32-GB LPDDR5 memory.

  • An NVIDIA Carmel processor in the NVIDIA Jetson AGX Xavier platform, with a 64-KB L1 data cache, a 2-MB L2 cache, a 4-MB L3 cache, and a 16-GB LPDDR4x memory.

  • An ARM Cortex-A57 processor, in the NVIDIA Jetson Nano board, with a 32-KB L1 data cache, a 2-MB L2 cache, and a 4-GB LPDDR4 memory.

These target systems, listed from highest to lowest computational power, are representative of the type of equipment that can be used to run machine learning inference workloads.

In order to reduce variability in the experiments, the frequency of the processor cores is fixed in all cases. A single core is employed in the three architectures, with a thread bound to it. All experiments are carried out in IEEE FP32 arithmetic, and they are repeated a large number of times, reporting the average results. Performance is measured in terms of billions of floating point operations per second (GFLOPS) or, in the final part of the section, in execution time (s).

5.3 Software setup

We focus on the performance gains that can be obtained when leveraging specific micro-kernels for the convolution operators in the ResNet and GoogleLeNet neural networks. The goals are to show the performance obtained in each layer with our gemm using the best micro-kernel for that layer and to demonstrate that, by choosing the appropriate computational kernel, i.e. the gemm with the appropriate micro-kernel dimensions and optimisation techniques, it is possible to obtain performance similar, or even superior in many cases, to that offered by the implementation of gemm in optimised libraries. Concretely, for reference, the comparison includes data for the gemm realisations contained in BLIS (version v0.8.1) [13], OpenBLAS (version v0.3.19) [8], and ArmPL (version v21.1) [20].

Fig. 2: Performance in GFLOPS of each representative convolutional layer of the neural networks ResNet-50 v1.5 (top) and GoogleLeNet (middle and bottom) on the NVIDIA Jetson AGX Orin

Fig. 3: Performance in GFLOPS of each representative convolutional layer of the neural networks ResNet-50 v1.5 (top) and GoogleLeNet (middle and bottom) on the NVIDIA Jetson AGX Xavier

Fig. 4: Performance in GFLOPS of each representative convolutional layer of the neural networks ResNet-50 v1.5 (top) and GoogleLeNet (middle and bottom) on the NVIDIA Jetson Nano

5.4 Performance per layer

Figures 2, 3 and 4 report the performance of the five gemm implementations on the three platforms, for the individual convolutional layers present in the two convolutional neural networks (CNNs). For the two most powerful platforms, the NVIDIA Jetson AGX Orin and Xavier, the results are similar, with our automatically generated micro-kernels being, in a large majority of layers, among the top three options and, in many cases, the best choice. The results are quite different for the NVIDIA Jetson Nano, where BLIS and OpenBLAS present superior performance. In consequence, we discuss these two cases separately.

NVIDIA Jetson AGX Xavier/Orin As the number of results is large, we will focus our comments on one scenario that we believe is representative of the remaining cases on these two platforms and both CNN models. In particular, we describe in detail the outcome of the execution of the convolutional layers in ResNet-50 v1.5 on the NVIDIA Jetson AGX Xavier; see the top plot in Fig. 3. The results there show that the pair of solutions that integrate our automatically generated micro-kernels (labelled as Autogen-ASM and Autogen-Templates) outperform the library-based implementations (labelled as BLIS, OpenBLAS, and ARMPL) for layers #2 to #5, #7, #10 to #12, #15, #16 (10 cases out of 20, that is, 50%). In contrast, ARMPL is the best option for layers #13, #17, #18 (3 cases out of 20, 15%). For layers #1, #6, #8, #9, #14, #20 (6 cases, 30%), the best performance is attained by Autogen-ASM/Templates and ARMPL, with little difference between them. Finally, in one layer, #19, Autogen-ASM/Templates and BLIS offer similar performance, superior to that of the other alternatives. In summary, our automatically generated codes deliver the highest performance in all except three layers, where they are outperformed by ARMPL.

In order to characterise these results, we first need to link them with the dimensions of the gemm associated with each layer; see Table 1. Depending on the value of m, the problems can be classified into four groups: large, medium, small, and tiny. This clustering offers a characterisation of performance for the two groups in the middle. Concretely, for medium m and \(k\ge 512\), Autogen-ASM/Templates offers the best performance while, for the same group and smaller k, the performance of that option is similar to that of ARMPL. Similarly, for small m and large k (\(\ge \) 1,024), Autogen-ASM/Templates is again the best, but its performance tends to decay with respect to ARMPL as k decreases. In general, for large m the best automatically generated micro-kernel is \(m_{r} \times n_{r}=20 \times 4\), moving towards other variants with smaller \(m_{r}\) (\(16 \times 4\), \(12 \times 8\), \(8 \times 12\)) as m also becomes smaller.

Table 1 Dimension of the gemm problems associated with the convolutional layers in ResNet-50 v1.5+ImageNet for batch size \(b=1\)

Explaining the different behaviour of the gemm instances requires a careful case-by-case analysis, as it is the consequence of a combination of factors related to the micro-kernel, which we review in the following list:

  • Micro-kernel. The dimensions of the micro-kernel and the ratio \(m_{r}/n_{r}\) determine its arithmetic intensity [12] and, therefore, its performance under ideal conditions. For example, in a separate experiment, we could determine that, on the NVIDIA Jetson AGX Xavier, the larger micro-kernels automatically generated with our tool (\(m_{r} \times n_{r}\) = 8\(\times \)12, 12\(\times \)8, 4\(\times \)16, 4\(\times \)20, 16\(\times \)4, and 20\(\times \)4) are compute-bound, while the smaller ones are bounded by the L2 cache bandwidth.

  • Dimensions \(n_{r}\) and k. According to the GotoBLAS scheme, a \(k_{c} \times n_{r}\) micro-panel of B should populate a significant fraction of the L1 cache, proportional to the ratio \(n_{r}/ (m_{r} + n_{r})\) [11]. Now, looking at the dimension k of the problems in Table 1, we observe that \(k_{c} \le k\) and, therefore, a large value of \(n_{r}\), which depends on that dimension of the micro-kernel, can yield a more efficient use of the L1 cache even when \(k_{c} =k\) is small.

  • Packing routines. The packing routines help to reduce cache evictions but, at the same time, introduce a certain overhead, which can be significant in case there is not enough reuse of the buffers \(A_{c}, B_{c}\). As the matrices A/B can be stored in either row- or column-major order, and they have to be packed into narrow micro-panels of \(m_{r}\) rows/\(n_{r}\) columns, respectively, the dimension of the micro-kernel interacts with the storage format to impact the cost of the packing procedures.

  • Edge cases. The cache configuration parameters \(m_{c}, n_{c}, k_{c} \), and the micro-kernel dimensions \(m_{r},n_{r} \) decompose the gemm problem into a collection of packing operations and micro-kernels of different sizes. When m is not an integer multiple of \(m_{c} \) and/or the latter is not an integer multiple of \(m_{r} \), these edge cases could benefit from specialised routines, which are not always efficiently implemented. The same applies to the trio \(n,n_{c}, n_{r}\); and the pair \(k,k_{c} \).

Note that in this list we have omitted the cache configuration parameters. These are set differently for each library/platform and obviously play a role in the performance, as they have a direct impact on the utilisation of the cache hierarchy. In summary, explaining how all these factors interact to determine the overall performance of a specific gemm implementation, for a particular problem dimension, is very interesting, but also quite complex, especially for sophisticated libraries implemented by others.

NVIDIA Jetson Nano The performance of the five routines on this system shows a wider variability in the best option. Our gemm routine does not stand out as optimal yet, in general, it does not lag far behind the best option. Concretely, some routines, such as ARMPL, perform well for many layers (e.g. #7, #20 of ResNet-50 v1.5) but quite poorly for others (layers #12, #16). OpenBLAS is better tuned for the core in the Nano than for that in the Xavier. The regularity in performance of Autogen-ASM is one of the most remarkable aspects of our proposal compared with the use of optimised libraries.

The superior performance of BLIS and OpenBLAS on the NVIDIA Jetson Nano is mainly due to a couple of factors: 1) BLIS and OpenBLAS contain micro-kernels with extensive use of hardware prefetching; and 2) BLIS and OpenBLAS provide vectorised versions of the packing routines. From our observations, while these two factors have no major effect on the two other platforms, on a resource-constrained system such as the NVIDIA Jetson Nano they explain the superior performance of the instances of gemm in these libraries.

5.5 Aggregated time

The GFLOPS rate provides a normalised metric to evaluate the performance of the different implementations of gemm, but it does not reveal the contribution of each layer to the total execution time and, therefore, the relevance of the differences. The final experiment, with results shown in Figs. 5, 6, and 7, reports the aggregated time for the two CNN models on the three platforms. To reflect a realistic execution, we report the execution time for all the convolution layers in the CNN models, not only those with distinct dimensions, as was the case in the previous experiments in this section.

The results show that, when comparing the option with automatically generated micro-kernels to the best library-based solution, the overall gain for the NVIDIA Jetson AGX Orin is small, between 2% and 3% depending on the CNN model; it is larger for the NVIDIA Jetson AGX Xavier, between 8% and 12%. Finally, in the NVIDIA Jetson Nano, the loss is between 6% and 10%.

Fig. 5: Aggregated time of the convolutional layers of ResNet-50 v1.5 (top) and GoogleLeNet (bottom) on the NVIDIA Jetson AGX Orin

Fig. 6: Aggregated time of the convolutional layers of ResNet-50 v1.5 (top) and GoogleLeNet (bottom) on the NVIDIA Jetson AGX Xavier

Fig. 7: Aggregated time of the convolutional layers of ResNet-50 v1.5 (top) and GoogleLeNet (bottom) on the NVIDIA Jetson Nano

6 Conclusion

We have proposed two approaches to automatically generate gemm micro-kernels that mimic the encoding effort done by an expert, relieving the programmer of a significant part of the effort required in (the initial steps of) this error-prone task. Concretely, our generators produce either a C code with vector intrinsics or directly an assembly routine, for any data type and micro-kernel dimensions. Furthermore, they integrate high-performance techniques such as loop unrolling and software pipelining.

Our experimental study shows the benefits of the automatic solution in comparison with optimised implementations of gemm in state-of-the-art libraries for three ARM-based processors and a representative collection of problem instances. The possibility of automatically generating a family of micro-kernels, and choosing the most efficient one as a function of the problem dimensions, is demonstrated to be key to outperforming the static implementation of gemm in these libraries, which only include a single micro-kernel per architecture.