1 Introduction

The general matrix multiplication (gemm) is a key computational kernel on top of which a significant part of the basic linear algebra subprograms (BLAS) [1] is built [2, 3]. In addition, gemm plays a fundamental role for convolutional (deep) neural networks that are prominent in computer vision tasks, as well as for transformers that are currently used in natural language processing [4, 5]. For these reasons, it is natural that considerable effort has been spent over the past decades to optimise gemm on virtually any computer architecture.

The performance of gemm is strongly determined by the efficiency of a small architecture-specific component, known as the micro-kernel [6,7,8]. Most modern instances of BLAS contain a single micro-kernel per processor architecture, usually encoded in assembly by a computer architecture expert. However, the benefits of choosing among multiple micro-kernels have been illustrated for deep learning in [9] and for dense linear algebra, as well as scientific computing in general in [10].

This paper contributes towards the development of optimised versions of gemm by presenting two methods to automatically generate competitive micro-kernels for ARM NEON (v8.2) processors equipped with single-instruction multiple-data (SIMD) vector units. This is especially interesting for processors from the same family that do not yet have a tuned instance of gemm. In more detail, our work makes the following specific contributions:

  • Initially, we generalise the solution proposed in [9] to leverage C++ templates in order to produce micro-kernels, based on vector intrinsics, at compilation time. This makes it possible to deal more efficiently with “corner” cases that arise when the cache configuration parameters of the architecture are not integer multiples of the micro-kernel dimensions. In addition, the adoption of templates eases the generation of code for distinct data types using a single generator.

  • Next, we take one step forward to directly produce assembly micro-kernels, using Python scripts. Compared with the previous solution, this method presents the advantage of enforcing a given order of the micro-kernel instructions that the compiler will not change.

  • For three distinct ARM-based architectures and a collection of representative problem instances arising from practical deep learning applications, we demonstrate that a gemm routine that integrates the automatically generated micro-kernels provides competitive performance, on par with or even outperforming the realisation of gemm in highly tuned linear algebra libraries such as BLIS, OpenBLAS and ArmPL.

At this point, it is worth noting that our work targets ARM NEON processors, yet the Python scripts can be easily adapted to target ARM SVE, AMD AVX2, Intel AVX512 or even the RISC-V vector extension.

The remainder of the paper is structured as follows: In Sect. 2, we briefly review the modern implementation of gemm for current processors, with SIMD vector units and a multilayered memory hierarchy. In Sects. 3 and 4, we describe the two automatic generators introduced in this paper, respectively, relying on vector intrinsics and assembly. In Sect. 5, we evaluate the performance of the different solutions, and finally, in Sect. 6, we close the paper with the conclusion.


2 Baseline implementation of GEMM

Fig. 1: Baseline algorithm for gemm (top), data transfers across the memory hierarchy (bottom-left), and packing (bottom-right)

Consider the gemm \(C = C + AB\), where A, B and C are matrices of dimensions \(m \times k\), \(k \times n\) and \(m \times n\), respectively. Modern high-performance instances of this computational kernel, for conventional processor architectures with deep memory hierarchies, follow GotoBLAS [6] to encode it as five nested loops comprising two packing routines and a micro-kernel. Furthermore, for processors with SIMD vector units, the micro-kernel consists of an additional loop that performs an outer product per iteration. Figure 1 (top) displays the baseline algorithm for gemm, identifying the six loops, the two packings, and the micro-kernel. Portable realisations of gemm encode the five outermost loops and the two packing routines of the baseline algorithm in a high-level programming language such as C. In contrast, for high performance, the micro-kernel is an architecture-specific piece of code, usually encoded in assembly.

Cache hierarchy The three outermost loops of the baseline algorithm partition the matrix operands conformal to the processor cache hierarchy. This specific nesting of the loops, together with a proper packing of A, B (see the bottom-right plot in Fig. 1) plus a careful selection of the cache configuration parameters \(m_{c}, n_{c}, k_{c}\) [11], favour that, during the execution of the micro-kernel, the buffers \(A_{c}\) and \(B_{c}\) remain in the L2 and L3 cache memories, respectively.

Micro-kernel for SIMD processors The micro-kernel streams an \(m_{r} \times n_{r}\) micro-tile \(C_{r}\) of C from the main memory into the processor registers; an \(m_{r} \times k_{c}\) micro-panel \(A_{r}\) of \(A_{c}\) from the L2 cache; and a \(k_{c} \times n_{r}\) micro-panel \(B_{r}\) of \(B_{c}\) from the L1 cache; see the bottom-left plot in Fig. 1. Packings also ensure that the contents of buffers \(A_{r}, B_{r}\) are accessed with a unit stride during the execution of the micro-kernel, enabling the use of vector instructions to retrieve their contents. The arithmetic-to-memory access ratio (or arithmetic intensity [12]) of the micro-kernel is given by

$$\begin{aligned} \frac{2m_{r}\,n_{r}\,k_{c}}{(m_{r} +n_{r})\,k_{c}} =\frac{2m_{r}\,n_{r}}{m_{r} + n_{r}} \quad \text {flops}\,\text {per}\,\text {memory}\,\text {access}.\end{aligned}$$

Choosing large values for \(m_{r} \approx n_{r}\) thus maximises this ratio. For the same reason, it is convenient to maximise the use of the vector registers, without incurring in register spilling [11], by ensuring that

$$\begin{aligned} \frac{m_{r}}{{\texttt {vl}}\_{\texttt {fp}}}\, n_{r} + \frac{m_{r}}{{\texttt {vl}}\_{\texttt {fp}}} + \frac{n_{r}}{{\texttt {vl}}\_{\texttt {fp}}} \le {\texttt {vr}}, \end{aligned}$$

where \({\texttt {vl}}\_{\texttt {fp}}\) denotes the number of floating point numbers (elements) that fit into a single vector register and \({\texttt {vr}}\) is the number of vector registers.
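As an illustration, on an ARM NEON (v8.2) core with \({\texttt {vr}}=32\) vector registers and FP32 data (\({\texttt {vl}}\_{\texttt {fp}}=4\)), a \(12 \times 8\) micro-kernel fits within the register file while attaining a high arithmetic intensity:

$$\begin{aligned} \frac{12}{4}\cdot 8 + \frac{12}{4} + \frac{8}{4} = 29 \le 32, \qquad \frac{2\cdot 12\cdot 8}{12 + 8} = 9.6 \quad \text {flops}\,\text {per}\,\text {memory}\,\text {access}. \end{aligned}$$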

Parallelism The multi-threaded parallelization of gemm has been analysed for conventional multicore processors, modern many-threaded architectures, and asymmetric ARM-based processors in [7, 13, 14]. In those works, the parallelization of gemm is approached by extracting parallelism from any of the loops L1, L3, L4, L5, or a combination of them. (Loops L2 and L6 are not convenient as they introduce race conditions.)

The parallelization technique is rather orthogonal to the micro-kernel: As the parallelization targets one (or more) of the five outermost loops of gemm (L1–L5, except L2), and the micro-kernel only comprises the sixth loop (L6), any of the micro-kernels proposed in this work can be combined with the parallel approaches proposed in the literature. In consequence, and in order to keep the paper focused on the generation of the micro-kernel, we will omit the analysis of parallelism in the following.

3 ARM NEON micro-kernels for GEMM using vector intrinsics

In this section, we pursue the development of architecture-specific SIMD-based micro-kernels for ARM processors using vector intrinsics. However, instead of a manual development process, we demonstrate that it is feasible to generate the micro-kernels automatically, significantly easing this task while delivering fair performance.

For simplicity, hereafter we choose the 32-bit floating point (FP32) as the basic data type for all the routines/codes presented in the section (in C language, float). Furthermore, we target ARM NEON v8.2, for which the vector length is 128 bits (i.e. 16 bytes) so that, for FP32, vl_fp32=4 (FP32 numbers per vector register). The same ideas for automatic generation apply to other data types and SIMD-enabled processor architectures.

3.1 A simple generic micro-kernel

In [9], we took a significant step forward toward improving the portability and maintainability of the BLAS by introducing a “generic” (i.e. multiplatform) scheme that relies on C macros to abstract the vector data type and the basic vector intrinsics for load, store, and axpy (scalar \(\alpha \) times x plus y) update. In addition, when supported by the compiler, the generic micro-kernel can accommodate a micro-tile of any dimension \(m_{r} \times n_{r}\) using arrays of vector registers.

Listing 1: Generic micro-kernel that operates with an mr × nr micro-tile using vector intrinsics.

Listing 2: C macros to customise the generic micro-kernel for ARM NEON (v8.2) and FP32.

In order to present the solution in [9], assume for simplicity that \(m_{r}, n_{r}\) are both integer multiples of vl_fp32. We then need \(m_{v} \, n_{r} = (m_{r}/{\texttt {vl}}\_{\texttt {fp}}32) \, n_{r}\) vector registers to store the micro-tile \(C_{r}\); \(m_{v}\) for a single column of \(A_{r}\); and \(n_{v} = n_{r}/{\texttt {vl}}\_{\texttt {fp}}32\) for a single row of \(B_{r}\). Listing 1 displays our original generic micro-kernel, where we highlight a couple of details:

  • Prior to the main loop, indexed by kr (line 25), we load the contents of the \(m_{r} \times n_{r}\) micro-tile of C into the array of vector registers Cr via two nested loops (lines 21–23). The transfer in the opposite direction, from the vector registers back to the memory, is carried out after the main loop (lines 40–42).

  • At each iteration of the main loop, a column of \(m_{r}\) elements of \(A_{r}\) and a row of \(n_{r}\) elements of \(B_{r}\) are first loaded into the appropriate vector registers (lines 27–28 for ar and lines 29–30 for br, respectively). These elements then participate in the update of the micro-tile stored in Cr (lines 33–36).

The generic micro-kernel in Listing 1 is customised for ARM NEON (v8.2) and FP32 using the C macros displayed in Listing 2.

3.2 Evolving the generic micro-kernel

For the gemm baseline algorithm, when \(m_{c}, n_{c}\) are not integer multiples of \(m_{r}, n_{r}\), respectively, a significant benefit can be obtained by developing specialised micro-kernels which employ SIMD instructions to handle the “corner” cases. To illustrate this, imagine we have adopted an \(8 \times 8\) micro-kernel as the baseline. We will then encounter certain cases where it is necessary to process a micro-tile of smaller dimensions, for example \(4 \times 8\). In that particular case, it may be more efficient to employ a \(4 \times 8\) micro-kernel with vector instructions to update this smaller micro-tile than to employ either a scalar (i.e. non-SIMD) routine or the baseline \(8 \times 8\) micro-kernel. Furthermore, the corner cases where one or both micro-tile dimensions are not integer multiples of the vector register length (e.g. \(3 \times 7\)) can be dealt with via a micro-kernel whose dimensions are the immediately larger integer multiples (e.g. \(m_{r} \times n_{r}=4 \times 8\)), exploiting the fact that the buffers \(A_{c}, B_{c}\) accommodate this excess in the dimensions, and using scalar instructions to load and store only the necessary elements of \(C_{r}\).
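As a simple illustration of this rounding, the following minimal sketch (with hypothetical names, assuming FP32 data and the NEON vector length vl_fp32 = 4) maps the dimensions of an edge micro-tile to the smallest SIMD-friendly micro-kernel that covers it:

    # Minimal sketch: round the dimensions of an edge micro-tile up to the
    # nearest integer multiples of the vector length, so that a SIMD
    # micro-kernel of that (larger) size can process it.
    VL_FP32 = 4  # FP32 elements per 128-bit NEON vector register

    def padded_micro_kernel_dims(mr_edge, nr_edge, vl=VL_FP32):
        round_up = lambda x: ((x + vl - 1) // vl) * vl
        return round_up(mr_edge), round_up(nr_edge)

    print(padded_micro_kernel_dims(3, 7))  # (4, 8): covers a 3 x 7 corner micro-tile
    print(padded_micro_kernel_dims(4, 8))  # (4, 8): already SIMD-friendly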

Listing 3: Generic micro-kernel that leverages C++17 templates to operate with any mr × nr micro-tile using vector intrinsics.

To address these scenarios, the top part of Listing 3 presents an enhanced version of the generic micro-kernel in Listing 1 that utilises C++ templates to facilitate the generation of a collection of micro-kernels for any combination of \(m_{r}\) and \(n_{r}\). To achieve this, a set of auxiliary template functions are responsible for unrolling the micro-kernel loops using recursive integer template parameters and constant conditional (static-if) expressions specific to the C++17 standard (see Listing 10 in the appendix for details). During compilation, these functions are evaluated to generate the appropriate instructions within the loop body based on the values of \(m_{r}\) and \(n_{r}\). For instance, in the main function depicted at the bottom part of Listing 3, we instantiate gemm micro-kernels for sizes \(4 \times 4\), \(4 \times 8\), and \(8 \times 4\).

In conclusion, as in the case of the initial generic micro-kernel, this template-based version produces customised code for any architecture. It also accommodates the generation of code for a family of micro-kernels of different sizes at compile time.

4 ARM NEON micro-kernels for GEMM using assembly

In this section, we also address automatic generation of micro-kernels, but now targeting architecture-specific SIMD-based routines for ARM NEON processors using assembly instead of vector intrinsics.

4.1 Simple micro-kernels using ARM NEON assembly

In Listing 4, we show a simple micro-kernel of dimension \(m_{r} \times n_{r} = 4 \times 4\) for FP32 data and encoded using ARM NEON assembly. The routine receives the same five parameters as its counterpart with vector intrinsics in Listing 1: The dimension \(k_{c}\); address pointers to \(A_{r}\), \(B_{r}\) and C; and the leading dimension ldC. Note the connection between the two versions of the micro-kernel, emphasised with the use of the same comments for those blocks of the two codes that perform analogous functions. The assembly routine proceeds as follows:

  • From the address pointer C (\(\equiv \) C00) to the appropriate entry of the matrix C, the routine initialises the pointers C01, C02, C03 to the remaining three columns of the \(4 \times 4\) micro-tile taking into account the matrix column stride (lines 9–11).

  • Prior to the main loop, four columns of C, each consisting of four FP32 numbers, are loaded into four vector registers, Crq00–Crq03, using the assembly SIMD instruction ldr, which has an analogous function to that of the vector intrinsic vld1q_f32 (lines 13–16). After the loop, the contents of these vector registers are stored back into the matrix C using the assembly SIMD instruction str, with an analogous function to that of the vector intrinsic vst1q_f32 (lines 33–36). Note that CrqXY and CrvXY refer to the same register; the name used depends on whether the register is involved in a memory (load/store) instruction or an arithmetic operation, respectively.

  • At each iteration of the main loop, the routine loads one column of \(A_{r}\) into a vector register arq0 (line 19), one row of \(B_{r}\) into a vector register brq0 (line 21), and updates Crv00–Crv03 via four AXPYs (lines 24–27) using the assembly SIMD instruction fmla, which performs a vector fused multiply-add (functionally equivalent to the vector intrinsic vfmaq_laneq_f32).

Listing 4: Micro-kernel that operates with a 4 × 4 micro-tile using ARM NEON assembly.

Listing 5: Macros for the micro-kernel using ARM NEON assembly.

The \(4\times 4\) micro-kernel in Listing 4 exhibits a regular structure that can be generalised to many other dimensions. (This is demonstrated, for example, with the excerpt of code in Listing 11 provided in the appendix, which corresponds to the main loop of a \(12 \times 8\) assembly micro-kernel for ARM NEON and FP32 data.) Concretely, we identify the following characteristics, which are independent of the micro-kernel dimensions:

  1. The contents of the micro-tile are retrieved from memory into vector registers prior to the loop and written from there back to memory after it.

  2. The micro-kernel employs \(m_{v} \times n_{r}\) vector registers to keep the contents of the \(m_{r} \times n_{r}\) micro-tile.

  3. At each iteration, the loop loads the contents of one column of \(A_{r}\) and one row of \(B_{r}\), and then uses them to update the micro-tile once.

This similarity is the basis that motivates the automatic generation of assembly micro-kernels.

4.2 A Python generator of micro-kernels using ARM NEON assembly

A basic generator The regularity of the basic micro-kernels can be leveraged to generate their code automatically using the Python routine in Listing 6. (Indeed, the ARM NEON assembly codes for the \(4 \times 4\) and \(12 \times 8\) micro-kernels in Listings 4, 5, and 11 were obtained using this generator.) Inspecting the instructions of the generator, we can easily identify the different parts that produce the code fragments for the load/store of C, \(A_{r}\), \(B_{r}\), and the arithmetic. Note that this generator assumes that C is stored in column-major order. In case C is stored in row-major order, we can still use the same micro-kernel by swapping the roles of \(A_{r}\) and \(B_{r}\) and adjusting the leading dimension of C accordingly.

Listing 6: Micro-kernel generator that operates with an mr × nr micro-tile using ARM NEON assembly.

The simple generator in Listing 6 builds a micro-kernel that operates with an \(m_{r} \times n_{r}\) micro-tile of C, assuming there are enough vector registers for this. This can result in a compilation error if the number of utilised vector registers exceeds the maximum. This would be the case, for example, of a \(16 \times 8\) micro-kernel, which would require 32 vector registers for the micro-tile of C, 4 for the column of \(A_{r}\), and 2 for the row of \(B_{r}\), for a total of 38. In the actual generator, this type of situation is avoided with a simple logic test.
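To make the structure of such a generator concrete, the following minimal sketch (a simplification for illustration, not the actual generator of Listing 6) emits only the main loop of an \(m_{r} \times n_{r}\) FP32 micro-kernel and performs the register-budget test; the register assignment (kc in x0, the \(A_{r}\) pointer in x1, the \(B_{r}\) pointer in x2) and the omission of the prologue/epilogue that load/store the micro-tile of C are assumptions made to keep the sketch short:

    VL_FP32 = 4   # FP32 elements per 128-bit NEON vector register
    VR      = 32  # number of NEON vector registers

    def generate_main_loop(mr, nr):
        assert mr % VL_FP32 == 0 and nr % VL_FP32 == 0
        mv, nv = mr // VL_FP32, nr // VL_FP32
        # Register budget: micro-tile + one column of Ar + one row of Br
        needed = mv * nr + mv + nv
        if needed > VR:
            raise ValueError(f"a {mr} x {nr} micro-kernel needs {needed} "
                             f"vector registers (only {VR} available)")
        a_base, b_base = mv * nr, mv * nr + mv  # first registers for Ar and Br
        lines = [".Lloop_kc:"]
        # Load one column of Ar (mr elements) and one row of Br (nr elements),
        # post-incrementing the pointers by 16 bytes per 128-bit load
        for i in range(mv):
            lines.append(f"    ldr q{a_base + i}, [x1], #16   // column of Ar")
        for j in range(nv):
            lines.append(f"    ldr q{b_base + j}, [x2], #16   // row of Br")
        # Update the micro-tile: one fmla (AXPY contribution) per
        # (vector row, column) pair
        for j in range(nr):
            for i in range(mv):
                lines.append(f"    fmla v{j * mv + i}.4s, v{a_base + i}.4s, "
                             f"v{b_base + j // VL_FP32}.s[{j % VL_FP32}]")
        lines += ["    subs x0, x0, #1", "    b.ne .Lloop_kc"]
        return "\n".join(lines)

    print(generate_main_loop(4, 4))   # main loop of a 4 x 4 micro-kernel (cf. Listing 4)
    # generate_main_loop(16, 8) raises an error: 38 registers would be required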

Automatic generation of advanced micro-kernels The basic generator has been extended to produce more sophisticated micro-kernels enhanced with advanced techniques such as loop unrolling and software pipelining [15]. For example, the former can be accommodated in the \(4 \times 4\) micro-kernel using the macro in Listing 7, which comprises the loads of \(A_{r}, B_{r}\) and the arithmetic in Listing 4 (lines 19–27).

Listing 7: Macros for the 4 × 4 micro-kernel.

We can then generate the main loop of the micro-kernel with an unrolling factor of 4 by replicating the loop body that number of times, via the macro, as shown in Listing 8. For simplicity, we do not show here how to extend the code for the cases where \(k_{c}\) is not an integer multiple of 4.

Listing 8: Main loop of the 4 × 4 micro-kernel unrolled with a factor of 4.

Software pipelining is a complementary technique that can be integrated into the automatic generator. During an iteration of the main loop, it pre-loads the data that will be utilised in the “next” iteration, separating these memory accesses from the arithmetic operations where the data are utilised. The excerpt of code in Listing 9 combines software pipelining and loop unrolling with a factor of 4. Again, for simplicity, we do not discuss the code required to cover the final iterations or the cases where \(k_{c}\) is not an integer multiple of 4.

Listing 9: Main loop of the 4 × 4 micro-kernel unrolled with a factor of 4 and with software pipelining.

Dealing with “corner” cases Our Python generator for assembly micro-kernels is not oblivious to the corner cases already discussed for the vector intrinsics. Indeed, when asked to produce an \(m_{r} \times n_{r}\) micro-kernel, the generator actually builds the requested one plus a full collection of smaller micro-kernels to tackle other micro-tile dimensions. In addition, a selection logic integrated into the gemm routine invokes the micro-kernel that best matches the specific dimensions of each corner case.
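A minimal sketch of this selection logic (hypothetical names; assuming the generated family is indexed by its micro-kernel dimensions and contains at least one entry that covers the edge micro-tile) could read as follows:

    def select_micro_kernel(family, m_edge, n_edge):
        # family: dict {(mr, nr): kernel} produced by the generator.
        # Pick the smallest generated micro-kernel that covers the edge
        # micro-tile of dimensions m_edge x n_edge.
        candidates = [(mr, nr) for (mr, nr) in family
                      if mr >= m_edge and nr >= n_edge]
        mr, nr = min(candidates, key=lambda dims: dims[0] * dims[1])
        return family[(mr, nr)]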

5 Experimental evaluation

In this section, we assess the performance of the gemm realisations embedding the automatically generated micro-kernels. For reference, we include in the comparison an evaluation of the gemm in optimised instances of BLIS, OpenBLAS and ArmPL for the target platforms.

5.1 Problem cases

Much of the interest of our work lies in the fact that the matrix multiplication kernel is the backbone of the convolution operation, once the im2col (or im2row) transform casts this operator into a gemm. The convolution is found in well-known neural network layers for signal processing (including computer vision) and, moreover, bears most of the computational weight of model execution. For example, in [16] we report that the convolution layers in the ResNet-50 v1.5 model combined with ImageNet can consume between 45% and 87% of the inference time, depending on the optimisations that are applied. Thus, given the interest in deploying deep learning technologies, the dataset for the experimentation here includes matrix multiplications with their dimensions determined by the application of im2col to the convolution layers in the neural networks ResNet-50 v1.5 [17] and GoogleLeNet [18], combined with the ImageNet dataset. In the experiments, the batch size is set to 1 sample, reflecting a latency-oriented scenario [16, 19].

5.2 Hardware setup

In the evaluation, we target the following three ARM-based development platforms:

  • An ARM Cortex-A78AE processor, embedded in the NVIDIA Jetson AGX Orin board, with a 64-KB L1 data cache, a 256-KB L2 cache, a 2-MB L3 cache, and a 32-GB LPDDR5 memory.

  • An NVIDIA Carmel processor in the NVIDIA Jetson AGX Xavier platform, with a 64-KB L1 data cache, a 2-MB L2 cache, a 4-MB L3 cache, and a 16-GB LPDDR4x memory.

  • An ARM Cortex-A57 processor, in the NVIDIA Jetson Nano board, with a 32-KB L1 data cache, a 2-MB L2 cache, and a 4-GB LPDDR4 memory.

These target systems, listed from highest to lowest computational power, are representative of the type of equipment that can be used to run machine learning inference workloads.

In order to reduce variability in the experiments, the frequency of the processor cores is fixed in all cases. A single core is employed in the three architectures, with a thread bound to it. All experiments are carried out in IEEE FP32 arithmetic, and they are repeated a large number of times, reporting the average results. Performance is measured in terms of billions of floating point operations per second (GFLOPS) or, in the final part of the section, in execution time (s).

5.3 Software setup

We focus on the performance gains that can be obtained when leveraging specific micro-kernels for the convolution operators in the ResNet and GoogleLeNet neural networks. The goals are to show the performance obtained in each layer with our gemm using the best micro-kernel for that layer and to demonstrate that, by choosing the appropriate computational kernel, i.e. the gemm with the appropriate micro-kernel dimensions and optimisation techniques, it is possible to obtain performance similar, or even superior in many cases, to that offered by the implementation of gemm in optimised libraries. Concretely, for reference, the comparison includes data for the gemm realisations contained in BLIS (version v0.8.1) [13], OpenBLAS (version v0.3.19) [8], and ArmPL (version v21.1) [20].

Fig. 2: Performance in GFLOPS of each representative convolutional layer of the neural networks ResNet-50 v1.5 (top) and GoogleLeNet (middle and bottom) on the NVIDIA Jetson AGX Orin

Fig. 3: Performance in GFLOPS of each representative convolutional layer of the neural networks ResNet-50 v1.5 (top) and GoogleLeNet (middle and bottom) on the NVIDIA Jetson AGX Xavier

Fig. 4: Performance in GFLOPS of each representative convolutional layer of the neural networks ResNet-50 v1.5 (top) and GoogleLeNet (middle and bottom) on the NVIDIA Jetson Nano

5.4 Performance per layer

Figures 2, 3 and 4 report the performance of the five gemm implementations on the three platforms, for the individual convolutional layers present in the two convolutional neural networks (CNNs). For the two most powerful platforms, the NVIDIA Jetson AGX Orin and Xavier, the results are similar, with our automatically generated micro-kernels being, in a large majority of layers, among the top three options and, in many cases, the best choice. The results are quite different for the NVIDIA Jetson Nano, where BLIS and OpenBLAS present superior performance. In consequence, we discuss these two cases separately.

NVIDIA Jetson AGX Xavier/Orin As the number of results is large, we will focus our comments on one scenario that we believe is representative of the remaining cases on these two platforms and both CNN models. In particular, we describe in detail the outcome of the execution of the convolutional layers in ResNet-50 v1.5 on the NVIDIA Jetson AGX Xavier; see the top plot in Fig. 3. The results there show that the pair of solutions that integrate our automatically generated micro-kernels (labelled as Autogen-ASM and Autogen-Templates) outperform the library-based implementations (labelled as BLIS, OpenBLAS, and ARMPL) for layers #2 to #5, #7, #10 to #12, #15, #16 (10 cases out of 20, that is, 50%). In contrast, ARMPL is the best option for layers #13, #17, #18 (3 cases out of 20, 15%). For layers #1, #6, #8, #9, #14, #20 (6 cases, 30%), the best performance is attained by Autogen-ASM/Templates and ARMPL, with little difference between them. Finally, in one layer, #19, Autogen-ASM/Templates and BLIS offer similar performance, superior to that of the other alternatives. In summary, our automatically generated codes deliver the highest performance in all except three layers, where they are outperformed by ARMPL.

In order to characterise these results, we first need to link them with the dimensions of the gemm associated with each layer; see Table 1. Depending on the value of m, the problems can be classified into four groups: large, medium, small, and tiny. This clustering offers a characterisation of performance for the two groups in the middle. Concretely, for medium m and \(k\ge 512\), Autogen-ASM/Templates offers the best performance while, for the same group and smaller k, the performance of that option is similar to that of ARMPL. Similarly, for small m and large k (\(\ge \) 1,024), Autogen-ASM/Templates is again the best, but its performance tends to decay with respect to ARMPL as k decreases. In general, for large m the best automatically generated micro-kernel is \(m_{r} \times n_{r}=20 \times 4\), moving towards other variants with smaller \(m_{r}\) (\(16 \times 4\), \(12 \times 8\), \(8 \times 12\)) as m also becomes smaller.

Table 1 Dimension of the gemm problems associated with the convolutional layers in ResNet-50 v1.5+ImageNet for batch size \(b=1\)

Explaining the different behaviour of the gemm instances requires a careful case-by-case analysis, as it is the consequence of a combination of factors related to the micro-kernel, which we review in the following list:

  • Micro-kernel. The dimensions of the micro-kernel and the ratio \(m_{r}/n_{r}\) determine its arithmetic intensity [12] and, therefore, its performance under ideal conditions. For example, in a separate experiment, we could determine that, on the NVIDIA Jetson AGX Xavier, the larger micro-kernels automatically generated with our tool (\(m_{r} \times n_{r}\) = 8\(\times \)12, 12\(\times \)8, 4\(\times \)16, 4\(\times \)20, 16\(\times \)4, and 20\(\times \)4) are compute-bound, while the smaller ones are bounded by the L2 cache bandwidth.

  • Dimensions \(n_{r}\) and k. According to the GotoBLAS scheme, a \(k_{c} \times n_{r}\) micro-panel of B should populate a significant fraction of the L1 cache, proportional to the ratio \(n_{r}/ (m_{r} + n_{r})\) [11]. Now, looking at the dimension k of the problems in Table 1, we observe that \(k_{c} \le k\) and, therefore, a large value of \(n_{r}\), which depends on that dimension of the micro-kernel, can yield a more efficient use of the L1 cache even when \(k_{c} =k\) is small.

  • Packing routines. The packing routines help to reduce cache evictions but, at the same time, introduce a certain overhead, which can be significant in case there is not enough reuse of the buffers \(A_{c}, B_{c}\). As the matrices A/B can be stored in either row- or column-major order, and they have to be packed into narrow micro-panels of \(m_{r}\) rows/\(n_{r}\) columns, respectively, the dimension of the micro-kernel interacts with the storage format to impact the cost of the packing procedures.

  • Edge cases. The cache configuration parameters \(m_{c}, n_{c}, k_{c} \), and the micro-kernel dimensions \(m_{r},n_{r} \) decompose the gemm problem into a collection of packing operations and micro-kernels of different sizes. When m is not an integer multiple of \(m_{c} \) and/or the latter is not an integer multiple of \(m_{r} \), these edge cases could benefit from specialised routines, which are not always efficiently implemented. The same applies to the trio \(n,n_{c}, n_{r}\); and the pair \(k,k_{c} \).

Note that in this list we have omitted the cache configuration parameters. These are set differently for each library/platform and obviously play a role in the performance, as they have a direct impact on the utilisation of the cache hierarchy. In summary, explaining how all these factors interact to determine the overall performance of a specific gemm implementation, for a particular problem dimension, is very interesting, but also quite complex, especially for sophisticated libraries implemented by others.

NVIDIA Jetson Nano The performance of the five routines on this system shows a wider variability in the best option. Our gemm routine does not stand out as optimal yet, in general, it does not lag far behind the best option. Concretely, some routines, such as ARMPL, perform well for many layers (e.g. #7, #20 of ResNet-50 v1.5) but quite poorly for others (layers #12, #16). OpenBLAS is better tuned for the core in the Nano than for that in the Xavier. The regularity in performance of Autogen-ASM is one of the most remarkable aspects of our proposal compared with the use of optimised libraries.

The superior performance of BLIS and OpenBLAS on the NVIDIA Jetson Nano is mainly due to a couple of factors: 1) BLIS and OpenBLAS contain micro-kernels with extensive use of hardware prefetching; and 2) BLIS and OpenBLAS provide vectorised versions of the packing routines. From our observations, while these two factors have no major effect on the two other platforms, on a resource-constrained system such as the NVIDIA Jetson Nano they explain the superior performance of the instances of gemm in these libraries.

5.5 Aggregated time

The GFLOPS rate provides a normalised metric to evaluate the performance of the different implementations of gemm, but it does not reveal the contribution of each layer to the total execution time and, therefore, the relevance of the differences. The final experiment, with results shown in Figs. 5, 6, and 7, reports the aggregated time for the two CNN models on the three platforms. To reflect a realistic execution, we report the execution time for all the convolution layers in the CNN models, not only those with distinct dimensions, as was the case in the previous experiments in this section.

The results show that, when comparing the option with automatically generated micro-kernels to the best library-based solution, the overall gain for the NVIDIA Jetson AGX Orin is small, between 2% and 3% depending on the CNN model; it is larger for the NVIDIA Jetson AGX Xavier, between 8% and 12%. Finally, in the NVIDIA Jetson Nano, the loss is between 6% and 10%.

Fig. 5: Aggregated time of the convolutional layers of ResNet-50 v1.5 (top) and GoogleLeNet (bottom) on the NVIDIA Jetson AGX Orin

Fig. 6: Aggregated time of the convolutional layers of ResNet-50 v1.5 (top) and GoogleLeNet (bottom) on the NVIDIA Jetson AGX Xavier

Fig. 7: Aggregated time of the convolutional layers of ResNet-50 v1.5 (top) and GoogleLeNet (bottom) on the NVIDIA Jetson Nano

6 Conclusion

We have proposed two approaches to automatically generate gemm micro-kernels that mimic the encoding effort done by an expert, relieving the programmer of a significant part of the effort required in (the initial steps of) this error-prone task. Concretely, our generators produce either a C code with vector intrinsics or directly an assembly routine, for any data type and micro-kernel dimensions. Furthermore, they integrate high-performance techniques such as loop unrolling and software pipelining.

Our experimental study shows the benefits of the automatic solution in comparison with optimised implementations of gemm in state-of-the-art libraries for three ARM-based processors and a representative collection of problem instances. The possibility of automatically generating a family of micro-kernels, and choosing the most efficient one as a function of the problem dimensions, is demonstrated to be key to outperforming the static implementation of gemm in these libraries, which only include a single micro-kernel per architecture.