Parallel GEMM-based Convolution for Deep Learning on Multicore RISC-V Processors

We address the efficient implementation of the convolution operator on the GAP8 parallel ultra-low power platform (PULP), a heterogeneous multi-core processor equipped with a fabric controller (FC); a cluster of eight compute cores; and a four-level memory hierarchy with scratchpads instead of conventional, hardware-assisted cache memories. Our solution for this platform transforms the convolution into a general matrix-matrix multiplication (gemm) via the lowering approach, demonstrating that it is possible to attain reasonable performance on the GAP8 by carefully adapting techniques such as tiling and loop parallelism, which are mainstream in the multi-threaded, cache-aware realization of gemm.


Introduction
Implementing deep learning (DL) algorithms on edge devices for Internet of Things (IoT) applications is critical to enhance privacy and security. In addition, moving the computation from the cloud to IoT nodes closer to sensors can significantly reduce the amount of data sent over the network, thereby reducing latency and energy consumption [1,2,3]. The wide variety of IoT applications, many of which rely on DL technologies, has led to a broad range of edge processor architectures, including cores with the RISC-V ISA (instruction set architecture) [4]. This diversity, combined with severe constraints on power, memory capacity and computational performance for edge devices, calls for a careful selection of algorithms and the optimization of the software running on them.
In this work, we focus on the implementation of convolutional deep neural networks (DNNs) on edge processors. With this objective, we parallelize a popular algorithm for the convolution operator based on the lowering approach, which decomposes the operation into a data replication transform, known as im2col or im2row, followed by a general matrix-matrix multiplication (gemm) [5]. Moreover, we target the heterogeneous 1+8 RISC-V cores integrated into the GAP8 parallel ultra-low power platform (PULP) for IoT. In more detail, this paper makes the following contributions:
- We develop a high performance, multi-threaded implementation of gemm that operates with 8-bit integer (INT8) data and arithmetic on top of the dot (scalar or inner) product, a basic kernel that receives special support in the GAP8. In our solution, the 8 compute cores of the GAP8 are in charge of all the arithmetic while the remaining core, known as the fabric controller (FC), coordinates the data movements.
- We orchestrate a careful sequence of data transfers across the memory areas of the GAP8 via DMA transfers, embedding these movements into the tiling techniques of a parallel blocked algorithm for gemm.
- We perform a complete experimental evaluation of the convolution realization for the two afore-mentioned transforms: im2col and im2row.
The rest of the paper is structured as follows. In Section 2 we briefly present the convolution operator, and in Section 3 we review the high performance implementation of gemm on multicore processors with a multi-layered memory including caches. In Section 4 we detail the main features of the GAP8 platform. In Section 5 we describe the approach to obtain a parallel high performance algorithm for gemm on the GAP8 system. In Section 6, we evaluate the resulting routine. Finally, in Section 7 we close the paper with a few concluding remarks.

Convolution via IM2COL+GEMM
In this section, we first introduce the convolution operation [6] and then present the im2col and im2row transforms which, combined with gemm, potentially yield a high performance approach to compute this operator, at the cost of an augmented workspace and some data copies [5].
The convolution operator O = Conv(F, I) receives an input activation tensor I, of dimension b × c_i × h_i × w_i, where b (known as the batch size) denotes the number of input images, c_i specifies the number of input image channels, and h_i × w_i are the input image height × width. In addition, the convolution also receives an input filter (or kernel) tensor F, of dimension c_o × c_i × h_f × w_f, where c_o is the number of filters and h_f × w_f denote the filter height × width. The algorithm in Figure 1 provides a direct realization of the convolution operator. There, each individual filter combines a subset of the inputs, with the same dimension as the filter, to produce a single scalar value (or entry) in one of the c_o outputs. By repeatedly applying the filter to the whole input, with a certain horizontal/vertical stride s, the convolution operator thus obtains the entries of this single output [6]. Assuming vertical and horizontal padding factors given by p_h and p_w, respectively, the output height × width dimensions are given by

h_o = ⌊(h_i + 2 p_h − h_f)/s⌋ + 1,   w_o = ⌊(w_i + 2 p_w − w_f)/s⌋ + 1.
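The output-dimension formula above can be sketched as a one-line helper (a plain-C illustration; the function name is ours, not taken from the paper's code):

```c
#include <assert.h>

/* Output spatial dimension of a convolution: (in + 2*pad - filter)/stride + 1,
   with integer (floor) division. Illustrative helper, not the paper's code. */
static int conv_out_dim(int in, int pad, int filter, int stride) {
    return (in + 2 * pad - filter) / stride + 1;
}
```

For instance, a 224-wide input with a 3-wide filter, padding 1 and stride 2 produces a 112-wide output.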

Indirect convolution via the im2col/im2row transforms
On current computer architectures, the performance of the direct algorithm in Figure 1 is strongly constrained by the memory bandwidth and, therefore, this approach in general delivers only a fraction of the processor peak floating-point throughput. In practice, this drawback is usually tackled by adopting an indirect or gemm-based approach, which casts this operator in terms of a matrix multiplication via either the im2col or im2row transform [5]. This realization is often referred to as the lowering algorithm.
In short, Figure 2 displays the im2col algorithm that "flattens" the 4-dimensional (4D) input tensor I into an augmented (2-dimensional, 2D) matrix B, so that the output of the convolution can then be obtained from the gemm Ĉ = Â · B, where Â contains the filters and B is the aforementioned augmented matrix. For simplicity, the algorithm shown in Figure 2 does not take into account the memory accesses when the stride of the convolution is higher than one. In addition, the actual implementation of this transform eliminates some of the loop invariants inside several loops to reduce the indexing overhead.
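A minimal sketch of the transform, assuming a single input (b = 1), unit stride, no padding, and a CHW layout of I (the layout choice and all names are ours, not the paper's exact routine):

```c
#include <assert.h>

/* Minimal im2col sketch: input I[c][h][w] is flattened into B with
   k = ci*hf*wf rows and n = ho*wo columns, so the convolution becomes a
   (co x k) by (k x n) gemm. Single image, stride 1, no padding. */
static void im2col(const signed char *I, signed char *B,
                   int ci, int hi, int wi, int hf, int wf) {
    int ho = hi - hf + 1, wo = wi - wf + 1;
    int n = ho * wo, row = 0;
    for (int c = 0; c < ci; c++)
        for (int fh = 0; fh < hf; fh++)
            for (int fw = 0; fw < wf; fw++, row++)   /* one row per (c,fh,fw) */
                for (int oh = 0; oh < ho; oh++)
                    for (int ow = 0; ow < wo; ow++)  /* one column per output */
                        B[row * n + oh * wo + ow] =
                            I[(c * hi + oh + fh) * wi + (ow + fw)];
}
```

For a 3×3 single-channel input and a 2×2 filter, B is a 4×4 matrix whose first row holds the top-left entry of each of the four sliding windows.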

Blocked Algorithms for GEMM
With the convolution operator flattened into a matrix multiplication via the im2col (or im2row) transform, this section reviews the conventional strategy to obtain a high performance realization of gemm on current processor architectures with deep cache memory hierarchies and single-instruction multiple-data (SIMD) vector units.

The baseline algorithm for GEMM
Current high-performance implementations of gemm, in both open-source and commercial linear algebra libraries, follow GotoBLAS [7] to formulate this kernel as a collection of five nested loops around two packing routines and a micro-kernel; see Figure 3 (top). In rough detail, the instances of gemm in these libraries apply tiling (blocking) as follows:
- A k_c × n_c block of matrix B is packed into a buffer B_c, intended to reside in the L3 cache memory (or main memory, in case there is no L3 cache); see line 4 in the algorithm.
- An m_c × k_c block of matrix A is packed into a buffer A_c, designated for the L2 cache memory; line 7.
- During the micro-kernel execution (lines 11-14), a specific k_c × n_r block of B_c, referred to as the micro-panel B_r, is expected to lie in the L1 cache memory.
- The micro-kernel performs the arithmetic, in principle accessing the data for A_c from the L2 cache, for B_r from the L1 cache, and for C directly from main memory.
The data transfers across the memory hierarchy are illustrated in Figure 4. In addition, packing A_c, B_c as in Figure 5 ensures that their entries are retrieved with unit stride from the micro-kernel. The baseline algorithm for gemm, also referred to as B3A2C0, features a micro-kernel that includes the sixth loop, iterating over the k_c dimension. This component of the algorithm is the only one encoded directly in assembly or in C with vector intrinsics; see Figure 3 (top). At each iteration of the loop, the micro-kernel updates an m_r × n_r micro-tile of C, say C_r, by performing an outer product involving (part of) one column of a micro-panel of A_c and one row of the micro-panel B_r. Here C_r is a notation artifact, introduced to ease the presentation of the algorithm, while A_c and B_c are actual buffers that maintain copies of certain blocks of A and B.
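The five-loop structure just described can be sketched in plain C. For brevity, this sketch indexes A and B directly instead of packing them into A_c/B_c, and replaces the vectorized micro-kernel with a scalar triple loop over the m_r × n_r micro-tile; all names and block sizes are illustrative:

```c
#include <assert.h>

/* Loop structure of the baseline B3A2C0 algorithm (Figure 3, top),
   simplified: no packing buffers, scalar micro-kernel.
   C (m x n) += A (m x k) * B (k x n), all row-major. */
enum { NC = 8, KC = 8, MC = 8, NR = 2, MR = 2 };

static int min_i(int a, int b) { return a < b ? a : b; }

static void gemm_blocked(int m, int n, int k,
                         const int *A, const int *B, int *C) {
    for (int jc = 0; jc < n; jc += NC)              /* L1 */
      for (int pc = 0; pc < k; pc += KC)            /* L2: Bc packed here */
        for (int ic = 0; ic < m; ic += MC)          /* L3: Ac packed here */
          for (int jr = jc; jr < min_i(jc + NC, n); jr += NR)    /* L4 */
            for (int ir = ic; ir < min_i(ic + MC, m); ir += MR)  /* L5 */
              /* micro-kernel: MR x NR micro-tile of C, L6 over the kc dim */
              for (int i = ir; i < min_i(ir + MR, m); i++)
                for (int j = jr; j < min_i(jr + NR, n); j++)
                  for (int p = pc; p < min_i(pc + KC, k); p++)
                    C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```

In a real library the three innermost levels are the assembly/intrinsics micro-kernel, and the packed buffers make the accesses unit-stride.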

Alternative algorithms for GEMM
By re-arranging the gemm loops in the baseline algorithm in Figure 3 (top) in a distinct order, combined with an appropriate selection of the loop strides, we can obtain different algorithmic variants of gemm, which favor that certain blocks of A, B, C reside in distinct levels of the memory hierarchy [8,9,10]. Concretely, Figure 3 allows a visual comparison between the codes for the B3A2C0 (baseline) and B3C2A0 variants, implicitly exposing the following major differences between the two:
- In B3C2A0, an m_c × n_c block of C is packed into a buffer C_c for the L2 cache; line 7 in the algorithm. Moreover, this variant also requires an unpacking step that moves the entries of C_c back into C once the micro-kernel is executed; line 16.
- In order to ensure accessing the entries of C, B with unit stride from the micro-kernel for B3C2A0, both C_c and B_c are stored following the same pattern shown for A_c in Figure 5, with the entries of C_c arranged into micro-panels of m_r rows and those of B_c into micro-panels of k_r rows.
- The micro-kernel for B3C2A0 operates with an m_r × k_r micro-tile of A, streamed directly from memory into the registers, where it resides during the full execution of the micro-kernel. The micro-kernel performs a small m_r × k_r matrix-vector product per iteration of Loop L6 (for a total of n_c iterations), each involving a single column of a micro-panel C_r and a single column of a micro-panel B_r; lines 11-14.
As we will discuss in Section 4, variant B3C2A0 presents several characteristics that make it especially interesting for its implementation on the GAP8 platform.
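The extra packing/unpacking of C that distinguishes B3C2A0 can be sketched as follows, with C_c laid out in micro-panels of m_r rows whose columns are stored contiguously (our reading of the Figure 5 pattern applied to C; all names are illustrative):

```c
#include <assert.h>

/* Pack an mc x nc block of row-major C (leading dimension ldc) into Cc,
   arranged in micro-panels of MR_ rows, column by column within each panel;
   unpack_Cc performs the inverse movement after the micro-kernel runs. */
enum { MR_ = 4 };

static void pack_Cc(const int *C, int ldc, int mc, int nc, int *Cc) {
    int idx = 0;
    for (int i = 0; i < mc; i += MR_)          /* one micro-panel per step */
        for (int j = 0; j < nc; j++)           /* columns within the panel */
            for (int ii = i; ii < i + MR_ && ii < mc; ii++)
                Cc[idx++] = C[ii * ldc + j];
}

static void unpack_Cc(int *C, int ldc, int mc, int nc, const int *Cc) {
    int idx = 0;
    for (int i = 0; i < mc; i += MR_)
        for (int j = 0; j < nc; j++)
            for (int ii = i; ii < i + MR_ && ii < mc; ii++)
                C[ii * ldc + j] = Cc[idx++];
}
```

With this layout, the micro-kernel reads and writes each column of a C_r micro-panel with unit stride.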

The GAP8 Platform
The GAP8 is a commercial platform designed for IoT applications, with a processor based on the PULP [11] architecture. As depicted in Figure 6, the GAP8 processor embeds three main computing components: 1) a low-power microcontroller unit (MCU), known as the FC, which is responsible for managing control, communications, and security functions; 2) a compute engine (CE) comprising a cluster of 8 compute cores specifically designed for the execution of parallel algorithms; and 3) a specialized hardware accelerator (HWCE) that is part of the CE as well.
The FC integrates a read-only memory (ROM) that stores the primary boot code, plus a private 16-KB L1 scratchpad (also referred to as a memory area or MA). On the CE side, the compute cores and the HWCE share a 64-KB multi-banked Tightly-Coupled Data Memory (TCDM) L1 scratchpad (MA). Moreover, the FC and CE share a 512-KB L2 scratchpad. The device also includes an 8-MB L3 MA that acts as the platform main memory and is accessible from the FC. To enable rapid data transfers between MAs, the platform features two direct memory access (DMA) units. One of these units assists in transferring data between the FC domain and the CE domain, while the micro-DMA unit transfers data to/from peripherals, including the L3 MA.
Both the FC and the cluster cores support the RISC-V RV32IMCXpulpV2 instruction set architecture (ISA), which includes integer (I) arithmetic, compressed instructions (C), multiplication and division (M) extensions, and a portion of the supervisor ISA subset. The XpulpV2 ISA extension also provides specialized instructions for zero-overhead hardware loops, pointer post/pre-modified memory accesses, instructions mixing control flow with computation, multiply/subtract-and-accumulate, vector operations, fixed-point operations, bit manipulation, and the dot product of two vectors.
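The 4-way INT8 dot-product support can be emulated in portable C as follows. On the GAP8 this maps to a single XpulpV2 instruction issued via compiler builtins; the helper name here is ours:

```c
#include <assert.h>

/* Emulation of a 4-way INT8 dot product with accumulation, the primitive
   at the heart of the micro-kernels discussed later in the paper. */
static int sumdotp4(const signed char a[4], const signed char b[4], int acc) {
    for (int i = 0; i < 4; i++)
        acc += (int)a[i] * (int)b[i];
    return acc;
}
```

One such operation performs four multiplications and four additions, which is why casting the innermost gemm computation in terms of dot products pays off on this ISA.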

Tailoring GEMM for the GAP8
In this section we describe our adaptation of gemm to run in parallel on the compute cores integrated into the GAP8 CE. This customization effort is strongly dictated by the following features of the GAP8 platform:
- The compute cores offer special hardware for the dot product.
- The memory hierarchy in the platform is structured into four levels: vector registers, two intermediate scratchpad levels (L1, L2 MAs), and a main memory (also referred to as RAM or L3 MA).
- The system integrates scratchpads instead of conventional cache memories.
- A single FC controls the memory transfers between main memory and the L1, L2 scratchpads.
- The CE features 8 compute cores.
As a first step, we followed the work described in [12], which addressed the sequential implementation of gemm on the FC, modifying that solution to target the compute cores in the CE.The resulting code operates with signed INT8 numbers, and presents the specific features described in the remainder of this section.

Micro-kernel for B3C2A0
A first aspect to note is that, as the FC and compute cores support the same RISC-V-oriented ISA, including the specialized instructions for the dot product and the vector registers, adapting the initial FC micro-kernel from [12] to the cluster cores basically required no changes. To illustrate this, Figure 7 displays a simplified version of a micro-kernel that operates with a 4×4 micro-tile A_r, implementing the innermost loop in the B3C2A0 algorithm (see lines 11-14 in Figure 3, right). The micro-kernel receives as input parameters 1) the starting address in main memory of the micro-tile A_r (parameter Ar), which is assumed to be stored in row-major order; 2) the leading dimension of the matrix operand A (parameter ldA); 3) the starting address of the micro-panel B_r (parameter Br); and 4) that of the micro-panel C_r embedded in C_c (parameter Cc). At this point we note that we have implemented and tested micro-kernels of different "shapes" (or dimensions) m_r × k_r. Also, while the same micro-kernel can basically run on the FC and a single compute core, the data for the matrix operands must be placed into the appropriate MAs, which are different depending on which component (FC or compute core) has to execute the micro-kernel. We discuss this point in detail next.
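The behavior of such a micro-kernel can be mimicked in portable C, replacing the v4s vector registers and dot-product builtins with scalar arrays and loops (a sketch under those assumptions, not the actual Figure 7 code):

```c
#include <assert.h>

/* Plain-C sketch of a 4x4 B3C2A0 micro-kernel: the micro-tile Ar
   (4 rows of 4 INT8 values, row-major, leading dimension ldA) stays
   "resident" while, per iteration of the column loop, one column of Cr
   and one column of Br are loaded, updated via four dot products, and
   the Cr column is stored back. Columns are stored contiguously
   (kr = mr = 4), mirroring the packed layouts of Br and Cc. */
static void ukernel_4x4(int n, const signed char *Ar, int ldA,
                        const signed char *Br, signed char *Cr) {
    signed char A0[4], A1[4], A2[4], A3[4];
    for (int p = 0; p < 4; p++) {            /* load the 4 rows of Ar */
        A0[p] = Ar[0 * ldA + p]; A1[p] = Ar[1 * ldA + p];
        A2[p] = Ar[2 * ldA + p]; A3[p] = Ar[3 * ldA + p];
    }
    for (int j = 0; j < n; j++) {            /* loop L6 over the columns */
        const signed char *br = &Br[j * 4];  /* column j of Br */
        signed char *cr = &Cr[j * 4];        /* column j of Cr */
        int c0 = cr[0], c1 = cr[1], c2 = cr[2], c3 = cr[3];
        for (int p = 0; p < 4; p++) {        /* four dot products */
            c0 += A0[p] * br[p]; c1 += A1[p] * br[p];
            c2 += A2[p] * br[p]; c3 += A3[p] * br[p];
        }
        cr[0] = (signed char)c0; cr[1] = (signed char)c1;
        cr[2] = (signed char)c2; cr[3] = (signed char)c3;
    }
}
```

Each scalar inner loop corresponds to one hardware dot-product instruction on the GAP8.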

Data transfers across the memory hierarchy
The GAP8 platform integrates L1 and L2 scratchpads, which give the programmer full control over the data transfers across the memory hierarchy, but also the responsibility to orchestrate them. Our solution targeting the compute cores in the CE addresses this task by embedding the data movements naturally into the B3C2A0 algorithm as follows:
- The FC packs B into the buffer B_c, with both data operands residing in the main memory. Thus, this data movement only involves the "hardware" in the MCU (scratchpads and core).
- The FC packs the data for C into C_c, in this case transferring it from the main memory to the L2 MA in the MCU. The transfer for the unpacking is also governed by the FC, but obviously carried out in the opposite direction. Again, these copies only involve the MCU part of the GAP8.
- The FC copies the micro-panel B_r from B_c, from the L2 MA in the MCU to the L1 MA in the CE.
- The compute core that executes a micro-kernel expects that the data for B_r resides in the L1 MA, for C_c in the L2 MA, and for A in the main memory. However, the CE cannot directly access the data in the main memory, and therefore the FC copies the appropriate micro-tiles of A from there to the L1 MA in the CE. Next, inside the micro-kernel, each core streams the data of its own micro-tile from the L1 MA to its vector registers.
The data movements required for the collaboration of FC and compute cores are graphically illustrated in Figure 8.Although the plot there displays the execution of the parallel algorithm, to be discussed next, the data movements in the case of the sequential algorithm are basically the same, and can be derived by considering the movements involving Core 1 only.
To close this subsection, we recall that the loop strides for the B3C2A0 algorithm are set to n_c, k_c, m_c, m_r, k_r (respectively for loops L1, L2, ..., L5; see Figure 3, right). The last two variables, m_r, k_r, determine the shape of the micro-kernel and are usually adjusted depending on the number of vector registers per core. The first three variables are known as the cache configuration parameters, and they should be set according to the dimensions of the L1, L2, and L3 memory levels as well as the shape of the micro-kernel. For a sequential algorithm targeting a single compute core of the GAP8 platform, we have to take into account that

k_r · n_c + m_r · k_r ≤ C_L1   and   m_c · n_c ≤ C_L2,

where C_L1, C_L2 respectively denote the capacity of the L1, L2 MAs accessed by the compute core.
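These constraints, as we read them from the data-placement discussion above (the micro-panel B_r plus one micro-tile A_r must fit in the L1 MA, and the block C_c in the L2 MA; capacities in bytes, one byte per INT8 element), can be checked with a trivial helper:

```c
#include <assert.h>

/* Feasibility check of the cache configuration parameters for the
   sequential B3C2A0 algorithm on a single compute core. The inequalities
   are our reading of the constraints stated in the text. */
static int fits_sequential(int mr, int kr, int mc, int nc,
                           int CL1, int CL2) {
    return (kr * nc + mr * kr) <= CL1   /* Br plus one Ar micro-tile in L1 */
        && (mc * nc) <= CL2;            /* Cc block in L2 */
}
```

In practice such a check (with margins for the remaining workspaces) guides the choice of n_c, k_c, m_c once m_r, k_r are fixed by the register count.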

Parallelization
Following the conventional approach for the multi-threaded realization of gemm, we exploit loop parallelism for the B3C2A0 algorithm. The first question thus is which of the six loops appearing in the algorithm to target. In order to take this decision, we make the following observations about the code in Figure 3 (right):
- Parallelizing Loop L1 (indexed by jc) partitions the operation into a collection of independent gemm kernels. The consequence is that this requires separate workspaces per thread for the C_c, B_c, B_r buffers in each memory level, in practice dividing the capacity of these memories among the threads. Given that the compute cores in the CE share the L1, L2 MAs and, obviously, the main memory, this does not seem the best approach for an efficient collaboration.
- A parallel algorithm targeting loops L2 (indexed by pc) or L4 (indexed by pr) faces race conditions because multiple threads may then update the same parts of the output matrix at the same time. It is possible to control this type of behavior using various software techniques (and, in some cases, hardware mechanisms), but in general they introduce a non-negligible overhead.
- Parallelizing loop L3 (indexed by ic) would require a separate buffer per thread for C_c, B_r. Therefore, for the same reasons as exposed for loop L1, this does not seem a good option from an inter-core collaboration perspective.

Fig. 8: Data movements in the parallel version of the B3C2A0 algorithm. For simplicity, the data transfers corresponding to the streaming of the columns c_r^0, c_r^1, ..., c_r^7 from the corresponding 8 micro-panels of C_c into the processor registers are annotated with arrows only for Core 1. The same applies to the streaming (replication) of the column b_r from the micro-panel B_r, and the streaming of the 8 micro-tiles A_r^0, A_r^1, ..., A_r^7 from A.
This analysis leaves only loops L5 (indexed by ir) or L6 (indexed by jr, inside the micro-kernel) as potential candidates. To increase the granularity of the workload distribution and reduce thread synchronization, we therefore choose the outermost option: loop L5. The memory target of the distinct operands and buffers and the data movements for the parallel algorithm are illustrated in Figure 8. Note that, with the selected parallelization scheme, all the compute cores access the same m_c × n_c buffer C_c in the L2 MA and the same k_r × n_c micro-panel B_r in the L1 MA, but a different m_r × k_r micro-tile A_r in the L1 MA. For this reason, in the parallel case, the micro-kernel shape and cache configuration parameters must satisfy

k_r · n_c + c · m_r · k_r ≤ C_L1   and   m_c · n_c ≤ C_L2,

where c specifies the number of cores that participate in the parallel execution.
To fully leverage the capabilities of the GAP8 processor and improve the overall performance, we distribute the iteration space for loop L5 evenly across the 8 RISC-V compute cores in the GAP8 CE. Figure 9 displays the fragment of the parallel code that comprises loop L5 plus the invocation to the micro-kernel. The main differences with the sequential counterpart are:
- All the cluster compute cores iterate over the ir loop, but each core only executes its "own" iterations. The workload is distributed following a simple round-robin policy with "chunks" of m_r rows (see lines 10-11 in the algorithm).
- The core in charge of executing a given micro-kernel instructs the FC to copy the m_r × k_r micro-tile A_r from the main memory to the L1 MA (lines 14-18).
- The core then executes the micro-kernel by invoking the sequential code presented earlier (line 21), and prepares the variables and address pointers for the next iteration (line 23). Recall that, from inside the micro-kernel, the core streams A_r from the L1 MA to its vector registers.
- Synchronization points are included in lines 17 and 25. The former ensures that the data is already copied into the L1 MA, while the latter is for overall thread synchronization.

We finally come back to the convolution operator to note that, in the case of the im2col transform, the matrix Â contains the operator filters and, for inference, this tensor remains constant. The same applies to B in the case of the im2row transform. In the next section we describe how we can take advantage of this to pre-pack the filter tensor and eliminate/accelerate some of the data transfers for the B3C2A0 algorithm.
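The round-robin ownership rule for loop L5 (chunks of m_r rows cycled over the 8 cores) boils down to a one-line mapping (an illustrative helper, not the paper's code):

```c
#include <assert.h>

/* Which core owns the L5 iteration starting at row ir, for chunks of mr
   rows distributed round-robin over NCORES cores. */
enum { NCORES = 8 };

static int owner(int ir, int mr) {
    return (ir / mr) % NCORES;       /* chunk index modulo core count */
}
```

Each core thus skips every iteration whose chunk index does not map to its own identifier, which matches the description of lines 10-11 in Figure 9.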

Performance Analysis
In this section we evaluate the performance of the convolution operator based on the lowering approach, discussing the differences between the im2col and im2row variants.

Table 1: Parameters of the convolution layers arising in MobileNet-v1 and dimensions of the gemm obtained with the application of the im2col transform. For im2row, the values of columns m and n are swapped. Layer 28 boils down to a matrix-vector product and, therefore, is omitted from the experiments. For all layers, the stride and vertical/horizontal paddings equal 1.

Setup
We have evaluated the performance of our parallel gemm-based convolution on the GAP8 platform using a real DL model and INT8 arithmetic. Specifically, we ran the inference phase for the convolutional layers in the MobileNet-v1 DNN, setting the input batch size b to 1 (i.e., a single-input scenario). For this purpose, we pre-process the convolution operators using either the im2col or the im2row transform, obtaining gemm kernels of the form Ĉ = Â · B that operate with augmented matrices of different dimensions; see Section 2 and Table 1.
The results reported in this section are averaged over a 10-second execution of each experiment. We have implemented and tested micro-kernels of varying dimensions m_r × k_r. For brevity, for each convolution operator we only report the results obtained with the best-performing micro-kernel. For layers 1-2, 4-7, and 11-27 of MobileNet-v1, this corresponds to a micro-kernel with m_r × k_r = 4 × 24. For layers 3 and 8-10, the best micro-kernel was m_r × k_r = 4 × 20.

Preliminary analysis
As a starting point for our analysis, Figure 10 breaks down the time spent in layer 10 of MobileNet-v1 into the different components of the algorithm. For reference, we first analyse the transfer costs and how they impact the efficiency of the sequential implementation. For brevity, we focus this study on layer 10 and the im2col transform. From Figure 10, we observe that the arithmetic cost for this layer is 3.73 s, which corresponds to a sustained rate of about 247 MOPS. In comparison, the transfer costs amount to 3.70 s, which yields an efficiency close to 50% when we run the algorithm on a single core of the CE.
Let us turn our attention next to the parallel implementation. From the plots in Figure 10, it is clear that the first two components, Arithmetic and Stream Cc, significantly benefit from a parallel execution. In contrast, there is a different behavior for many other components, with a very small decrease of the execution time when using 4 compute cores and a negligible benefit for 8 compute cores. The explanation in all these cases is common: those components that are executed inside the loop that is parallelized in our solution (loop L5), and that in addition do not require the participation of the FC, truly run in parallel and consequently are accelerated in a multi-threaded execution. This is the case of Arithmetic, Stream Cc, Stream Br, and Copy Br. In comparison, the remaining components require the participation of the FC (basically to program the DMA transfers from/to L3) and, therefore, they are intrinsically sequential as there is only one FC. For the sequential execution, the cost is dominated by Arithmetic, followed by Stream Cc and, for the im2row variant, Stream A. The lack of parallel scalability of the latter component for im2row exerts a strong impact on the cost of the parallel executions for that variant.
A significant difference between the costs of the im2col and im2row variants is visible for Stream A. The reason is mainly that this cost is proportional to the dimension m of the gemm associated with this layer: 256 for im2col and 784 for im2row (see Table 1). In addition, for im2col the filter matrix corresponds to the gemm matrix operand Â. Therefore, we have accelerated this type of transfer, from the main memory to the L1 MA, by 1) pre-packing the operand, so that the m_r × k_r elements of each micro-tile lie in contiguous positions in main memory; and 2) programming the DMA/FC to copy the (m_r · k_r) elements of the micro-tile with a single call. This allows us to replace the loop in lines 14-18 of Figure 9 with a single call to pi_cl_ram_read + pi_cl_ram_read_wait.
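The pre-packing step can be sketched as follows: the (constant) filter matrix is re-arranged once, off the critical path, into contiguous m_r × k_r micro-tiles so that each tile later maps to a single DMA transfer. Layout assumptions and names are ours; for simplicity, m and k are taken as multiples of m_r and k_r:

```c
#include <assert.h>

/* Pre-pack row-major A (m x k) into Ap so that each mr x kr micro-tile
   occupies mr*kr contiguous bytes, in the order the micro-kernels will
   consume them. Assumes m % mr == 0 and k % kr == 0. */
static void prepack_microtiles(const signed char *A, int m, int k,
                               int mr, int kr, signed char *Ap) {
    int idx = 0;
    for (int i = 0; i < m; i += mr)
        for (int p = 0; p < k; p += kr)       /* one micro-tile per (i,p) */
            for (int ii = 0; ii < mr; ii++)   /* rows within the tile */
                for (int pp = 0; pp < kr; pp++)
                    Ap[idx++] = A[(i + ii) * k + (p + pp)];
}
```

After this re-organization, fetching a micro-tile amounts to reading mr*kr consecutive bytes from main memory, which is what enables the single-call DMA transfer mentioned above.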
Finally, comparing the distributions of costs between the im2col and im2row variants, we observe that there is no cost associated with Pack Bc for the latter. The reason is that, for im2row, the gemm matrix operand B corresponds to the convolution filters, which remain constant during the inference stage. In consequence, we can pre-pack this matrix (and re-utilize it for any number of subsequent inferences) so that the re-organization of the matrix becomes unnecessary. In contrast, for im2row the matrix operand Â corresponds to the augmented matrix that results from applying the transform to the input activation tensor. As these data vary from one sample to the next, we cannot pre-pack it. Therefore, for im2row we cannot benefit from the faster transfers of the micro-tiles between main memory and L1 described in the previous paragraph.

Fig. 11: Performance attained for the convolutional layers in MobileNet-v1 using im2col+gemm (top) and im2row+gemm (bottom). The results with one core display the arithmetic rate (in millions of INT8 operations per second, or MOPS) observed with the "sequential" algorithm, executed using a single compute core and the FC.

Global comparison
Figure 11 shows the performance rates for all layers of MobileNet-v1, attained with the "sequential" version of the two convolution variants (im2col and im2row), as well as the corresponding speedup observed with the parallel counterpart using 4 and 8 compute cores. In general, the sequential version using im2col slightly outperforms the alternative based on im2row for the initial layers (1-11), but it is slightly inferior for the final layers. The reason for the similar behavior is that, in the sequential case, 1) there are no differences for Arithmetic, since both variants obviously perform the same number of arithmetic operations and this component does not include any data transfer cost; furthermore, 2) the differences between the cost of Stream Cc for the two variants are negligible, as this depends on m, n, and these two values are simply swapped between the two variants. In the parallel case, the factor that can make a difference between im2col and im2row is the lack of scalability of Stream A, which has a significant contribution to the execution time for the latter. However, this is compensated by the Pack Bc component for im2col, which contributes a cost that the im2row counterpart does not have to pay.
With respect to the parallel algorithm, on 4 compute cores we observe a maximum speedup of 3.19 for im2col and a slightly higher 3.24 for im2row. With 8 compute cores, the maximum speedup is 5.00 for im2col and, again marginally higher, 5.15 for im2row. The average speedup for im2col is 2.96 on 4 compute cores and 4.39 on 8 compute cores; for im2row, it is 2.71 on 4 compute cores and 4.09 on 8 compute cores.

im2col vs im2row
The previous discussion exposes the small performance differences between im2col and im2row. In practice, the former type of transform is associated with the so-called NHWC layout of the input/output activation tensors, where the four letters (N, H, W, C) respectively specify the ordering in memory of the dimensions (b, h_i/h_o, w_i/w_o, c_i/c_o) of the convolution operator. In comparison, the im2row transform is linked with the NCHW layout. This is relevant because, for im2row, it is possible to concatenate two (or more) consecutive convolutional layers (with a number of element-wise layers in between) so that the output activation tensor of one convolution is directly passed as the input activation to the next one. For im2col, in contrast, the concatenation of consecutive convolutional layers requires re-arranging the output activation tensor in memory prior to passing its data to the next layer. The key here is that the cost of this data re-organization is, in general, not negligible on a platform such as the GAP8.
A potential strategy to accelerate the execution of im2row is to employ an algorithmic variant based on A3C2B0 for gemm instead of B3C2A0. For im2row, the filter matrix corresponds to the gemm matrix operand B and, for A3C2B0, this operand can be pre-packed into micro-tiles in the main memory. Therefore, this solution can benefit from the same fast transfers reported earlier for Â in the im2col variant combined with the B3C2A0 algorithm.

Concluding Remarks
We have proposed an efficient implementation of the convolution operator for the GAP8 PULP heterogeneous multicore processor that leverages the fabric controller for data transfers and the cores in the compute engine for the arithmetic. The GAP8 features a four-level memory hierarchy, with scratchpads instead of conventional caches, plus a fabric controller and 8 compute cores. To target this architecture, our solution transforms the convolution into a gemm via the lowering approach; applies tiling to partition the gemm matrix operands; and orchestrates the data transfers across the memory hierarchy as part of the packing operations in gemm. In addition, the proposed approach formulates the gemm operation as an algorithm where a small block of A (the micro-tile) is resident in the vector registers of each compute core (in order to cast the innermost computation in terms of a dot product), and exploits parallelism from one of the gemm innermost loops to distribute the workload among the eight compute cores.
Our experiments on the platform, using the MobileNet-v1 model and a single-input scenario, show small differences in the execution time and parallel scalability between the im2col and im2row variants, though we expect the latter to be more efficient when considering the concatenation of layers that appears in a DNN, especially if it is integrated into an A3C2B0 algorithm for gemm.
As part of future work, we plan to explore the possibilities of overlapping transfers with computation via double buffering.We expect this will reduce the impact of idle times due to communication on the global performance.However, there is a delicate balance to inspect here as it also reduces the reutilization of the data stored in the buffers, a factor which is relevant due to the small capacity of the scratchpads.

Fig. 1: Direct algorithm for the application of the convolution operator O = Conv(F, I).
The convolution returns the output activation tensor O, of dimension b × c_o × h_o × w_o, where c_o specifies the number of output channels and h_o × w_o are the output image height × width.

Fig. 4: The baseline algorithm of gemm B3A2C0. Here C_r is a notation artifact, introduced to ease the presentation of the algorithm, while A_c and B_c are actual buffers that maintain copies of certain blocks of A and B.

Fig. 5: Packing in the baseline algorithm of gemm B3A2C0. Note how the entries of A, B are re-organized into A_c, B_c in micro-panels of m_r rows and n_r columns, respectively.
Continuing with the description of the micro-kernel in Figure 7:
- The remaining input parameters are 3) the starting address of the micro-panel B_r (parameter Br) in the L1 MA of the CE; and 4) the starting address of the micro-panel C_r embedded in C_c (parameter Cc) in the L2 MA shared with the FC.
- The code for the micro-kernel next includes the corresponding variable declarations of scalar and vector data (lines 4-6). The data type for the latter is v4s, which identifies a vector with capacity for four INT8 numbers.
- The code then loads the four rows (with four INT8 numbers each) of the 4 × 4 micro-tile A_r into the same number of vector registers: A0, A1, A2, A3 (lines 9-10).
- At each iteration of the main loop (line 12), the micro-kernel loads one column of the micro-panel C_r (four INT8 numbers) into the vector register cr and one column of the micro-panel B_r (four INT8 numbers) into the vector register br (lines 14-15).
- Inside the loop, the micro-kernel then proceeds to multiply the contents of the micro-tile A_r with the column of B_r, updating the column of C_r via four dot products (lines 18-19), and storing the four INT8 elements in the column of C_r back into the L2 MA (lines 22-23).

Fig. 7: Simplified realization of the sequential implementation of a micro-kernel with an m_r × k_r = 4 × 4 micro-tile of A resident in the processor (FC or compute core) registers for the B3C2A0 variant of gemm.

Fig. 9: Simplified realization of the parallel implementation of loop L5 for the B3C2A0 variant of gemm.
Figure 10 breaks down the time spent in layer 10 of MobileNet-v1 into the different components of the algorithm:
- Arithmetic (by compute cores),
- Stream Cc (from L2 to registers, by compute cores),
- Stream Br (from L1 to registers, by compute cores),
- Stream A (from L3 to L1 by the FC, and from there to registers by compute cores),
- Copy Br (from L3 to L1, by compute cores),
- Pack Cc (from L3 to L2, by the FC),
- Unpack Cc (from L2 to L3, by the FC), and
- Pack Bc (from L3 to L3, by the FC);
see Section 4 and Figure 8. The figure displays two plots, for the im2col- and im2row-based convolution variants, and reports the execution time (in seconds) using 1, 4 and 8 compute cores.