Vectorizing posit operations on RISC-V for faster deep neural networks: experiments and comparison with ARM SVE

With the arrival of the open-source RISC-V processor architecture, there is the chance to rethink Deep Neural Networks (DNNs) and information representation and processing. In this work, we will exploit the following ideas: i) reduce the number of bits needed to represent the weights of the DNNs using our recent findings and implementation of the posit number system, ii) exploit RISC-V vectorization as much as possible to speed up the format encoding/decoding, the evaluation of activations functions (using only arithmetic and logic operations, exploiting approximated formulas) and the computation of core DNNs matrix-vector operations. The comparison with the well-established architecture ARM Scalable Vector Extension is natural and challenging due to its closedness and mature nature. The results show how it is possible to vectorize posit operations on RISC-V, gaining a substantial speed-up on all the operations involved. Furthermore, the experimental outcomes highlight how the new architecture can catch up, in terms of performance, with the more mature ARM architecture. Towards this end, the present study is important because it anticipates the results that we expect to achieve when we will have an open RISC-V hardware co-processor capable to operate natively with posits.


Introduction
In the latest years, RISC-V has started to emerge as an open-source alternative CPU architecture [4,7,27]. Being it also royalty-free, it is the rising star competitor of Intel, AMD and ARM CPUs (both for 32 and 64-bit variants). Important software and hardware industries have endorsed and funded the project, including Intel, Microsoft and ST Microelectronics [6].
The main feature of RISC-V is its open instruction set architecture. This means that any user can extend it by adding his own instructions and functionalities: this possibility is strategic to design very low-latency co-processors and accelerators without having to treat them as external devices with memory mapping and interrupts. Furthermore, with the latest advancements of the vector extension development, RISC-V processors are able to accelerate the processing of several kernels for machine and deep learning (e.g., dot products, vector-matrix multiplications and image filtering). Lately, several real number representations have been proposed by industry and research such as Intel with Flexpoint [23,26], Google with BFLOAT16 [8] and Facebook AI [22]. Another very promising alternative to IEEE 32-bit Floating-point standard is the posit TM number system, proposed by Gustafson [19]. This format has been proven to match single precision accuracy performance with only 16 bits used for the representation [9,12,16,17,24]. Furthermore, the first hardware implementations of this novel type are very promising in terms of energy consumption and area occupation [10,20,28].
In this work, we envision the adoption of these two disruptive innovations (vector extension and posit arithmetic) within the same architecture. Our ultimate goal is to extend RISC-V to be able to use a Posit Processing Unit (PPU) as a co-processor, by extending the processor instruction set architecture (ISA). While going towards this end, we can anyway gain great benefits from the posit format. It is of particular interest knowing the potential benefits of posit-based co-processor in a killer application such as Deep Neural Networks.
In this work, we assess the quality of the vectorization of RISC-V operations when using posit numbers and we compare it with ARM SVE. The paper is organized in the following way. In Sect. 2, we briefly present an overview of the posit format, along with recent improvements and findings achieved at University of Pisa, in collaboration with MMI spa, on information processing using posit numbers. In Sect. 3, we summarise the core aspects of the RISC-V architecture. In particular, we focus on the RISC-V vector extension, showing the principal component used in the rest of the work. In Sect. 4, we present the implementation of vectorized posit operations inside the cppPosit library (a C?? Posit library developed and maintained by the authors) for RISC-V, following the same approach of our previous implementation of the ARM SVE vectorized operations [14]. In particular, we focus on the implementation of posit encoding and decoding from/to the floating-point format. In Sect. 5, we illustrate the benchmarks used to test our implementation. The tests represent different core operations of DNNs, including convolutions, dot-products and matrix-matrix multiplications as well as activation functions. In particular, the proposed activation functions are fast approximated versions of the real ones; these functions can be computed just by an arithmetic-logic unit, thus being highly vectorizable. In Sect. 6, we outline our vision to enable hardware accelerators for posit operations for a RISC-V processor focusing on the latency and implementation complexity of the different solutions. Finally, we also present a comparison with ARM SVE and future works. In Sect. 7, we draw some conclusions.

Posit arithmetic
The posit format [11,12,16,19] is a configurable fixed length format for real number representation; the format configuration involves the number of overall bits (nbits) and the maximum number of exponent bits (esbits).

Format overview
As shown in Fig. 1, a posit number is composed by a maximum of 4 fields: i) sign (1-bit), ii) regime (variable length), iii) exponent (maximum of esbits) and iv) fraction (variable length). Note that posits are encoded using 2's complement. The regime field is a particular one; its length is identified by a series of bit equal to 1 (or 0) terminated by a stop-bit with the opposite value. The value of the regime field is then the number of equal bits in the so discovered bit-string.
Expression (1) shows the mathematical relation between the posit bit content (v) and the represented value (x). k is the regime value computed as described before and useed ¼ 2 2 esbits . e and f are, respectively, the exponent and fraction values decoded from the base-2 representation (Fig. 2).

Advantages over IEEE 32-bit Floats
As widely described in [18,19] several issues are afflicting the IEEE Float32 format that are addressed by this novel format: • Waste of bit patterns: IEEE Float32 wastes millions of patterns for NaN values; • Mathematically incorrect: two representation of 0 (AE0); • Non-configurable accuracy: pre-determined number of exponent and mantissa bits; • Being a more nonlinear and compressed representation, it allows more powerful bit manipulations (changing the bit string of a posit in an appropriate way, using the ALU alone, can lead to interesting nonlinear transformations of the posit number itself).
The application of posit numbers to Deep Neural Networks (DNN) has been independently proven to this authors and others, to perform as good as float numbers [15,25] with half the bits (or even less), as reported in Table 1. The table reports the comparison between different configurations of posit with a 32-bit float, on the German Traffic Sign Recognition Benchmark (GTRSB) dataset. In the Table 1, a 10-bit posit arithmetic allows for the same detection and classification accuracy of 32-bit float, while

No exponent bit case
As shown by the authors in [13], the posit format gains interesting properties when configured with esbits ¼ 0. This particular configuration allows the implementation of fast versions of common operations (a little approximation is introduced in some of them); the new versions can be computed just by using the arithmetic-logic unit (ALU) of the CPU since they only involve bit manipulation and integer operations. Among the basic operations that can be accelerated this way, there are the double and half operators (2x and x/2), the inverse operator (1/x) and the one's complement (1 À x). Furthermore, some common use activation functions for Deep Neural Networks (DNNs) can be implemented this way. The basic building block is represented by the fast approximation of the Sigmoid function (from [19]). Starting from this one, we developed the others as a simple combination of the previous operators and the Sigmoid.
What we obtained is the fast approximation of the hyperbolic tangent (FastTanh [11,13]): and the fast approximation of the extended linear unit (FastELU [13]): The possibility to implement several operations as simple ALU instructions leads to some interesting aspects: • It allows the ALU emulation of posit operations without specific hardware support. • It allows an operator to be implemented as a sequence of ALU instructions; this means that every operation implemented this way can be easily vectorizable.

Past achievements concerning posit-based DNNs
In previous work, we have been able to: • develop our posit software library (cppPosit, see Sect. 4); • implement fast approximated activation functions, only possible when using posits, which exploit the ALU (no PPU necessary for it): [13,15]; • fast matrix-vector multiplication, thanks to vectorization (demonstrated on ARM CPUs, using SVE: [14]).
In this paper, we aim the present what we did to obtain the same achievements, but this time on an open processor, the RISC-V.

The RISC-V architecture
The RISC-V [4] architecture is a modular, open-source and royalty-free instruction set architecture (ISA) and comprises both 32 and 64-bit flavours. The overall ISA is composed of smaller sub-ISAs among which there are the base subsets. These subsets are referred as base integer instruction sets and identified by the letter I. Besides, a RISC-V-based architecture can present some other extensions; some of them are referred to as frozen. This means that their encoding and behaviour has been ratified and will not change during the current draft of the architecture.

The RISC-V vector extension
A very interesting and under development extension is the vector one (V). This extension aims to provide single-instruction multiple-data (SIMD) capabilities to the RISC-V architecture. By design, this extension can seamlessly exploit either the CPU registers or a special vector coprocessor for hardware acceleration. Any RISC-V-based The associated real value to the shown posit is: þ256 À9 Á 2 6 Á ð1 þ 2=4Þ ¼ 2:0329 Á 10 À20 : architecture implementing this extension will define some parameters: • Number of vector registers (standard is 32) • vlen: size (in bits) of the vector registers (e.g., 256) • elen: maximum supported size for a single element (e.g., 64 for a 64-bit integer or double) The idea behind the vector extension is the same of the ARM scalable vector extension (SVE) architecture [1]; there is not a predetermined vector length (as happens in the Intel SIMD extensions) but a special instruction vsetvl. This instruction takes as input a requested vector length vreql and returns the granted vector length vgrant as in next expression: This design allows porting an application between RISC-V architectures, without re-writing a single line of code and, in case of furtherly compatible architectures, without recompilation. Moreover, this will help us later when simulating the same program with different vector configurations.

The cppPosit library
The support for posit arithmetic is offered by the cppPosit library. This library has been developed in Pisa, and it is maintained by the authors of this work. The library uses advanced templatization techniques from C??14 to ease the definition of posit configurations at compile time. Posit operations are classified into four different levels (L1-L4) with increasing computational complexity [13]. The simplest and fastest level is called L1 and comprises all the operators described in Sect. 2.3. The library also offers three back-ends to rely on for posit operations that cannot be emulated via ALU: •

Posits and RISC-V vectorization
Vectorization of posit operations was already proved to be interesting in the ARM SVE environment in [14]. As shown in that work, each function aimed at vectorizing posit operations has three main parts: i) prologue, ii) body, iii) epilogue.
In the prologue, we need to prepare the vector containing posit data referring to the underlying vector architecture. This phase varies whether we are implementing L1 function or not. In the former case, the prologue is just a reinter-pret_cast from the posit type to the underlying integer holder type. Instead, when dealing with non-L1 function, this phase is also devoted to the conversion from posit type to a suitable back-end type (e.g., for RISC-V we choose to convert it to IEEE Float32). In the body, there is the actual implementation of the vectorized operation. Finally, in the epilogue, we need to build back the posit vector (the same considerations of the prologue hold here).
An important focus must be put on the epilogue and prologue phases: since these phases employ posits with an overall size of 16 or 8, we are performing a data compression by a factor 2 or 4. Moreover, even if we use 32-bit floats to perform computations, the data expansion is performed only inside the vector processing unit. Therefore, we are transferring compressed data from the scalar CPU registers to the vector ones. We can thus transfer from twice to quadruple the data with a single vector load instruction. This means that just by using posits as a compressed information storage can lead to great optimizations in data transfer. Figure 3 shows the overall idea behind posit vectorization in cppPosit.

Experimental results
In this section, we will present the benchmarks for the novel RISC-V-''V'' vectorized approach on several benchmark kernels. Instead of relating our benchmarks to a particular DNN library or implementation, we decided to present the benchmarks on simple and core building blocks for DNNs.
Firstly, we will present benchmarks of vectorized L1 activation functions such as sigmoid, hyperbolic tangent and extended linear unit. Secondly, we will present benchmarks on matrix and vector operations such as dot product, convolution and general matrix-matrix multiplication.
The RISC-V benchmark binaries were generated using the Barcellona Supercomputing Centre (BSC) LLVM cross compiler (clang?? 11.0). As for now, this is the only compiler providing high-level intrinsics for the RISC-V vector extension [3]. The RISC-V binaries were then executed on the RISC-V Spike simulator (riscv-isa-sim RVV version 0.8) [5].
The ARM benchmarks binaries were generated using the upstream branch of the GCC compiler (GCC 10.0) with SVE intrinsic support. The examples were executed on a static, user-space QEMU (QEMU 5.0) installation for ARMv8.2 that supports SVE instructions.
All the simulations were executed on a 8 core Intel(R) Core(TM) i7-9700 CPU with 3.00GHz base frequency.
For each benchmark result, we analyzed the timing performance of prologue, body and epilogue to address the overhead that is introduced by the first and the last part (as already discussed in Sect. 4.1. Moreover, we compared the vectorized performance varying the vector length against the non-vectorized implementation (called naive from now on).

Vectorization of posit encoding and decoding
Since we could not rely on auto-vectorization due to compiler limitations there was the need to provide a vectorized implementation of posit decoding and decoding operations for prologue and epilogue blocks. Completely decoding a posit means taking its representing integer and extract the four fields described in Sect. 2.1. This can be accomplished using just arithmetic and logic integer operations. Algorithm 1 shows the steps we used to unpack a posit into its (at most) four fields. Given these fields, we can build a float number.
For the conversion, we want that x P ¼ x F : This implies that: Encoding a posit means taking any real number representation (e.g., IEEE Float32) and computing the integer that represents the posit as described in Sect. 2.1. This can be accomplished using just arithmetic and logic integer operations. Algorithm 2 shows the steps we used to unpack a posit into its (at most) four fields. Once we decode the floating point into its 3 fields we can reason on how to retrieve posit fields. If we look at Eq. (2), we can see that k P is the result of an integer division between es F and 2 esbits and es P þ 127 is the remainder of the integer division. This means that we can retrieve the two fields as: Now, we consider the fraction part. As shown in Eq. (3), retrieving m P is equivalent to a logical right shift of 23 À fracbits on m F , thus obtaining the fracbits most significant bits.
We implemented both algorithms using hand-vectorization for the RISC-V platform, using the ''V'' extension intrinsics. The findLeftMostSet is particularly interesting, since there was not a native vectorized implementation in the RISC-V ''V'' extension at the moment of the development. Algorithm 3 shows our choice for the implementation. We implemented also this algorithm using the RISC-V vectorization intrinsics.

Vectorized activation function benchmarks
In this subsection, we report in Fig. 4 and 5 the results of the benchmarks concerning the vectorization of the activation functions (Sigmoid and ELU, respectively). These benchmarks consisted in the execution of the vectorized activation functions on random vectors of 8192 items varying the vector length and using Posit\16; 0 [ and Posit\8; 0 [ . These results are particularly groundbreaking; we are indeed reducing the processing time of an activation function by a factor 10 at least. This is easily obtained by just applying posit properties. Moreover, there is nothing similar that can be obtained with the floating point format to achieve similar speedup while varying vector length and data size. This is extremely important, since modern neural network architectures can require the computation of nonlinear activation functions like ELU on vectors with up to tens of thousans of elements.

Vectorized matrix-vector operation benchmarks
In this subsection, we show the results concerning vectorized matrix-vector operation benchmarks. The benchmark parameters are the following: • Dot product: timing performance on a dot product between vectors of size 64.
For each benchmark, we report two different types of result: • Performance comparison (Figs. 6, 7, 8): timing performance comparison varying the vector length and disabling vectorization. We used a logarithmic scale to represent these results in order to increase plot readability. • posit decoding/encoding overhead evaluation (Figs. 9, 10, 11): timing performances are decomposed to evaluate the contribution of each part of the algorithm.

Analysis of results and discussions
In this section, we will analyse and discuss the results shown before. Firstly, as reported in Figs. 4 and 5, the speed-up gained by the vectorization of activation functions is impressive in both cases, with vectorized performance being one order of magnitude smaller than the non-vectorized ones. Moreover, it is evident how much the algorithm benefits from increasing the vector length and halving the data size. Note that this speed-up could not be achieved using the IEEE Float32 format; in fact we could apply the series of arithmetic and logic operations thanks to the posit representation with zero exponent bits.
Secondly, speaking of matrix-vector operations, again the speed-up when using vectorized algorithms is massive, reducing processing time by more than 2 orders of magnitude (note the logarithmic scale of the plots in Figs. 5, 6 and 7).
Thirdly, the same benchmarks were executed on the ARM environment explained at the beginning of this section. In particular, we considered the 512-bit vector length for the comparison to match the first hardware ARM SVE platform on the market (the Fujitsu A64FX processor [2]). As shown in Fig. 12, if we compare the vectorization on the  Finally, the newly proposed approach for decoding and encoding vectorization of posits brought a substantial speed-up in the prologue and epilogue phases. As we can see from Figs. 9, 10 and 11 the vectorized phases benefit from increasing the vector register lengths. However, there is still some overhead that afflicts the epilogue part due to costly conversions between 32-bit integers (for Float representation) and posits holder integers (16-bit integers in this case).

Future work
Having a dedicated posit processing unit is critical to increase the architecture performance removing the software emulation bottleneck. There are three main ways to equip a RISC-V CPU with a hardware PPU: 1. include the PPU within the processor [21]. This requires the introduction of an additional instruction set and the instrumentation of the compiler. 2. use the PPU as a slave peripheral. The peripheral can be an external FPGA board connected via PCI Express bus. This is similar to how GPUs are connected to CPUs nowadays. This solution has the highest latency, since we need to communicate with an external peripheral via bus communication. 3. design the PPU as an IP-core to be included within the processor chip. This can be viewed as a co-processor approach, thus having PPU and CPU on the same SoC (system-on-chip). In terms of latency, this solution is an intermediate one.
We are currently working to implement a PPU for RISC-V following option 1. However, even without a hardware PPU, the solution proposed in this paper (vectorized posit operations, emulated on ALU of FPU) is still of interest in particular situations, as discussed below. Figure 13 shows some computing environments, with increasing computing power, cost and energy consumption: i) micro-controllers, ii) CPUs without the FPU, iii) CPUs with the FPU, iv) Many Core CPUs without GPUs, and v) Multi (or Many) Core CPUs with GPUs. With ''Many Core'' CPU (MaCPU), we designate a CPU having more than 100 cores, while with ''Multi Core'' CPUs we indicate those having up to 100 cores.
The solution proposed in this work is of interests for cases ii), iii) and iv). Indeed, In case ii) the use of ALUemulated Posit-DNNs is particularly interesting and justified (see [13,15]). Also in case iii), the approach proposed here is a viable and appealing solution, provided that the CPU supports vectorization. Case iv) is the situation where the solution proposed here is expected to obtain the best speedup, since we can exploit the massive data and instruction parallelism introduced by many core architectures to increase the DNN processing capabilities.
In all the cases, the use of posit numbers can at least halve (when using posit\16; x [ ) the representation size, thus doubling the bandwidth of the information and improving the usage of the cache. Moreover, reducing the information size is extremely useful when combined with the vector engines. With a half of the element size, we can fit double the elements inside a vector register, as demonstrated in [14] for ARM CPUs and here for RISC-V CPUs. Finally, when posit\8; x [-based DNN reaches a satisfying accuracy, the benefit with respect to 32-bit floats is much more impactful.

Conclusions
In this paper, we presented the implementation of posit vector operations for DNNs using the RISC-V open-source hardware platform. We proposed an extension of our cppPosit library enabling dot-product, matrix-matrix and convolution operations in the RISC-V Vector extension environment. Moreover, we provided the implementation of fast approximated activation functions only using vectorized integer arithmetic and logic. As reported in the experimental results, we gained a significant speed-up from the hand-vectorization of posit operations, including the encoding and decoding of the novel format to the standard IEEE 32-bit floating-point format. Furthermore, we managed to catch up the ARM SVE results when vectorizing the same operations. The promising results may indicate that open-source hardware platform like RISC-V, along with open-source DNN software implementations, may  Acknowledgements The present study has been possible thanks to the support of the team of the Barcelona Supercomputing Center dedicated to RISC-V extension. We have conducted this work with their invaluable support, as a common partner of the same H2020 EPI project. To them goes our most sincere appreciation.
Funding Open access funding provided by Università di Pisa within the CRUI-CARE Agreement..

Compliance with ethical standards
Conflicts of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/. Fig. 13 Representation of different computing environment, with increasing computational power. In some configurations above, the FPU can be absent (to save energy and/or production costs as with the ARM Cortex Series M). So that the use of posit with an ALU backend is an attractive solution. There are also situations where the FPU is present but the GPUs are absent: again, in such situations (e.g., ARM Cortex Series A), the use of ALU-emulated (or even FPU-emulated) posits is a viable solution to speedup DNNs or save memory. In some situations, the user can have a CPU with a high number of cores (they are more and more affordable, today-the AMD EPYC 7000-Series, for instance, can have 128 cores) and no GPUs: even this one is an interesting case where ALU/FPU-emulated posits can speedup DNNs. Of course, having a hardware PPU provides higher performance levels and would allow the use of DNNs of all the types, without restrictions, as discussed in Sect. 6. In the picture, MaCPU stands for many-core CPU, i.e., a CPU with more than 100 cores